RAL Tier1 weekly operations castor 20/10/2014

Operations News

  • xrootd security advisory issued, concerning the FAX component within xrootd
  • SL6 headnode work progressing well - tested in vcert2, hoping for a test in castor vcert next week and production at the end of November.
  • Successfully moved Castor atlas/gen stager/srm back to primary db following EMC cache battery replacement - process documentation has been improved to make this smoother in the future

Operations Problems

  • GDSS720 - currently draining with a view to finishing over the weekend 18/19 Oct and handing over to Kashif next week (to replace LSI RAID card with LSI engineer).
  • Still having difficulty deploying 5 servers into LHCb production because of a Mellanox network card or quattor issue - fabric to fix
  • Duplicates in the SRM userfile DB tables are still occurring, though less frequently over the last couple of days – test files only
  • GDSS648 was out of production for several days, thought to be related to the 'Mellanox network' issue.
  • Atlas load: high I/O wait on several servers – currently investigating whether this can be alleviated by some castor tuning and whether that would actually help the VO.

Blocking Issues

  • GridFTP bug in SL6 - stops any globus copy if a client is using a particular library. This is a show stopper for SL6 on disk servers.
  • LHCb ‘nonprod’ disk servers – still outstanding / waiting on James/fabric [some were having the same issue as the 673 machine (Mellanox network card issues)]

Planned, Scheduled and Cancelled Interventions

  • A Tier 1 Database cleanup is planned so as to eliminate a number of excess tables and other entities left over from previous CASTOR versions. This will be change-controlled in the near future.
  • Juan to further patch the castor DBs (PSU patches for Pluto and Juno) – standard change ... TBC
  • 2.1.14-14 stress testing in preprod along with new errata

Advanced Planning

Tasks

  • Plans to ensure PreProd represents production in terms of hardware generation are underway
  • Possible future upgrade to CASTOR 2.1.14-15 post-Christmas
  • Switch admin machines: from lcgccvm02 to lcgcadm05
  • A new VM, configured to run against the standby CASTOR database, will be created as a front-end for dark data etc. queries.
  • Replace DLF with Elasticsearch
  • Correct partitioning alignment issue (3rd CASTOR partition) on new castor disk servers

Interventions


Staffing

  • Castor on Call person
    • Rob


  • Staff absence/out of the office:
    • Juan Monday-Wednesday
    • Bruno Wednesday-Friday and the following 2 weeks