RAL Tier1 weekly operations castor 20/10/2014
From GridPP Wiki
Revision as of 12:09, 17 October 2014 by Christopher Prosser
Operations News
- xrootd security advisory affecting the FAX component within xrootd
- SL6 headnode work progressing well - tested in vcert2, hoping to test in CASTOR vcert next week and move to production at the end of November.
- Successfully moved the CASTOR atlas/gen stager/srm back to the primary DB following the EMC cache battery replacement - the process documentation has been improved to make this smoother in future
Operations Problems
- GDSS720 - currently draining, with a view to finishing over the weekend of 18/19 Oct and handing over to Kashif next week (so that the LSI RAID card can be replaced with the LSI engineer present).
- Still having difficulty deploying 5 servers into LHCb production because of a Mellanox network card or Quattor issue - Fabric to fix
- Duplicate entries in the SRM userfile DB tables are still occurring, though less frequently over the last couple of days - test files only
- GDSS648 was out of production for several days, thought to be related to the Mellanox network issue.
- ATLAS load: high I/O wait on several servers - currently investigating whether this can be alleviated by some CASTOR tuning and whether that would actually help the VO.
Blocking Issues
- GridFTP bug in SL6 - stops any globus copy if the client is using a particular library. This is a show-stopper for SL6 on disk servers.
- LHCb ‘nonprod’ disk servers - still outstanding, waiting on James/Fabric (some were having the same Mellanox network card issue as the 673 machine)
Planned, Scheduled and Cancelled Interventions
- A Tier 1 database cleanup is planned to eliminate a number of excess tables and other entities left over from previous CASTOR versions. This will be change-controlled in the near future.
- Juan to apply further patches to the CASTOR DBs (PSU patches for Pluto and Juno) - standard change ... TBC
- 2.1.14-14 stress testing in preprod along with new errata
Advanced Planning
Tasks
- Plans to ensure PreProd represents production in terms of hardware generation are underway
- Possible future upgrade to CASTOR 2.1.14-15 after Christmas
- Switch admin machines from lcgccvm02 to lcgcadm05
- A new VM, configured to run against the standby CASTOR database, will be created as a front end for dark data and similar queries.
- Replace DLF with Elasticsearch
- Correct the partition alignment issue (3rd CASTOR partition) on new CASTOR disk servers
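For the partition alignment task, a minimal sketch of the kind of check involved is given here. It assumes 512-byte sectors and a 1 MiB alignment boundary, which are common conventions and not values taken from these minutes; the real boundary depends on the RAID stripe size of the disk servers. The function names are hypothetical.

```python
def is_aligned(start_sector, sector_size=512, boundary=1024 * 1024):
    """Return True if the partition's starting byte offset is a multiple of `boundary`.

    sector_size and boundary are assumed defaults, not values from the minutes.
    """
    if boundary % sector_size != 0:
        raise ValueError("boundary must be a multiple of sector_size")
    return (start_sector * sector_size) % boundary == 0


def next_aligned_sector(start_sector, sector_size=512, boundary=1024 * 1024):
    """Return the first sector at or after `start_sector` that sits on the boundary."""
    if boundary % sector_size != 0:
        raise ValueError("boundary must be a multiple of sector_size")
    sectors_per_boundary = boundary // sector_size
    # Ceiling division without floating point.
    return -(-start_sector // sectors_per_boundary) * sectors_per_boundary


if __name__ == "__main__":
    # Sector 63 is the legacy DOS start offset; sector 2048 sits on a 1 MiB boundary.
    for sector in (63, 2048):
        print(sector, is_aligned(sector), next_aligned_sector(sector))
```

The start sector of each partition would come from the partitioning tool's output on the server in question; the check itself is only the modulo arithmetic above.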
Interventions
Staffing
- CASTOR on-call person
- Rob
- Staff absence/out of the office:
- Juan Monday-Wednesday
- Bruno Wednesday-Friday and the following 2 weeks