RAL Tier1 weekly operations castor 07/11/2011
From GridPP Wiki
- Preprod now working with stg, lsf and dlf headnodes quattorized
- The new CIP which fixes the tape capacity over-reporting was successfully tested and was rolled out to production today
- ATLAS outage from Oct 29-31 caused by a bad execution plan. Later fixed by a hint provided by Nilo. PM: https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20111031_Castor_ATLAS_Outage
- On Thursday a problem with the new setup of xroot for aliceDisk was identified and fixed. ALICE were unaffected.
- On Friday morning a second Pluto node was lost after a motherboard burnout.
- On Friday morning the CMS jobmanager activity ceased for approx. 1 hour. The strace script picked this up, but its output did not provide any useful information. The script has been changed so that it now actively restarts the daemon if it is triggered again.
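The restart-on-trigger behaviour described for the jobmanager script can be sketched roughly as below. This is a minimal illustration, not the script in use at RAL: the notes do not say how the script detects that activity has ceased, so the sketch assumes a stale log file is the trigger. The names `is_stalled`, `check_once`, `STALL_SECONDS` and the restart command are all hypothetical, and the strace capture step is left out.

```python
import os
import subprocess
import time

# Assumed threshold; the notes do not state what the real script uses.
STALL_SECONDS = 600


def is_stalled(log_path, now=None, threshold=STALL_SECONDS):
    """Return True if the log file was last modified more than `threshold` seconds ago."""
    now = time.time() if now is None else now
    return (now - os.path.getmtime(log_path)) > threshold


def check_once(log_path, restart_cmd, now=None, threshold=STALL_SECONDS,
               run=subprocess.run):
    """Issue the restart command if the log looks stalled.

    Returns True if a restart was issued, False otherwise.
    `run` is injectable so the check can be exercised without touching a real daemon.
    """
    if not is_stalled(log_path, now=now, threshold=threshold):
        return False
    # Hypothetical restart command, supplied by the caller as an argument list.
    run(restart_cmd, check=False)
    return True
```

A cron job or a simple loop could call `check_once` with the daemon's log path and its restart command at a fixed interval. Passing `run` as a parameter keeps the detection logic testable separately from the restart itself.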
Planned, Scheduled and Cancelled Interventions
Entries in/planned to go to GOCDB: none
- Move Tier1 instances to new Database infrastructure with a Dataguard backup instance in R26
- Upgrade SRMs to 2.11, which incorporates VOMS support
- Certify 2.1.11 and evaluate the Transfer Manager (the new LSF replacement)
- Quattorization of remaining SRM servers
- Hardware upgrade, Quattorization and Upgrade to SL5 of Tier1 CASTOR headnodes
- Castor on Call person: Shaun
- Staff absence/out of the office:
- Chris (all week)
- (Thu) Shaun in DL
- (Thu PM) Matthew working from home