RAL Tier1 weekly operations castor 07/11/2011
Operations News
- Preprod is now working with the stg, lsf and dlf headnodes quattorized
- The new CIP, which fixes the tape capacity over-reporting, was successfully tested and rolled out to production today
Operations Problems
- The ATLAS outage of Oct 29-31 was caused by a bad execution plan and was later fixed by a hint provided by Nilo. PM: https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20111031_Castor_ATLAS_Outage
- On Thursday a problem with the new setup of xroot for aliceDisk was identified and fixed. ALICE were unaffected.
- On Friday morning a second Pluto node was lost after a motherboard burnout.
- On Friday morning the CMS jobmanager activity ceased for approx. 1 hour. The strace script picked this up but did not provide any useful information. The script has been changed so that it actively restarts the daemon if it is triggered again.
Blocking Issues
- none
Planned, Scheduled and Cancelled Interventions
Entries in/planned to go to GOCDB: none
Advanced Planning
- Move Tier1 instances to new Database infrastructure with a Dataguard backup instance in R26
- Upgrade SRMs to 2.11, which incorporates VOMS support
- Certify 2.1.11 and evaluate the Transfer Manager (the new LSF replacement)
- Quattorization of remaining SRM servers
- Hardware upgrade, Quattorization and Upgrade to SL5 of Tier1 CASTOR headnodes
Staffing
- Castor on Call person: Shaun
- Staff absence/out of the office:
- Chris (all week)
- (Thu) Shaun in DL
- (Thu PM) Matthew working from home