RAL Tier1 weekly operations castor 07/11/2011

Operations News

Preprod is now working with the stg, lsf and dlf headnodes quattorized.
The new CIP, which fixes the tape capacity over-reporting, was successfully tested and rolled out to production today.

The ATLAS outage of Oct 29-31 was caused by a bad execution plan and was later fixed by a hint provided by Nilo. PM: https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20111031_Castor_ATLAS_Outage
On Thursday a problem with the new setup of xroot for aliceDisk was identified and fixed. ALICE were unaffected.
On Friday morning a second Pluto node was lost after a motherboard burnout.
On Friday morning CMS jobmanager activity ceased for approx. 1 hour. The strace script picked this up, but its output did not provide any useful information. The script has since been changed to restart the daemon whenever it is triggered.
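
The watchdog behaviour described for the jobmanager (detect a stall, capture a trace, then restart the daemon) could be sketched roughly as below. This is only an illustration, not the actual script in use at RAL: the 10 minute threshold, the function names and the two callbacks (capture_trace, restart_daemon) are all hypothetical, and how activity is measured is left to the caller.

```python
from typing import Callable

# Assumption: 10 minutes without activity counts as a stall.
STALL_THRESHOLD_S = 600.0


def is_stalled(last_activity_ts: float, now: float,
               threshold_s: float = STALL_THRESHOLD_S) -> bool:
    """Return True if no activity has been seen for longer than threshold_s."""
    return (now - last_activity_ts) > threshold_s


def check_once(last_activity_ts: float, now: float,
               capture_trace: Callable[[], None],
               restart_daemon: Callable[[], None],
               threshold_s: float = STALL_THRESHOLD_S) -> bool:
    """Run one watchdog check.

    If the daemon looks stalled, capture a trace first (so diagnostic
    information is not lost) and then restart it. Returns True if the
    watchdog was triggered, False otherwise.
    """
    if not is_stalled(last_activity_ts, now, threshold_s):
        return False
    capture_trace()    # hypothetical hook, e.g. attach strace to the process
    restart_daemon()   # hypothetical hook, e.g. restart the service
    return True
```

Capturing the trace before the restart keeps whatever diagnostic information is available, even if, as on Friday, it turns out not to be useful.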

Entries in/planned to go to GOCDB: none

Move Tier1 instances to the new Database infrastructure, with a Dataguard backup instance in R26
Upgrade SRMs to 2.11, which incorporates VOMS support
Certify 2.1.11 and evaluate the Transfer Manager (the new LSF replacement)
Quattorization of remaining SRM servers
Hardware upgrade, Quattorization and Upgrade to SL5 of Tier1 CASTOR headnodes

Castor on Call person: Shaun
Staff absence/out of the office:
- Chris (all week)
- (Thu) Shaun in DL
- (Thu PM) Matthew working from home