RAL Tier1 weekly operations castor 17/10/2011

Operations News

A new disk pool, aliceDisk (100TB), went into production (created by redeploying existing disk servers from aliceTape) and is now being used by Alice for processing jobs.
Initial testing of SRM 2.11 was successful.

The database corruption problem recurred on Tuesday morning at 08:43. The fix was applied as soon as the problem was noticed, and we were back in production at 10:00. A new Nagios callout test has been deployed which alerts when the problem appears, so we can act even faster if it happens again. A new hypothesis is that draining disk servers may have been the cause. As a precautionary measure, we will no longer drain disk servers outside working hours.
It was discovered that the CIP has been publishing 1024x the real tape capacity since the 2.1.10-1 upgrade, due to changes in the CASTOR code.
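As a rough illustration of the size of this error, the sketch below rescales a published figure by the stated factor of 1024. The function name and the sample value are hypothetical, and the units involved are an assumption; the report only says that the published capacity is 1024x too large.

```python
# Over-reporting factor stated in the report (published = 1024 x real).
OVERREPORT_FACTOR = 1024


def corrected_tape_capacity(published: float) -> float:
    """Return the real capacity for a published value inflated by 1024x.

    Hypothetical helper: the result is in the same (unspecified) unit
    as the input.
    """
    if published < 0:
        raise ValueError("capacity cannot be negative")
    return published / OVERREPORT_FACTOR


if __name__ == "__main__":
    # Illustrative value only, not a real published figure.
    print(corrected_tape_capacity(5 * 1024))
```

A single factor of 1024 is what one step between binary unit prefixes would produce, but the report does not confirm that this is the cause.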

We need to understand the cause of the new database disk array hardware problem before we can migrate production databases over to it.

Entries in/planned to go to GOCDB: none

Move Tier1 instances to the new database infrastructure, which has a Dataguard backup instance in R26
Upgrade SRMs to 2.11, which incorporates VOMS support
Certify 2.1.11 and evaluate the Transfer Manager (the new LSF replacement)
Quattorization of remaining SRM servers
Hardware upgrade, Quattorization and upgrade to SL5 of Tier1 CASTOR headnodes