RAL Tier1 weekly operations castor 17/10/2011
From GridPP Wiki
- A new disk pool, aliceDisk (100TB), went into production (built by redeploying existing disk servers from aliceTape) and is now being used by ALICE for processing jobs
- Initial testing of SRM 2.11 completed successfully
- A recurrence of the database corruption problem occurred on Tuesday morning at 08:43. As soon as it was noticed the fix was applied, and we were back in production by 10:00. A new Nagios callout test has been deployed that alerts when the problem appears, so we can act even faster if it happens again. A new hypothesis is that draining disk servers may be a cause; as a precautionary measure, we will no longer drain outside working hours.
- It was discovered that the CIP has been publishing 1024 times the real tape capacity since the 2.1.10-1 upgrade, due to changes in the CASTOR code.
- We need to understand the cause of the new database disk array hardware problem before we can migrate production databases over to it.
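The exact CASTOR code change behind the CIP capacity bug isn't detailed above, but a factor of exactly 1024 typically indicates a unit mismatch between the value a backend reports and the conversion a publisher applies. The sketch below is a hypothetical illustration of that failure mode (the variable names and the assumed KiB-to-bytes conversion are illustrative, not taken from the CASTOR or CIP source):

```python
KiB = 1024
TiB = 1024 ** 4

real_capacity_bytes = 10 * TiB  # example: a 10 TiB tape pool

# Assumed pre-upgrade behaviour: the backend reports capacity in KiB,
# and the publisher converts to bytes by multiplying by 1024.
old_backend_value = real_capacity_bytes // KiB   # value in KiB
published_old = old_backend_value * KiB          # correct: bytes

# Assumed post-upgrade behaviour: the backend now reports bytes directly,
# but the publisher still applies the KiB-to-bytes conversion.
new_backend_value = real_capacity_bytes          # value already in bytes
published_new = new_backend_value * KiB          # inflated by exactly 1024x

print(published_new // published_old)  # 1024
```

A mismatch of this kind is invisible to simple sanity checks unless the published value is compared against an independently known total, which is consistent with the bug only being noticed after the upgrade.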
Planned, Scheduled and Cancelled Interventions
Entries in / planned to go to GOCDB: none
- Move Tier1 database instances to the new database infrastructure, with a Dataguard backup instance in R26
- Upgrade the SRMs to 2.11, which incorporates VOMS support
- Certify 2.1.11 and evaluate the Transfer Manager (the new LSF replacement)
- Quattorization of remaining SRM servers
- Hardware upgrade, Quattorization and Upgrade to SL5 of Tier1 CASTOR headnodes
- Castor on Call person: Matthew
- Staff absence/out of the office:
- Shaun at EUDAT (Mon-Wed)