RAL Tier1 weekly operations castor 24/10/2011
From GridPP Wiki
WAN tuning changes were rolled out to approximately half of the production disk servers on the 21st. It remains to be seen whether they have improved transfer rates.
- Three CMS disk servers (gdss303, gdss304, gdss305) were found to hold a large amount of dark data. They had been redeployed from another instance after cleanLostFiles was run on them, but without waiting for garbage collection to run. In future, the CASTOR team will wipe the data partitions of redeployed disk servers with "rm -rf" to avoid this problem.
- Database hardware problems on Saturday brought down all instances of CASTOR. Service was restored on Sunday after hardware reconfiguration.
- We need to understand the cause of the new database disk array hardware problem before we can migrate production databases over to it.
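As a rough illustration of how dark data like that found on the CMS disk servers could be spotted, the sketch below walks a data partition and reports files that do not appear in a list of paths the catalogue knows about. It is not CASTOR tooling: the function name `find_dark_files`, the `data_root` argument and the `catalogued_paths` list are all hypothetical, and how such a list would be exported from the stager or name server is assumed rather than shown.

```python
import os


def find_dark_files(data_root, catalogued_paths):
    """Return sorted paths under data_root that are absent from catalogued_paths.

    catalogued_paths is a hypothetical list of absolute file paths that the
    catalogue believes live on this disk server (an assumption of this sketch).
    """
    # Normalise so that trivially different spellings of a path still match.
    known = {os.path.normpath(p) for p in catalogued_paths}
    dark = []
    for dirpath, _dirnames, filenames in os.walk(data_root):
        for name in filenames:
            full = os.path.normpath(os.path.join(dirpath, name))
            if full not in known:
                # On disk but unknown to the catalogue: a dark data candidate.
                dark.append(full)
    return sorted(dark)
```

The function only reports candidates and deletes nothing, so its output can be reviewed before any clean-up is done.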
Planned, Scheduled and Cancelled Interventions
Entries in/planned to go to GOCDB: none
- Move Tier1 instances to the new database infrastructure, with a Dataguard backup instance in R26
- Upgrade SRMs to 2.11, which incorporates VOMS support
- Certify 2.1.11 and evaluate the Transfer Manager (the new LSF replacement)
- Quattorization of remaining SRM servers
- Hardware upgrade, Quattorization and upgrade to SL5 of Tier1 CASTOR headnodes
- Castor on Call person: Matthew
- Staff absence/out of the office:
- Matthew at LTUG (Wed) and in DL (Fri)