RAL Tier1 weekly operations castor 12/12/2011
From GridPP Wiki
- NS/VDQM/VMGR successfully upgraded to 2.1.11-8 on certification and functionally tested against the 2.1.10-1 stager. Next step is to upgrade the stager.
- Hardware for new NS now set up and working
- On Tue morning, during high load on ATLAS, the DB team were alerted to session deadlocks on the SRM schema. Following the established workaround, SRM daemons were restarted on all ATLAS SRMs, which fixed the situation. Although there were some FTS transfer failures, we are not aware of users being disrupted.
- On Tue late afternoon there were more session deadlock problems, and a GGUS ticket was raised against RAL. On this occasion, the problem disappeared without us doing anything.
- On Wed the primary DNS failed and this especially affected ATLAS. After changing the DNS lookup order, the situation improved.
- On Thu afternoon the CMS mighunter stopped working for unknown reasons. Investigations continuing.
Planned, Scheduled and Cancelled Interventions
Entries in/planned to go to GOCDB: none
- Move Tier1 instances to new Database infrastructure with a Dataguard backup instance in R26
- Upgrade SRMs to 2.11, which incorporates VOMS support
- Certify 2.1.11 and evaluate the Transfer Manager (the new LSF replacement)
- Quattorization of remaining SRM servers
- Hardware upgrade, Quattorization and upgrade to SL5 of Tier1 CASTOR headnodes
- Castor on Call person: Shaun
- Staff absence/out of the office: