RAL Tier1 weekly operations castor 12/12/2011

Operations News

NS/VDQM/VMGR successfully upgraded to 2.1.11-8 on the certification instance and functionally tested against the 2.1.10-1 stager. The next step is to upgrade the stager.
Hardware for the new NS is now set up and working.

On Tue morning, during high load on ATLAS, the DB team were alerted to session deadlocks on the SRM schema. Following the established workaround, SRM daemons were restarted on all ATLAS SRMs, which fixed the situation. Although there were some FTS transfer failures, we are not aware of any users being disrupted.
On Tue late afternoon there were more session deadlock problems, and a GGUS ticket was raised against RAL. On this occasion, the problem disappeared without us doing anything.
On Wed the primary DNS failed and this especially affected ATLAS. After changing the DNS lookup order, the situation improved.
On Thu afternoon the CMS mighunter stopped working for unknown reasons. Investigations continuing.

Entries in/planned to go to GOCDB: none

Move Tier1 instances to the new database infrastructure, with a Dataguard backup instance in R26
Upgrade SRMs to 2.11, which incorporates VOMS support
Certify 2.1.11 and evaluate the Transfer Manager (the new LSF replacement)
Quattorization of remaining SRM servers
Hardware upgrade, Quattorization and upgrade to SL5 of Tier1 CASTOR headnodes