RAL Tier1 weekly operations castor 19/12/2011

Operations News

As a result of the workaround applied on Thu, the new DB hardware is now in full production for the ATLAS SRM.

On Wed night there were further problems with the Chilton DNS server. This time the DB machines were affected, because their DNS lookup order had not been changed after the previous week's DNS problems.
On Thu morning, the ATLAS SRM schema started consuming far more resources than it should. This appears to be a bug in the Oracle Resource Manager. The SRM schema was moved to the new hardware, where it is now running in production.
On Fri morning, CMS and Gen started malfunctioning due to one node in the Pluto database becoming unresponsive and affecting the whole rack.
The CIP stopped publishing new data as it was not reconfigured to point to the new DB after the ATLAS SRM was moved.

Entries in/planned to go to GOCDB

None

Move Tier1 instances to the new database infrastructure, with a Dataguard backup instance in R26
Upgrade SRMs to 2.11, which incorporates VOMS support
Certify 2.1.11 and evaluate the Transfer Manager (the new LSF replacement)
Quattorization of remaining SRM servers
Hardware upgrade, Quattorization and upgrade to SL5 of Tier1 CASTOR headnodes