RAL Tier1 weekly operations CASTOR 25/04/2011
- On 15/4/11 the ATLAS SRM began underperforming because the Oracle optimizer was choosing poor execution plans. ATLAS was put into 6 hours of downtime.
- On 20/4/11 the co-location of the NS and ATLAS schemas on the same node made the node unresponsive, affecting all users for 1 hour. The NS was moved to another node and the affected node was rebooted.
- On 21/4/11 the ATLAS SRM once again became too slow, again because the database statistics were stale and the default execution plan needed to be changed. ATLAS was put into 2 hours of downtime.
- (Gen stager db problems over weekend - details to be completed)
- The lack of production-class hardware running Oracle 10g needs to be resolved before CASTOR for Facilities goes into full production. The hardware has arrived and we are awaiting its installation.
Planned, Scheduled and Cancelled Interventions
Entries in, or planned to go into, the GOCDB
- Upgrade of CASTOR clients on WNs to 2.1.10-0
- Upgrade tape subsystem to 2.1.10-1 which allows us to support files >2TB
- Move Tier1 instances to the new database infrastructure, with a Data Guard backup instance in R26
- Upgrade Facilities instance to 2.1.10-0
- Move Facilities instance to the new database hardware running Oracle 10g
- Upgrade SRMs to 2.10-3, which incorporates VOMS support
- Start migrating from T10KA to T10KC media later this year
- Quattorization of remaining SRM servers
- Hardware upgrade and Quattorization of CASTOR headnodes
- Castor on Call person: Chris
- Staff absence/out of the office:
- Shaun A/L (all week)