RAL Tier1 weekly operations castor 18/04/2011

Operations News

CMS reported transfer failures, clustered across different disk servers on the Mon/Tue and Tue/Wed nights at very similar times. These correlated with Nagios packet-loss test failures, indicating networking problems.
On Friday ATLAS reported problems connecting to the SRM. The SRM database was performing badly under the high load from ATLAS, and ATLAS was put into ~4 hours of downtime. Locking the statistics improved performance.
The new host certificate and key installed on gdss66 (cmsFarmRead) didn't match, resulting in transfer failures. There should be a check that the host certificate and key match when renewing certificates or deploying new disk servers.
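
A check of the kind proposed for gdss66 could be sketched as below. This is only an illustration, not an existing Tier1 tool: the function name cert_matches_key is hypothetical, and it assumes PEM files with an unencrypted private key. It relies on Python's standard ssl module, whose load_cert_chain call raises ssl.SSLError when the private key does not belong to the certificate.

```python
import ssl


def cert_matches_key(cert_path, key_path):
    """Return True if the PEM private key belongs to the PEM certificate.

    Hypothetical helper for illustration. Returns False both when the
    pair does not match and when either file is not valid PEM, so a
    False result always means "do not deploy this pair". A missing file
    raises OSError (FileNotFoundError) rather than returning False.
    """
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    try:
        # load_cert_chain verifies that the key matches the certificate
        # and raises ssl.SSLError if it does not.
        ctx.load_cert_chain(certfile=cert_path, keyfile=key_path)
    except ssl.SSLError:
        return False
    return True
```

Such a function could be called from a deployment or renewal script with the paths of the newly installed host certificate and key (for example hostcert.pem and hostkey.pem, assumed names), refusing to proceed when it returns False.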

The lack of production-class hardware running Oracle 10g needs to be resolved before CASTOR for Facilities goes into full production. The hardware has arrived and is awaiting installation.

Entries in/planned to go to GOCDB

Upgrade of CASTOR clients on WNs to 2.1.10-0
Upgrade tape subsystem to 2.1.10-1 which allows us to support files >2TB
Move Tier1 instances to new database infrastructure with a Data Guard backup instance in R26
Upgrade Facilities instance to 2.1.10-0
Move Facilities instance to new Database hardware running 10g
Upgrade SRMs to 2.10-3, which incorporates VOMS support
Start migrating from T10KA to T10KC media later this year
Quattorization of remaining SRM servers
Hardware upgrade and Quattorization of CASTOR headnodes