RAL Tier1 weekly operations castor 02/05/2011

Operations News

On 22-23/4/11 the Gen stager database ran into problems with internal memory. The exact cause is unknown and is being followed up with ORACLE.
On 25/4/11 LHCb tape servers went into UNKNOWN status for 24 hours for unknown reasons. This created a very large backlog of unmigrated files, which in turn caused the lhcbDst service class to run out of space. Other available tape servers were re-assigned to LHCb and LHCb activity on the farm was reduced to improve matters.
On 28/4/11 LSF jobs on gdss457 (atlasScratchDisk) were timing out, resulting in failed reads/writes to this disk server. This appeared to be caused by an LSF problem: after the jobs were killed, service continued as usual.

Lack of production-class hardware running ORACLE 10g needs to be resolved prior to CASTOR for Facilities going into full production. The hardware has arrived and we are awaiting installation.

Entries in/planned to go to GOCDB

Upgrade of CASTOR clients on WNs to 2.1.10-0
Upgrade tape subsystem to 2.1.10-1 which allows us to support files >2TB
Move Tier1 instances to new Database infrastructure with a Dataguard backup instance in R26
Upgrade Facilities instance to 2.1.10-0
Move Facilities instance to new Database hardware running 10g
Upgrade SRMs to 2.10-3 which incorporates VOMS support
Start migrating from T10KA to T10KC media later this year
Quattorization of remaining SRM servers
Hardware upgrade and Quattorization of CASTOR headnodes