RAL Tier1 weekly operations castor 20/12/2010

Operations News

Incorrect checksums were recorded for ~30 LHCb files, leading to errors in the rtcopyd log. This was caused by a bug (fixed in 2.1.9-9) affecting incompletely transferred files. Since the transfer error was originally reported back to the user, this is not considered data corruption on our side.
The Jobmanager stopped working for LHCb for ~45 minutes. Secondary job managers will be enabled for the remaining instances (LHCb, Gen).
The aliceTape GC limits were found to be wrong, leading to a lack of spare capacity. They have been corrected.
On Friday, the ATLAS instance became very busy. The 6 SRMs coped, but a backlog was created in LSF. The FTS was throttled to prevent further congestion.
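
As an illustration of the LHCb checksum item above, the following is a minimal sketch of how a checksum taken over an incompletely transferred file fails to match the complete file. It assumes Adler-32 as the checksum type, which these notes do not state; the helper names (adler32_hex, checksum_matches) and the sample payload are hypothetical and are not part of CASTOR.

```python
import zlib


def adler32_hex(data: bytes) -> str:
    """Return the Adler-32 checksum of data as 8 lowercase hex digits."""
    # Assumption: Adler-32 is used here for illustration only.
    return format(zlib.adler32(data) & 0xFFFFFFFF, "08x")


def checksum_matches(stored: str, data: bytes) -> bool:
    """Compare a stored hex checksum against one recomputed from data."""
    return stored.lower() == adler32_hex(data)


# Hypothetical payload standing in for a file's contents.
full = b"example payload " * 1024
# Simulate a transfer that stopped halfway through.
partial = full[: len(full) // 2]

# A checksum recorded from the truncated copy...
stored = adler32_hex(partial)

# ...does not verify against the complete file.
print("stored checksum:   ", stored)
print("full-file checksum:", adler32_hex(full))
print("match:", checksum_matches(stored, full))
```

A check of this kind only detects the mismatch; deciding whether it counts as corruption still depends on whether the original transfer error was reported to the user, as noted above.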

The lack of production-class hardware running Oracle 10g needs to be resolved before CASTOR for Facilities goes into full production.

Entries in/planned to go to GOCDB

Description	Start	End	Type	Affected VO(s)	Lead by
Update ATLAS disk servers to SL5 64bit (TBC)	17/01/2011 08:00	18/12/2011 16:00	Downtime	ATLAS	MV

CASTOR for Facilities instance in production by end of 2010
Upgrade ATLAS, CMS, Gen disk servers to SL5 64bit and Quattorize the non-Quattorized disk servers
CASTOR certification and upgrade to 2.1.9-10, which incorporates the fix for gridftp-internal to support multiple service classes, enabling checksums for Gen
CASTOR upgrade to 2.1.9-10 and SRM upgrade to 2.10 to fix the "unavailable" status being reported to FTS for draining disk servers