RAL Tier1 weekly operations castor 22/11/2010

Work previous week

Matthew:
- CMS 2.1.9 upgrade planning
- Testing during and after CMS upgrade
Shaun:
- ..
Chris:
- Castor Facilities work
- CMS 2.1.9 upgrade
Richard:
- Working on the 4 CIP servers to apply RPM errata and kernel updates. Discovered in the process that one of them would not reboot unattended (which could, of course, have caused problems for an on-call person).
Brian:
- ..
Jens:
- ..

During testing of CMS after the 2.1.9 upgrade, migration policies were initially being ignored and fewer than 350 files were migrated to the wrong tape pools. This was fixed before the end of the upgrade, but the files remain in the wrong pools.
During the night of 18-19/11/10, a number of CMS disk2disk copies failed due to a known LSF problem. The problem was fixed on Friday morning. We have modified our instance restart procedures to get around this problem.
On 19/11/10, transfers from cmsWanOut were very slow. This was due to a high number of unscheduled disk2disk copies from cmsFarm (51 disk servers) that were swamping the network on the far fewer disk servers in cmsWanOut (5 disk servers). The number of diskcopies was temporarily reduced from 5 to 1.
On 22/11/10, a large number of accesses to hot files on 3 cmsWanOut disk servers resulted in very low data transfer rates. The files were redistributed by putting the disk servers into Draining mode, which helped.

The lack of production-class hardware running ORACLE 10g needs to be resolved before CASTOR for Facilities goes into full production.

Entries in/planned to go to GOCDB

Description	Start	End	Type	Affected VO(s)
Update ATLAS to 2.1.9-6	06/12/2010 08:00	08/12/2010 18:00	Downtime	ATLAS

Deploy new puppetmaster, ideally before ATLAS upgrade
Upgrade ATLAS, CMS, Gen disk servers to 64bit o/s
CASTOR upgrade to 2.1.9-10 and SRM upgrade to 2.10 to fix the unavailable status being reported to FTS with draining disk servers
CASTOR upgrade to 2.1.9-10 which incorporates the fix for gridftp-internal to support multiple service classes, enabling checksums for Gen
CASTOR for Facilities instance in production by end of 2010