RAL Tier1 weekly operations castor 18/10/2010
From GridPP Wiki
Work previous week
- Matthew:
- Debugging and fixing zero-sized file problems on LHCb
- Remaining 2.1.9 upgrade planning
- Shaun:
- ..
- Chris:
- Castor Facilities work
- Castor on duty person
- Preparation for Gen upgrade
- Richard:
- Prepare for testing GEN instance [ongoing]
- Prepare Quattor structure for "cert in a box" [ongoing]
- Brian:
- ..
- Jens:
- ..
Operations Issues
- The LHCb timeouts were the result of I/O contention on the database node running the LHCb stager, caused by the backup script running on the same node. The LHCb stager was moved to a different node on 11/10/10 and RAL were unbanned by LHCb afterwards.
- On 12/10/10 neptune4 rebooted, momentarily affecting the LHCb SRMs.
- On 13/10/10 the index on id2type on the ATLAS stager became corrupted, and the ATLAS instance had to be taken down between 11:15 and 12:49 for the index to be rebuilt.
- On 15/10/10 LHCb reported further cases of zero-sized files. This time the cause was an instance of the stager running on the wrong headnode (LSF). The problem was quickly identified and the 24 affected files were corrected.
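Finding the affected files in incidents like the above amounts to scanning for zero-byte entries. A minimal sketch in Python, assuming a locally mounted filesystem path (the path `/castor/example` is hypothetical, not the actual LHCb namespace):

```python
import os

def find_zero_sized(root):
    """Walk a directory tree and return the paths of zero-byte files."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getsize(path) == 0:
                    hits.append(path)
            except OSError:
                # File vanished or is unreadable mid-scan; skip it.
                pass
    return hits

if __name__ == "__main__":
    # Hypothetical mount point; substitute the real tree to be checked.
    for path in find_zero_sized("/castor/example"):
        print(path)
```

In practice the CASTOR name server would be queried rather than a POSIX mount, but the logic is the same: list files, flag those with size zero for follow-up.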
Blocking issues
- The lack of production-class hardware running Oracle 10g needs to be resolved before CASTOR for Facilities goes into production
Planned, Scheduled and Cancelled Interventions
Entries in/planned to go to GOCDB
| Description | Start | End | Type | Affected VO(s) |
|---|---|---|---|---|
| Update Gen to 2.1.9 | 25/10/2010 08:00 | 27/10/2010 18:00 | Downtime | Gen |
| Update CMS to 2.1.9 (STC) | 08/11/2010 08:00 | 10/11/2010 18:00 | Downtime | CMS |
| Update ATLAS to 2.1.9 (STC) | 22/11/2010 08:00 | 24/11/2010 18:00 | Downtime | ATLAS |
Advanced Planning
- Upgrade disk servers to a 64-bit OS
- Upgrade to 2.1.9-8 after all instances are upgraded to 2.1.9-6
- CASTOR for Facilities instance in production by end of 2010
Staffing
- Castor on Call person: Chris
- Staff absences:
- Shaun (CHEP all week)
- Matthew (Thu PM)
- Jens (Mon-Wed)