RAL Tier1 weekly operations castor 18/10/2010
From GridPP Wiki
Work previous week
- Matthew:
- Debugging and fixing zero-sized file problems on LHCb
- Remaining 2.1.9 upgrade planning
- Shaun:
- ..
- Chris:
- Castor Facilities work
- Castor on duty person
- Preparation for Gen upgrade
- Richard:
- Prepare for testing GEN instance [ongoing]
- Prepare Quattor structure for "cert in a box" [ongoing]
- Brian:
- ..
- Jens:
- ..
Operations Issues
- The LHCb timeouts were the result of I/O contention on the database node running the LHCb stager, caused by the backup script running on the same node. The LHCb stager was moved to a different node on 11/10/10 and RAL were unbanned by LHCb afterwards.
- On 12/10/10 neptune4 rebooted, momentarily affecting the LHCb SRMs.
- On 13/10/10 the index on id2type on the ATLAS stager became corrupted, and the ATLAS instance had to be taken down between 11:15 and 12:49 for the index to be rebuilt.
- On 15/10/10 LHCb reported further cases of zero-sized files. This time the cause was an instance of the stager running on the wrong headnode (LSF). The problem was quickly identified and the 24 affected files were corrected.
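Finding the affected files in incidents like the above amounts to scanning for zero-byte entries. A minimal sketch in Python, assuming a locally mounted filesystem path (the path `/castor/example` is hypothetical, not the actual LHCb namespace):

```python
import os

def find_zero_sized(root):
    """Walk a directory tree and return the paths of zero-byte files."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getsize(path) == 0:
                    hits.append(path)
            except OSError:
                # File vanished or is unreadable mid-scan; skip it.
                pass
    return hits

if __name__ == "__main__":
    # Hypothetical mount point; substitute the real tree to be checked.
    for path in find_zero_sized("/castor/example"):
        print(path)
```

In practice the CASTOR name server would be queried rather than a POSIX mount, but the logic is the same: list files, flag those with size zero for follow-up.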
Blocking issues
- The lack of production-class hardware running Oracle 10g needs to be resolved before CASTOR for Facilities goes into production
Planned, Scheduled and Cancelled Interventions
Entries in/planned to go to GOCDB
| Description | Start | End | Type | Affected VO(s) |
|---|---|---|---|---|
| Update Gen to 2.1.9 | 25/10/2010 08:00 | 27/10/2010 18:00 | Downtime | Gen |
| Update CMS to 2.1.9 (STC) | 08/11/2010 08:00 | 10/11/2010 18:00 | Downtime | CMS |
| Update ATLAS to 2.1.9 (STC) | 22/11/2010 08:00 | 24/11/2010 18:00 | Downtime | ATLAS |
Advanced Planning
- Upgrade disk servers to a 64-bit OS
- Upgrade to 2.1.9-8 after all instances are upgraded to 2.1.9-6
- CASTOR for Facilities instance in production by end of 2010
Staffing
- Castor on Call person: Chris
- Staff absences:
- Shaun (CHEP all week)
- Matthew (Thu PM)
- Jens (Mon-Wed)