RAL Tier1 weekly operations castor 20/09/2010
From GridPP Wiki
Work previous week
- Matthew:
- Upgrade planning and coordination with CERN
- Coordination of test (certification) upgrade
- Debugging a 2.1.9 rmmaster problem that looked like database corruption; the cause was actually corrupted shared memory
- Shaun:
- ..
- Monitoring LHCb production instance
- Chris:
- Preparation for the test upgrade (installing SRM, moving disk servers, preparing Puppet templates)
- Tested upgrade procedure; writing post-upgrade report
- Castor on Duty
- Richard:
- Running 2.1.9 functional tests on CERT instance
- Brian:
- ..
- Jens:
- Upgrade pre-planning for CIP: in principle nothing needs doing, but checking to be certain.
Operations Issues
- LHCb performance degradation: still running acceptably, so farm slots were raised from 800 to 1000. However, one SRM is swapping heavily because it has half the memory of the other SRM; its memory will be increased.
- gdss280 went back into production after an intervention and soon afterwards started reporting fsprobe errors. Checking the checksums of its files showed that data had been corrupted.
- The gdss280 problem shows that we need extra testing before putting disk servers back into production after an intervention.
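As an illustration of the kind of extra check discussed above, the following is a minimal sketch that recomputes file checksums on a disk server and lists the files that no longer match. It assumes Adler-32 checksums and a hypothetical `expected` mapping of file path to recorded checksum; how that mapping would be obtained from Castor is not covered here, and the function names are invented for this example.

```python
import zlib
from pathlib import Path


def adler32_of(path, chunk_size=1 << 20):
    """Return the Adler-32 checksum of a file as 8 lowercase hex digits."""
    value = 1  # Adler-32 starting value for an empty input
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        while chunk := handle.read(chunk_size):
            value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"


def find_corrupt_files(expected):
    """Return a sorted list of paths that are missing or whose checksum differs.

    `expected` is a hypothetical dict mapping file path -> recorded
    hex checksum (an assumption for this sketch, not a Castor API).
    """
    bad = []
    for path, recorded in expected.items():
        if not Path(path).is_file() or adler32_of(path) != recorded.lower():
            bad.append(str(path))
    return sorted(bad)
```

A server would only be returned to production if `find_corrupt_files(expected)` came back empty for a sample of its files; the sample size and the source of the recorded checksums are left open.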
PreProd
- ..
Blocking issues
- Any ongoing production problems will jeopardize the timeline for starting the 2.1.9 upgrades at the end of this month.
Planned, Scheduled and Cancelled Interventions
Entries in/planned to go to GOCDB
Description | Start | End | Type | Affected VO(s) |
---|---|---|---|---|
Update LHCb to 2.1.9 | 27/09/2010 08:00 | 29/09/2010 18:00 | Downtime | LHCb |
Advanced Planning
- Upgrade to 2.1.9 2010
Staffing
- Castor on Call person: Matt
- Staff absences:
- ..