RAL Tier1 weekly operations castor 20/09/2010
Work previous week
- Upgrade planning and coordination with CERN
- Coordination of test (certification) upgrade
- Debugging the 2.1.9 rmmaster problem, which looked like database corruption but was actually corrupted shared memory
- Monitoring LHCb production instance
- Preparation for the test upgrade (installing SRM, moving disk servers, preparing Puppet templates)
- Testing the upgrade procedure and writing the post-upgrade report
- Castor on Duty
- Running 2.1.9 functional tests on CERT instance
- Upgrade pre-planning for CIP: in principle nothing needs doing, but this is being checked to be certain.
- LHCb performance degradation: still running acceptably, so farm slots were raised from 800 to 1000. However, one SRM is swapping heavily because it has half the memory of the other SRM; its memory will be increased.
- gdss280 went back into production after intervention and soon after started displaying fsprobe errors. Checking checksums on its files showed that there had been data corruption.
- The gdss280 problem shows we need extra testing before disk servers that have had an intervention are put back into production.
- Any production problems that are still ongoing will jeopardize the timeline for starting 2.1.9 upgrades at the end of this month.
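As an illustration of the kind of extra check mentioned for gdss280, the sketch below recomputes file checksums and reports any that differ from the expected values. It is not the procedure actually used at the Tier1: it assumes Adler-32 checksums and assumes the expected values are available as a simple path-to-checksum mapping. The function names `adler32_of_file` and `find_corrupt_files` are hypothetical.

```python
import zlib


def adler32_of_file(path, chunk_size=1 << 20):
    """Return the Adler-32 checksum of a file as an 8-digit lowercase hex string."""
    value = 1  # Adler-32 starting value for an empty input
    with open(path, "rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return format(value & 0xFFFFFFFF, "08x")


def find_corrupt_files(expected):
    """Return the sorted paths whose checksum differs from the expected one.

    `expected` is an assumed mapping of file path -> expected hex checksum.
    Files that cannot be read are reported as well.
    """
    bad = []
    for path, want in expected.items():
        try:
            got = adler32_of_file(path)
        except OSError:
            bad.append(str(path))
            continue
        if got != want.lower().zfill(8):
            bad.append(str(path))
    return sorted(bad)
```

A server would only be considered for return to production if `find_corrupt_files` returned an empty list for the files checked.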
Planned, Scheduled and Cancelled Interventions
Entries in/planned to go to GOCDB
|Update LHCb to 2.1.9||27/09/2010 08:00||29/03/2010 18:00||Downtime||LHCb|
- Upgrade to 2.1.9 2010
- Castor on Call person: Matt
- Staff absences: