RAL Tier1 weekly operations castor 20/09/2010

Work previous week

Matthew:
- Upgrade planning and coordination with CERN
- Coordination of test (certification) upgrade
- Debugging 2.1.9 rmmaster problem that looked like database corruption - the actual cause was corrupted shared memory
Shaun:
- ..
- Monitoring LHCb production instance
Chris:
- Preparation for the test upgrade (installing SRM, moving disk servers, preparing Puppet templates)
- Tested upgrade procedure; writing post-upgrade report
- Castor on Duty
Richard:
- Running 2.1.9 functional tests on CERT instance
Brian:
- ..
Jens:
- Upgrade pre-planning for CIP - in principle nothing needs doing, but checking to be certain.

LHCb performance degradation: Still running acceptably, so farm slots raised from 800 to 1000. However, one SRM is swapping heavily because it has half the memory of the other SRM - its memory will be increased.
gdss280 went back into production after intervention and soon after started displaying fsprobe errors. Checking checksums on its files showed that there had been data corruption.
The gdss280 problem shows we need extra testing of disk servers that have had an intervention before they are put back into production.
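
One possible form for such a pre-production test is a checksum sweep over the server's files, similar to the check that exposed the gdss280 corruption. The sketch below is illustrative only and is not the procedure that was used: it assumes Adler-32 checksums and assumes the expected values have already been exported to a path-to-checksum mapping. The function names adler32_of and find_corrupt_files are hypothetical.

```python
import zlib


def adler32_of(path, chunk_size=1 << 20):
    """Return the Adler-32 checksum of a file as an 8-digit hex string."""
    value = 1  # Adler-32 starts from 1
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            # Feed the running value back in so large files are read in chunks.
            value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"


def find_corrupt_files(expected):
    """Compare files on disk against expected checksums.

    expected: mapping of file path -> expected Adler-32 hex string
              (assumed to come from a catalogue dump; hypothetical input).
    Returns a list of (path, expected, actual) for every mismatch;
    actual is None when the file could not be read.
    """
    mismatches = []
    for path, want in expected.items():
        try:
            got = adler32_of(path)
        except OSError:
            got = None
        if got != want.lower().zfill(8):
            mismatches.append((str(path), want, got))
    return mismatches
```

An empty result from find_corrupt_files would be one necessary condition for returning the server to production; it does not replace fsprobe or other hardware tests.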

PreProd

Any ongoing production problems will jeopardize the timeline for starting the 2.1.9 upgrades at the end of this month.

Entries in/planned to go to GOCDB

Description	Start	End	Type	Affected VO(s)
Update LHCb to 2.1.9	27/09/2010 08:00	29/09/2010 18:00	Downtime	LHCb