RAL Tier1 weekly operations castor 04/10/2010
From GridPP Wiki
Revision as of 15:27, 4 October 2010 by Matt viljoen
Work previous week
- Matthew:
- LHCb Upgrade and Testing
- Shaun:
- ..
- Chris:
- LHCb Upgrade and Testing
- Castor Facilities work
- Richard:
- Ran the 2.1.9 functional test suite on the upgraded LHCb instance of CASTOR
- Brian:
- ..
- Jens:
- ..
Operations Issues
- Very heavy load on CMS on 27-28/9/10. Requests were throttled back at Fermilab.
- FTS channels were not requested to be turned on after the LHCb upgrade and stayed closed until 30/9/10.
- gridftp-internal RPMs were missing from 15 upgraded 2.1.9 LHCb disk servers, causing transfers to fail. Fixed on the morning of 30/9/10.
- Wrong checksums were written to the NS and to filesystem attributes after the LHCb upgrade. Checksums were turned off on the morning of 30/9/10. A migration backlog of approx. 1200 files built up because a number of files had wrong checksums. The checksums were manually deleted afterwards and the migration backlog cleared.
- 3 ATLAS SRM server daemons crashed at the same time on 29/9/10 for unknown reasons.
Blocking issues
none
Planned, Scheduled and Cancelled Interventions
Entries in/planned to go to GOCDB
Description | Start | End | Type | Affected VO(s) |
---|---|---|---|---|
Update Gen to 2.1.9 (STC) | 25/10/2010 08:00 | 27/10/2010 18:00 | Downtime | Gen |
Update CMS to 2.1.9 (STC) | 08/11/2010 08:00 | 10/11/2010 18:00 | Downtime | CMS |
Update ATLAS to 2.1.9 (STC) | 22/11/2010 08:00 | 24/11/2010 18:00 | Downtime | ATLAS |
Advanced Planning
- Upgrade to 2.1.9-8 after all instances are upgraded to 2.1.9-6
- CASTOR for Facilities instance in production by end of 2010
Staffing
- Castor on Call person: Matthew
- Staff absences:
- ..