RAL Tier1 weekly operations castor 08/11/2010
From GridPP Wiki
Work previous week
- Matthew:
- ATLAS permissions testing
- Change control for updating ATLAS permissions
- Change control for upgrading ATLAS SRMs
- Change control for upgrading disk servers
- CoD work
- Shaun:
- ..
- Chris:
- Castor Facilities work
- Testing 64-bit disk servers in preProd
- Doing work for Gen Repack
- Richard:
- Running stress tests on pre-prod and facilities instances of CASTOR
- Brian:
- ..
- Jens:
- ..
Operations Issues
- On 1/11/10 the ATLAS SRMs crashed repeatedly because a new, unsupported command (statusOfBringOnlineRequest) was being passed to them. The SRMs were upgraded from 2.8-2 to 2.8-6 on 2/11/10 and the problem has not recurred.
- During the night of 2-3 Nov, CE SAM tests were reported timing out when trying to use the CMS CASTOR instance. This appears to be a recurrence of a problem in which CASTOR is very busy doing disk-to-disk copies. CMS have further limited PhEDEx to stop it staging too many files too quickly.
Blocking issues
- The lack of production-class hardware running Oracle 10g needs to be resolved before CASTOR for Facilities goes into production
Planned, Scheduled and Cancelled Interventions
Entries in/planned to go to GOCDB
Description | Start | End | Type | Affected VO(s) |
---|---|---|---|---|
Upgrade disk servers to Quattorized SL5 64-bit and replace SRM hardware | 10/11/2010 08:00 | 10/11/2010 17:00 | Downtime | LHCb |
Upgrade SRM hardware and add a new SRM (STC) | 11/11/2010 11:00 | 11/11/2010 13:00 | At Risk | LHCb |
Update CMS to 2.1.9-6 | 16/11/2010 08:00 | 18/11/2010 18:00 | Downtime | CMS |
Update ATLAS to 2.1.9-6 (STC) | 06/12/2010 08:00 | 08/12/2010 18:00 | Downtime | ATLAS |
Advanced Planning
- Upgrade all disk servers to a 64-bit OS
- CASTOR upgrade to 2.1.9-10 and SRM upgrade to 2.10 to fix the "unavailable" status being reported to FTS when disk servers are draining
- CASTOR upgrade to the latest 2.1.9 which incorporates the fix for grid-ftp-internal to support multiple service classes, enabling checksums for Gen
- CASTOR for Facilities instance in production by end of 2010
Staffing
- CASTOR on-call person: Chris
- Staff absence/out of the office:
- ..