RAL Tier1 weekly operations castor 03/01/2011
Operations News
- ..
Operations Issues
- On 27/12/10, a large number of atlasSimStrip jobs were pending in LSF. The majority were ‘read’ jobs from a single disk server, gdss488.
- On 28/12/10, the ATLAS SRM var partitions were filling up, so old SRM logs had to be removed. A large number of jobs remained pending in the ATLAS LSF.
- On 29/12/10, the ATLAS LSF partition was filling up, so old logs had to be removed. A large number of jobs remained pending in the ATLAS LSF.
- On 29/12/10, recovered puppetMaster from heavy load.
Blocking issues
- Lack of production-class hardware running ORACLE 10g needs to be resolved prior to CASTOR for Facilities going into full production
Planned, Scheduled and Cancelled Interventions
Entries in/planned to go to GOCDB
Description | Start | End | Type | Affected VO(s) | Lead by |
---|---|---|---|---|---|
Update ATLAS disk servers to SL5 64bit | 17/01/2011 08:00 | 18/01/2011 16:00 | Downtime | ATLAS | MV |
Advanced Planning
- CASTOR for Facilities instance in production by end of 2010
- Upgrade ATLAS, CMS, Gen disk servers to SL5 64bit and Quattorize the non-Quattorized disk servers
- CASTOR certification and upgrade to 2.1.9-10, which incorporates the fix for gridftp-internal to support multiple service classes, enabling checksums for Gen
- CASTOR upgrade to 2.1.9-10 and SRM upgrade to 2.10 to fix the unavailable status being reported to FTS with draining disk servers
Staffing
- CASTOR on-call person: Shaun
- Staff absence/out of the office:
- Matthew A/L