RAL Tier1 weekly operations Fabric 20100906
From GridPP Wiki
Developments
- All:
- Martin:
- Ian:
- Tim:
- Investigate DMF backup issues
- Monitor repack progress
- Monthly stats
- Investigate enlarging ADS virtual tape size
- Tape drive issues
- Jonathan:
- James A:
- James T (last fortnight)
- GridPP 25
- Returned gdss81 to service.
- Resolved tuning errors in Quattor disk server configuration.
- Liaised with Streamline to arrange swap out of all RAID controllers in the Streamline 2009 procurement.
- Fixed mis-configured chkconfig for (3ware) disk servers in Quattor.
- Work on new central loggers, required prior to Atlas power shutdown and to cope with increased log activity after CASTOR 2.1.9 upgrade.
- Cheney:
- Untangle Quattor Gordian knot
- Try to fix multipath error message on preprod DB
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Decommissioning old batch systems (R 27).
- gdss110: replaced 4 x 1 GB memory.
- Replaced 1 drive in Streamline 2009 (Test) disk servers.
- gdss468: replaced drive in port 7 and returned to CASTOR. (Fixed)
- gdss381: returned to CASTOR. (Crashed with single faulty drive)
- gdss280 fixed hardware. (Intervention)
- gdss81: file system went read-only.
- gdss470 and gdss475 crashed with heavy load. (Fixed)
- Hardware failure stats/graphs.
- Preparing Viglen 2006 disk servers with new RAID configuration for CASTOR preprod.
- Streamline/Areca disk servers crashed due to single faulty drive. (Ongoing)
Absences
- Jonathan on partial retirement (not in on Monday and Friday)
- Bank Holiday Monday / Privilege day Tuesday
Operational Issues and Incidents
Index | Description | Start | End | Severity | Affected VO(s) |
---|---|---|---|---|---|
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB
Component | Description | Start | End | Affected VO(s) | Type |
---|---|---|---|---|---|
Development priorities
- All
- Martin:
- Ian:
- Tim:
- Fix DMF backup problems
- Test ADS virtual tape size increase
- Speed up repack
- Facilities CASTOR progress
- Usage numbers for non-LHC systems
- Cheney
- Continue grappling with Quattor Gordian knot
- Prep buxton for swap-in
- Jonathan:
- James T:
- Cover disk faults in Kash's absence.
- Test area risk assessment.
- Test of CASTOR 2.1.9 upgrade on disk servers.
- Plan move of disk servers to UPS power.
- Preparation of new machines needed before Atlas power shutdown.
- Liaise with Streamline regarding testing of 2009 kit.
- James A:
- Kash:
- Drive replacement.
- Fixing broken WNs.
- IPMI web interface configuration with James Thorne.
- gdss470 and gdss475: chase vendor about logs/fault.
- Update daily status of Streamline 2009 disk servers testing.
- Continue decommissioning old batch systems (R 27).
Absences
- Jonathan on partial retirement (not in on Monday and Friday)
Fabric On-Call
- Kashif Hafeez