RAL Tier1 weekly operations Fabric 20110124
From GridPP Wiki
Revision as of 16:34, 21 February 2011 by Kashif hafeez (Talk | contribs)
Developments
- All:
- Martin:
- Ian:
- Tim:
- James A:
- Atlas power off completed with no issues.
- Bulk of ClusterVision WN issues now sorted; trying to push them through acceptance.
- Continued benchmarking.
- James T:
- ATLAS SL5 x86_64 upgrade
- End of acceptance tests on Viglen and Streamline 2010 disks
- AFS documentation
- First aid course Wednesday and Thursday
- Cheney:
- Atlas power shutdown
- Backup performance testing
- Huge backup testing
- Generate solarb stats
- DMF disaster recovery
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Decommissioning old batch systems (R 27).
- gdss380: added new MAC address in DHCP and re-installed.
- gdss189: read-only filesystem (SCSI errors).
- gdss496: SCSI errors. Reported to Streamline with logs. (Intervention)
- gdss451: replaced 4 x 2 GB memory. (Back into production same day)
- Fabric Hardware failure metrics.
- Jetstor systems multiple drive failures.
- gdss576 and gdss577 getting logs to send to Streamline.
- gdss496 post-mortem.
- lcgwms03: replaced drive using hot-swap method.
- gdss283: crashed with file system problem. (Intervention)
- Quattor01 went down. No problem found.
- SL 2010 and Viglen 2010 disk servers in testing.
- SL 2009: auto-rebuild on hot spare fails.
Operational Issues and Incidents
Index | Description | Start | End | Severity | Affected VO(s) |
---|---|---|---|---|---|
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB
Component | Description | Start | End | Affected VO(s) | Type |
---|---|---|---|---|---|
Development priorities
- All:
- Martin:
- Ian:
- Tim:
- Cheney:
- DMF disaster recovery
- James T:
- Disk checks in Kash's absence
- Review testing of Streamline and Viglen 2010 disks
- Preparation for CMS SL5 x86_64 upgrade
- Streamline 2008 testing
- Project management training Thursday and Friday
- James A:
- Pushing ClusterVision WNs through acceptance testing.
- Hardware support while Kash on leave.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- SL 2009: auto-rebuild on hot spare fails.
- Hardware failure metrics continue.
- SL08 testing.
- Continue decommissioning old batch systems (R 27).
Absences
- Kash out Monday (Annual Leave)
- JRHA out Wednesday (Annual Leave)
- James T out Thursday and Friday (project management training)
Fabric On-Call
- Monday - Sunday