Difference between revisions of "RAL Tier1 weekly operations Fabric 20110110"
From GridPP Wiki
Latest revision as of 15:55, 10 January 2011
Editing RAL Tier1 weekly operations Fabric 20110110
Developments
- All:
- Martin:
- Ian:
- Catching up after holiday
- Setting up new database nodes for testing
- Some work on virtualisation tests
- Fixing repository updates
- Generating new errata templates in Quattor
- Tim:
- James A:
- On leave
- James T:
- Post Christmas catch up
- Discovered with Shaun that rsyslog is using TCP for log messages to DLF, which may be the cause of the unresponsive disk servers.
- Tested update of Puppet using Quattor
- SL10/V10 acceptance tests
- Investigated the CERN burn-in tests
- Updated documentation
- Cheney:
- Fixed Nagios on Solaris boxes
- Cleared down the errors on servers from the Xmas week
- Straightened out and checked over backups of various sorts
- Emptied out the Outlook inbox
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Decommissioning old batch systems. (R 27)
- gdss380 still with Streamline for fix. (Crashed with single faulty drive)
- gdss417 acceptance testing. (Crashed with single faulty drive)
- gdss357 replaced memory and Power distribution board with Viglen Engineer.
- Updated wiki for Spares for Xmas period.
- Job plan review.
- Fabric Hardware failure metrics.
- Streamline/Areca disk servers crashed due to a single faulty drive. (Ongoing)
- gdss70 given back to Castor team.
- gdss337 Kernel panic (Faulty memory)
- gdss283 crashed with file system problem. (Intervention)
- gdss68 re-created array but still fails to see the replacement drive. (Probably faulty backplane)
- SL 2010 and Viglen 2010 disk servers in testing.
- gdss496 SCSI errors; out of production.
- SL 2009 Auto rebuild on hotspare fails.
Operational Issues and Incidents
Index | Description | Start | End | Severity | Affected VO(s) |
---|---|---|---|---|---|
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB
Component | Description | Start | End | Affected VO(s) | Type |
---|---|---|---|---|---|
Development priorities
- All:
- Martin:
- Ian:
- Finalising and publishing errata templates
- Help Shaun with Facilities Castor configuration
- Virtualisation/iSCSI testing
- Tim:
- Cheney:
- DMF rsync setup
- DMF Samba users setup
- DMF disaster recovery plan
- Write a mathematical model for disk storage problems
- James T:
- Preparation for ATLAS SL5 64-bit upgrade
- Disk servers as iSCSI targets
- Test Puppet update on kickstarted disk servers
- Puppet -> Quattor migration on disk servers
- James A:
- Catching up after holiday
- Adding new sensors to Artemis
- Working to get new worker nodes into testing
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Job plan review update.
- SL 2009 Auto rebuild on hotspare fails.
- Hardware failure metrics continue.
- Continuing decommissioning of old batch systems. (R 27)
Absences
- Cheney out Tuesday morning.
- James T A/L Thursday
Fabric On-Call
- Kashif Monday - Sunday