RAL Tier1 weekly operations Fabric 20091102
From GridPP Wiki
Summary of week gone
Developments
- All:
- Martin:
- HEPiX
- Ian:
- HEPiX
- James T:
- A/L
- Jonathan:
- updated NIS netgroup lcghosts
- sorted out problems with atlasbackup for some nodes (at least twice for some)
- migrated farm home filesystems from /home/csf to /home/tier1
- Nagios configuration updates
- removed netnag from Sure and callout script
- updated iptables on nodes to allow Nagios monitoring to work
- updated RPM tier1-nagios-plugins and distributed via touch and Quattor
- worked on Quattor configuration for slave server
- rebooted nagger after networking stopped
- James A:
- 95% of time spent on disk server problems.
- 5% helping people with Quattor.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- gdss140 and gdss143 fixed and back in production.
- gdss318: replaced 4 x 2 GB memory. (Fixed)
- gdss126 fixed and given back to Castor.
- Moved touch and scrooge from Atlas (A1 upper) with James A in R89.
- Working on 2008 disk servers and worker nodes.
- Working on gdss67, 86 and 168.
Operational Issues and Incidents
Index | Description | Start | End | Severity | Affected VO(s) |
---|---|---|---|---|---|
 | EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the Castor EMC arrays | Tuesday 6/Oct am | not in sight | Catastrophic | All |
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB
Component | Description | Start | End | Affected VO(s) | Type |
---|---|---|---|---|---|
Development priorities
- All
- Martin:
- Catch up from week at HEPiX
- Finalise Disk procurement ITT evaluation
- Start on CPU procurement ITT evaluation
- EMC array debacle
- Ian:
- Catch up from week at HEPiX
- Quattor workshop in Brussels
- James T:
- Catch up.
- Take over disk server testing from James.
- Preparation for CRISTAL 2.
- Jonathan:
- Quattor implementation for Nagios slave
- check on environment for SL4/SL5 systems
- assist Babar to migrate home filesystems
- Nagios configuration updates
- James A:
- Handing disk server problems back to JIT.
- Quattor workshop in Brussels.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- gdss154: two-drive failure. (Intervention)
- Continuing work on 2008 disk servers and worker nodes.
- Continuing work on gdss67, 86 and 168.
Absences
- Ian
- Quattor workshop (Tues 3rd - Thurs 5th)
- James T
- (Possibly) A/L Friday 6th
- Jonathan
- A/L on Wednesday (4th November)
- James A
- Quattor workshop (Tues 3rd - Thurs 5th)
- Annual Leave (Mon 9th - Fri 20th).
Fabric On-Call
- Mon-Sun: Martin
Advance Warning of Requirements and Blocking Issues
Services Issues
- Various requests for hardware.