RAL Tier1 weekly operations Fabric 20090612
From GridPP Wiki
Revision as of 06:56, 11 July 2009 by Martin bly
Summary of week gone
Developments
- All:
- Away Days
- Martin:
- Preparation and plans for move
- Ian:
- Configured and tweaked transitional Nagios server
- Planning IPMI/management network with James A
- Investigated network traffic over OPN from CERN
- Discussed BDII config with Matt H
- Further Quattor work
- James T:
- Primary on call Mon - Thurs.
- Rolled out scheduled verify system to non-CASTOR disk servers and non-production CASTOR disk servers (nonProd, spare, test,...), both 3ware- and Areca-based.
- Tested hot swap on ALICE software server and handed over to Grid Services. Just needs another rsync of the data from csfnfs58 when it's put into production.
- Merged wiki pages for CASTOR@RAL and Tier1 Experiments Liaison meetings.
- Made changes to disk deployment docs that were agreed with Chris some time ago.
- Jonathan:
- started adding documentation to wiki to explain how to do simple Nagios configuration updates
- worked on method of restoring AFS volume glite-sw if file server (currently afs1) crashes out of hours
- updated Nagios configuration to add new BDII nodes
- repaired MySQL table nagios_logentries
- solved problem of Nagios bleeper not getting mail messages
- followed up problem of mail to valid user being rejected by iCritical after further occurrence (FBU Footprints #19663)
- added times to plan for update of Nagios server
- James A:
- More work to networking and other systems to allow for equipment moves.
- Arranged for hand-over of new systems for acceptance testing starting 15th June.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- gdss125 and 160 have been given back to CASTOR.
- gdss156 ready for production. (Needs to be moved back into rack)
- gdss205: replaced 8x1GB memory. (Given back to CASTOR)
- gdss139, 142, 145, 149, 151 and 211 have been given back to CASTOR.
- Cabling in R27 with James A.
- Working on gdss73, 196, 198, 207, 102.
Operational Issues and Incidents
Index | Description | Start | End | Severity | Affected VO(s) |
---|---|---|---|---|---|
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Entries of type Down, At Risk or Cancelled that are in, or planned to go into, GOCDB
Component | Description | Start | End | Affected VO(s) | Type |
---|---|---|---|---|---|
Castor | Changing castor disk servers (Compusys05, Viglen06, Viglen07 AMD) to temporary network prior to the machine room migration | 18/06 10:00 | 18/06 11:30 | All | At Risk |
Development priorities
- All
- Preparation for h/w move
- Cabling in R89
- Martin:
- Preparation and plans for move
- Ian:
- Set up notifications on transitional Nagios server
- Set up basic grid services tests
- Further planning of Fabric Management system
- James T:
- Primary on call Mon - Thurs.
- Recovery (and other) documentation.
- Acceptance testing new disk hardware.
- Jonathan:
- complete adding simple Nagios configuration documentation to wiki
- continue work building Nagios binaries on nagger
- Nagios configuration updates as required
- complete work on method of restoring AFS volume glite-sw if file server (currently afs1) crashes out of hours and document
- resurrect plan to move home filesystem to new server
- create SL5 version of tier1-sendmail-config RPM
- continue to plan AFS migration from Kerberos 4 to Kerberos 5
- James A:
- Working with suppliers to ensure hand-over of new systems into acceptance testing.
- Move remaining ARTEMIS sensors in Atlas to new IPs.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Continuing work on gdss73, 192, 196, 198, 207, 102, 226.
Absences
- None
Fabric On-Call
- James T
Advanced Warning of Requirements and Blocking issues
Services Issues
- RT# 38567 - Dedicated WN for Alice (SW area + gridftp area)
- RT# 40180 - Resurrect PPS hardware
- Three units powered up
- RT# 44835 - Non-capacity HW for testing (Services)