RAL Tier1 weekly operations Fabric 20090619
From GridPP Wiki
Revision as of 21:42, 7 July 2009 by Martin bly (Talk | contribs)
Summary of week gone
Developments
- All:
- Move preparation
- Martin:
- Preparation and plans for move
- Ian:
- Fabric Management Plan
- Preparation for move
- Work on Quattor with Derek
- Deployed first BDII
- Finalising netnag Nagios monitoring for move
- James T:
- Fabric on call Mon – Thurs
- Added a "force" feature to the verify system for use during interventions if necessary
- Move worksheet completion
- Disk server pre/post move procedures
- Script to quiesce CASTOR/LSF on disk servers.
- Jonathan:
- Completed tests of the method for restoring AFS volume glite-sw if the file server (currently afs1) crashes; the method now needs to be documented
- On sl4sys32, installed RPMs tier1-yum-lcg-ca-certs, ca_UKeScienceRoot-2007 and ca_UKeScienceCA-2007, and changed /etc/krb5.conf to allow Kerberos authentication for access to the SVN repository
- Nagios configuration update
- repaired MySQL table nagios_logentries
- created configuration for minimal Nagios server (netnag)
- James A:
- Working with suppliers to ensure hand-over of new systems into acceptance testing.
- Moved some of the remaining ARTEMIS sensors in Atlas to new IPs.
- Continued laying network cables in preparation for rack moves.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Worked with Streamline Engineer in R89.
- gdss156 ready for production. (Need to move back into rack)
- gdss81: two drive failures (drives replaced); server given back to CASTOR.
- Cabling in R89 with James A.
- Working on gdss73, 196, 198, 207, 102.
Operational Issues and Incidents
Description | Start | End | Affected VO(s) | Severity |
---|---|---|---|---|
gdss245: the / (root) file system went read-only. Kash could not find a hardware fault. The server was re-installed with verifies turned on to try to weed out any problems. | | | | |
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB
Description | Start | End | Affected VO(s) | Severity |
---|---|---|---|---|
None |
Development Priorities
- All
- Move
- Martin:
- Move
- Ian:
- Quattor/QWG work
- James T:
- Primary on call Mon - Thurs.
- Recovery (and other) documentation.
- Acceptance testing new disk hardware.
- Jonathan:
- James A:
- Monitoring relocation of Streamline systems to R89.
- Continue laying network cables in preparation for CASTOR rack moves.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Continuing work on gdss73, 192, 196, 198, 207, 102, 226.
Absences
- JW: A/L Monday
Fabric On-Call
- Mon-Thur: James T
- Fri-Sun: Ian
Advanced Warning of Requirements and Blocking issues
Services Issues
- RT# 38567 - Dedicated WN for Alice (SW area + gridftp area):
- Ongoing
- RT# 40180 - Resurrect PPS hardware
- Three units powered up
- RT# 44835 - Non-capacity HW for testing (Services)