RAL Tier1 weekly operations Fabric 20090605
From GridPP Wiki
Revision as of 07:19, 11 July 2009 by Martin bly (Talk | contribs)
Summary of week gone
Developments
- All
- R89 cabling
- Settling in to R89
- Martin:
- Ian:
- Scheduling network intervention post-mortem
- Primary on-call over weekend
- Fabric Management Working Group first meeting
- Built secondary Nagios server to monitor network during transitions
- Photos of machine room
- James T:
- Discussed and decided on roll-out of scheduled verifies.
- Packaged scheduled verify system, waiting for James A. to make required changes to OverWatch.
- Built ALICE software server (need to test hot swap before handing it to Grid Services)
- Discussed and decided on disk intervention procedure with the CASTOR team and Kashif.
- Jonathan:
- completed update to TierOneSystemRecoveryProcedures wiki documentation and updated Strategic Action 61
- fixed problems with various farm nodes
- located files written by Marian describing installation of glite-ui as a virtual machine
- Nagios configuration updates
- got replacement number for pager 1, updated oncall program
- created kickstart files for installation of new Nagios master server, installed and started build of Nagios v3.0.6
- corrected space problem on lcgsql0363 by removing directory /var/lib/mysql/nagios29_sav
- repaired MySQL table nagios_logentries by deleting old entries
- Safety Refresher Training course and R89 Safety Tour
- James A:
- Continued to assist engineers where necessary.
- Continued to lay networking for CASTOR/ADS racks with assistance from team.
- Assisted with connecting new cables to old equipment in A5 lower to allow migration of systems to R89.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- gdss338 ready for deployment.
- gdss160 has passed its acceptance test. Presently verifying.
- gdss125, gdss156 ready for production.
- gdss140 given back to CASTOR.
- gdss212 and gdss255: replaced 8x1GB memory (given back to CASTOR).
- Working on gdss73, 139, 151, 196, 198, 207, 211, 102, 205.
Operational Issues and Incidents
Index | Description | Start | End | Severity | Affected VO(s) |
---|---|---|---|---|---|
 | Too few NFS daemons on lcg0616 (CMS s/w server) caused hung mounts on WNs. Fixed quickly and jobs continued to run. | | | | |
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB
Component | Description | Start | End | Affected VO(s) | Type |
---|---|---|---|---|---|
Development priorities
- All
- Cabling in R89
- EScience Away Day (Thursday/Friday) – not Martin/Kash
- Martin:
- Ian:
- Configure Nagios transitional network monitor
- Further Fabric management project work
- Preliminary look at Morgan Stanley Aquilon tool
- Plan IPMI/management network with James A
- James T:
- Primary on call Mon - Thurs.
- Test new disk.
- Recovery documentation.
- Roll out scheduled verify system to non-CASTOR disk servers and non-production CASTOR disk servers (nonProd, spare, test,...).
- Test hot swap on ALICE software server and hand over to Grid Services.
- Jonathan:
- add documentation to wiki to explain how to do simple Nagios configuration updates
- continue to build Nagios binaries on nagger
- add timings to plan for migration of Nagios master server to new hardware and Nagios version 3
- Nagios configuration updates as required
- resurrect plan to move home filesystem to new server
- create SL5 version of tier1-sendmail-config RPM
- continue to plan AFS migration from Kerberos 4 to Kerberos 5
- James A:
- More work on networking and other systems to allow for equipment moves.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Continuing work on gdss73, 139, 151, 196, 198, 207, 211, 102, 205.
Absences
- Kash A/L Friday
- All bar Martin & Kash out Thursday/Friday
Fabric On-Call
Advanced Warning of Requirements and Blocking issues
Services Issues
- RT# 38567 - Dedicated WN for ALICE (SW area + gridftp area):
- RT# 40180 - Resurrect PPS hardware
- Three units powered up
- RT# 44835 – non-capacity HW for testing (Services)