RAL Tier1 weekly operations Fabric 20090605

Summary of week gone

Developments

All
- R89 cabling
- Settling in to R89
Martin:
Ian:
- Scheduling network intervention post-mortem
- Primary on-call over weekend
- Fabric Management Working Group first meeting
- Built secondary Nagios server to monitor network during transitions
- Photos of machine room
James T:
- Discussed and decided on roll-out of scheduled verifies.
- Packaged scheduled verify system, waiting for James A. to make required changes to OverWatch.
- Built ALICE software server (need to test hot swap before handing it to Grid Services)
- Discussed and decided on disk intervention procedure with the CASTOR team and Kashif.
Jonathan:
- completed update to TierOneSystemRecoveryProcedures wiki documentation and updated Strategic Action 61
- fixed problems with various farm nodes
- located files written by Marian describing installation of glite-ui as a virtual machine
- Nagios configuration updates
- got replacement number for pager 1, updated oncall program
- created kickstart files for installation of new Nagios master server, installed and started build of Nagios v3.0.6
- corrected space problem on lcgsql0363 by removing directory /var/lib/mysql/nagios29_sav
- repaired MySQL table nagios_logentries by deleting old entries
- Safety Refresher Training course and R89 Safety Tour
James A:
- Continued to assist engineers where necessary.
- Continued to lay networking for CASTOR/ADS racks with assistance from team.
- Assisted with connecting new cables to old equipment in A5 lower to allow migration of systems to R89.
Kash:
- Drive replacement.
- Fixing broken WNs.
- gdss338 ready for deployment.
- gdss160 has passed its acceptance test. Presently verifying.
- gdss125, gdss156 ready for production.
- gdss140 given back to CASTOR.
- gdss212 and gdss255: replaced 8x1GB memory (given back to CASTOR).
- Working on gdss73, 139, 151, 196, 198, 207, 211, 102, 205.

Operational Issues and Incidents

Index	Description	Start	End	Severity	Affected VO(s)
	Too few NFS daemons on lcg0616 (CMS s/w server) caused hung mounts on WNs. Fixed quickly and jobs continued to run

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Entries of type Down, At Risk or Cancelled that are in, or planned to go into, GOCDB

Component	Description	Start	End	Affected VO(s)	Type

Development priorities

All
- Cabling in R89
- EScience Away Day (Thursday/Friday) – not Martin/Kash
Martin:
Ian:
- Configure Nagios transitional network monitor
- Further Fabric management project work
- Preliminary look at Morgan Stanley Aquilon tool
- Plan IPMI/management network with James A
James T:
- Primary on call Mon - Thurs.
- Test new disk.
- Recovery documentation.
- Roll out scheduled verify system to non-CASTOR disk servers and non-production CASTOR disk servers (nonProd, spare, test,...).
- Test hot swap on ALICE software server and hand over to Grid Services.
Jonathan:
- add documentation to wiki to explain how to do simple Nagios configuration updates
- continue to build Nagios binaries on nagger
- add timings to plan for migration of Nagios master server to new hardware and Nagios version 3
- Nagios configuration updates as required
- resurrect plan to move home filesystem to new server
- create SL5 version of tier1-sendmail-config RPM
- continue to plan AFS migration from Kerberos 4 to Kerberos 5
James A:
- More work on networking and other systems to allow for equipment moves.
Kash:
- Drive replacement.
- Fixing broken WNs.
- Continuing work on gdss73, 139, 151, 196, 198, 207, 211, 102, 205.

Absences

Kash A/L Friday
All bar Martin & Kash out Thursday/Friday

Fabric On-Call

Advance Warning of Requirements and Blocking issues

Services Issues

RT# 38567 - Dedicated WN for ALICE (SW area + gridftp area):
RT# 40180 - Resurrect PPS hardware
- Three units powered up
RT# 44835 - non-capacity HW for testing (Services)

Category:RAL_Tier1
