RAL Tier1 weekly operations Fabric 20090612

Summary of week gone

Developments

All:
- Away Days
Martin:
- Preparation and plans for move
Ian:
- Configured and tweaked transitional Nagios server
- Planning IPMI/management network with James A
- Investigated network traffic over OPN from CERN
- Discussed BDII config with Matt H
- Further Quattor work
James T:
- Primary on call Mon - Thurs.
- Rolled out scheduled verify system to non-CASTOR disk servers and non-production CASTOR disk servers (nonProd, spare, test,...), both 3ware- and Areca-based.
- Tested hot swap on ALICE software server and handed over to Grid Services. Just needs another rsync of the data from csfnfs58 when it's put into production.
- Merged wiki pages for CASTOR@RAL and Tier1 Experiments Liaison meetings.
- Made changes to disk deployment docs that were agreed with Chris some time ago.
Jonathan:
- started adding documentation to wiki to explain how to do simple Nagios configuration updates
- worked on method of restoring AFS volume glite-sw if file server (currently afs1) crashes out of hours
- updated Nagios configuration to add new BDII nodes
- repaired MySQL table nagios_logentries
- solved problem of Nagios bleeper not getting mail messages
- followed up problem of mail to valid user being rejected by iCritical after further occurrence (FBU Footprints #19663)
- added times to plan for update of Nagios server
James A:
- More work on networking and other systems to allow for equipment moves.
- Arranged for hand-over of new systems for acceptance testing starting 15th June.
Kash:
- Drive replacement.
- Fixing broken WNs.
- gdss125 and 160 have been given back to CASTOR.
- gdss156 ready for production. (Needs to be moved back into rack)
- gdss205: replaced 8x1GB memory. (Given back to CASTOR)
- gdss139, 142, 145, 149, 151 and 211 have been given back to CASTOR.
- Cabling in R27 with James A.
- Working on gdss73, 196, 198, 207, 102.

Operational Issues and Incidents

Index	Description	Start	End	Severity	Affected VO(s)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component	Description	Start	End	Affected VO(s)	Type
Castor	Changing castor disk servers (Compusys05, Viglen06, Viglen07 AMD) to temporary network prior to the machine room migration	18/06 10:00	18/06 11:30	All	At Risk

Development priorities

All
- Preparation for h/w move
- Cabling in R89
Martin:
- Preparation and plans for move
Ian:
- Set up notifications on transitional Nagios server
- Set up basic grid services tests
- Further planning of Fabric Management system
James T:
- Primary on call Mon - Thurs.
- Recovery (and other) documentation.
- Acceptance testing new disk hardware.
Jonathan:
- complete adding simple Nagios configuration documentation to wiki
- continue work building Nagios binaries on nagger
- Nagios configuration updates as required
- complete work on method of restoring AFS volume glite-sw if file server (currently afs1) crashes out of hours and document
- resurrect plan to move home filesystem to new server
- create SL5 version of tier1-sendmail-config RPM
- continue to plan AFS migration from Kerberos 4 to Kerberos 5
James A:
- Working with suppliers to ensure hand-over of new systems into acceptance testing.
- Move remaining ARTEMIS sensors in Atlas to new IPs.
Kash:
- Drive replacement.
- Fixing broken WNs.
- Continuing work on gdss73, 192, 196, 198, 207, 102, 226.

Absences

None

Fabric On-Call

James T

Advance Warning of Requirements and Blocking issues

Services Issues

RT# 38567 - Dedicated WN for Alice (SW area + gridftp area)
RT# 40180 - Resurrect PPS hardware
- Three units powered up
RT# 44835 - non-capacity HW for testing (Services)

Category:RAL Tier1
