RAL Tier1 weekly operations Fabric 20090803
Summary of week gone
Developments
- All
- Martin:
- Procurements
- Further setup and configuration of SAN, arrays and systems for resilient LFC/FTS/3D Oracle services
- Meeting with SGI (Rackable)
- Ian:
- Revised Quattor Workplan
- Production quattor server
- Introducing jrha and jiant to Quattor
- Primary on call
- James T:
- Updated the verify scheduling system. It now:
- Takes account of machines in status retired or intervention.
- Sleeps for a variable time to avoid too many connections to the database.
- Crash course on Quattor configuration of disk servers with Ian
- Lots of work on the disk server acceptance testing.
- Passed some of the checking of machines in testing to Kash. Most of the Streamline machines come out of testing this week.
- New loggers now running. Just a few machines/switches/PDUs logging to old loggers.
- Jonathan:
- A/L
- James A:
- Added RAS's Production Actions to MyActions system.
- Discussed WN deployments with IC.
- Continued learning about TORQUE & MAUI.
- Deployed new MySQL server for MINOS.
- Started network load-testing on new CPU.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- gdss356 replaced 4x2GB memory, given back to castor.
- gdss196 replaced 4-port RAID card.
- gdss166 given back to castor.
- lcg1001-1002 replaced power supply. (Fixed)
- gdss391 replaced faulty fan cable. (Fixed)
- gdss392 replaced faulty memory in RAID card and one faulty drive. (Fixed)
- gdss410 replaced faulty memory DIMM 1A. (Fixed)
- Working on gdss213. (3 drive failures)
- Working on Viglen and Streamline 2008 disk servers.
- Working on gdss73, 196, 198, 243 and 256.
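The verify scheduling update listed under James T above (skip machines in status retired or intervention, and sleep for a variable time before connecting to the database) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual Tier1 code: the function names, the status strings as literal values, and the hostnames in the usage note are all hypothetical.

```python
import random

# Assumption: status values are plain strings matching the report's wording.
SKIPPED_STATUSES = {"retired", "intervention"}


def machines_to_verify(machines):
    """Return hostnames whose status does not exclude them from verification.

    `machines` is an iterable of (hostname, status) pairs (hypothetical shape).
    """
    return [name for name, status in machines if status not in SKIPPED_STATUSES]


def jittered_delay(base_seconds, spread_seconds, rng=random):
    """Pick a variable sleep time so hosts do not all connect at once.

    The result lies between base_seconds and base_seconds + spread_seconds.
    """
    return base_seconds + rng.uniform(0, spread_seconds)
```

A caller would pass the result of `jittered_delay` to `time.sleep` before opening its database connection, for example `time.sleep(jittered_delay(5, 30))`, and run verification only on the hosts returned by `machines_to_verify`.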
Operational Issues and Incidents
Index | Description | Start | End | Severity | Affected VO(s) |
---|---|---|---|---|---|
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB
Component | Description | Start | End | Affected VO(s) | Type |
---|---|---|---|---|---|
Development priorities
- All
- AwayDay Wednesday
- Martin:
- Procurements
- Further setup and configuration of SAN, arrays and systems for resilient non-Castor Oracle service
- Ian:
- Finalising production quattor server
- Quattor deployment - further work on RAL specific components
- Further work on organisation of Quattor templates
- James T:
- Quattor disk server deployment.
- Fabric team "day" and preparation.
- Viglen 2008 disk acceptance testing (mostly Kash though).
- Move switch/PDU syslog to new loggers.
- Decommission old loggers.
- Jonathan:
- Catch up
- reboot AFS servers
- release new versions of tier1-nagios-plugins and tier1-nrpe-config
- Nagios configuration updates
- James A:
- Monday, Thursday & Friday: Focus on QUATTOR.
- Tuesday: Prepare for Fabric Day.
- Wednesday: Fabric Day.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Continuing work on Viglen and Streamline 2008 disk servers.
- Continuing work on gdss73, 196, 198, 243 and 256.
Absences
- All: AwayDay Wednesday: NO FABRIC TEAM COVER EXCEPT IN EMERGENCY (power failure, thermonuclear detonation...)
Fabric On-Call
- Mon-Sun: Ian
Advanced Warning of Requirements and Blocking issues
Services Issues
- RT# 44835 – non-capacity HW for testing (Services)