RAL Tier1 weekly operations Fabric 20090727
From GridPP Wiki
Revision as of 15:02, 27 July 2009 by Martin bly (Talk | contribs)
Summary of week gone
Developments
- All
- Martin:
- Ian:
- Revised Quattor Workplan
- Production quattor server
- Introducing jrha and jiant to Quattor
- James T:
- TOIL on Monday.
- Worked on problems with testing on Viglen 2008 disk - ongoing
- New loggers rolled out, most hosts using them now. Acceptance machines now using a test disk server as syslog server to prevent flooding of the loggers.
- Jonathan:
- Worked on moving logs from csflnx266/270
- Assisted Kashif to move marley hardware from R27 to R89
- Assigned Tier1 userid for Mayo Agard-Olubu
- Nagios configuration updates
- Switched off callouts from netnag
- Switched Nagios MySQL databases from lcgsql0363 to nagiosdb
- Switched off callouts from nincom whilst sysreq and automate servers were being moved
- James A:
- Started looking at Sindes for IC.
- Went through Quattor installs with IC.
- Updated Mimi(c) to use new Nagios database.
- Fixed a few Mimi(c) bugs.
- Started investigating interaction between kernel, TORQUE and MAUI memory limits to understand job kills.
- Applied gmetric-df to software servers.
- Added option to pool account creation script to skip batch system checks at install time.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- gdss356: tested memory, no fault found. (Still in intervention)
- gdss128, 150 and 248: given back to Castor. (Fixed)
- gdss218, 260 and 269: replaced 8 x 1 GB memory; given back to Castor. (Fixed)
- lcgpx0619: replaced 2 drives with new ones and updated BIOS for hot-swapping. (Fixed)
- Working on Viglen and Streamline 2008 disk servers.
- Working on gdss73, 196, 198 and 243.
Operational Issues and Incidents
Index | Description | Start | End | Severity | Affected VO(s) |
---|---|---|---|---|---|
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Entries of type Down, At Risk or Cancelled that are in, or planned to go into, GOCDB
Component | Description | Start | End | Affected VO(s) | Type |
---|---|---|---|---|---|
Development priorities
- All
- Martin:
- Further setup and configuration of SAN, arrays and systems for resilient non-Castor Oracle service
- Procurements
- Ian:
- Finalising production quattor server
- Quattor deployment - further work on RAL specific components
- James T:
- Viglen 2008 disk acceptance testing
- Move switch/PDU syslog to new loggers.
- Decommission old loggers.
- Quattor disk server deployment.
- Jonathan:
- A/L
- James A:
- Add RAS's Production Actions to MyActions system.
- Discuss WN deployments with IC.
- Continue learning about TORQUE & MAUI.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Continuing work on Viglen and Streamline 2008 disk servers.
- Continuing work on gdss73, 196, 198 and 243.
Absences
- Ian:
- James T:
- Jonathan:
- A/L all week
Fabric On-Call
- Ian (Primary on-Call)
Advance Warning of Requirements and Blocking Issues
Services Issues
- RT# 44835 – non-capacity HW for testing (Services)