RAL Tier1 weekly operations Fabric 20091123
From GridPP Wiki
Revision by Martin Bly
Latest revision as of 13:45, 25 November 2009
Summary of week gone
Developments
- All:
- Martin:
- Completed CPU ITT evaluation
- More work on EMC arrays problems
- Benchmarking work on Nehalem chip systems
- Work on test nodes for LFC database resilience tests
- Ian:
- Much work on Quest FP7 bid
- Cleaning up remaining issues with kernel security update
- Quattor tutorial for Castor
- James T:
- Updated PXE boot images for Viglen
- CRISTAL2 preparation
- Kernel updates
- Fixed several Ganglia problems
- Tried out the OSSEC intrusion detection system
- AoD Wednesday
- Jonathan:
- updated BIOS on 11 sv-08 systems to allow hot swap of soft RAID disks and cleared BIOS logs
- worked on developing backup strategy on core systems to improve resilience
- made final dump of csflnx353 to archive tape and shut system down
- corrected backup check scripts on rhubarb
- started Ganglia monitoring on cpre004 (CIP server)
- added Castor specific userids and groups to NIS
- Nagios configuration updates
- updated RPMs tier1-sudo-config, tier1-nrpe-config and tier1-nagios-plugins
- James A:
- A/L
- Kash:
- Drive replacement.
- Fixing broken WNs.
- gdss163 double disk failure (under test).
- gdss368 moved into rack in HPD room with Martin and James T.
- gdss161 given back to Castor.
- Working on 2008 disk servers and worker nodes.
- Working on gdss67, 163 and 282.
Operational Issues and Incidents
Index | Description | Start | End | Severity | Affected VO(s) |
---|---|---|---|---|---|
EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the Castor EMC arrays | Tuesday 6/Oct am | UPS issues to be fixed | Catastrophic | All |
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB
Component | Description | Start | End | Affected VO(s) | Type |
---|---|---|---|---|---|
Development priorities
- All
- Work on evacuating A1 Upper (Castor admin and LSF systems)
- Martin:
- complete CPU ITT evaluation
- testing sample hardware
- install database test boxes
- Ian:
- Wrap up Quattor FP7 bid
- Assist with quattorising new disk servers
- SL5 vobox in quattor
- Look at capacity modelling for Andrew
- James T:
- Progress meeting with Viglen
- Disk server "quattorisation"
- Jonathan:
- further work on developing backup strategy on core systems to improve resilience
- updates to farm to allow migration of Babar functional userids to new home filesystem server
- Quattor implementation for Nagios slave
- security updates to disk servers to prevent general user logins
- Nagios configuration updates
- James A:
- Catching up after A/L.
- Trying to focus on SINDES; the majority of time this week is theoretically ring-fenced for it.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Return gdss67 to the Castor team once testing is finished.
- Continuing work on 2008 disk servers and worker nodes.
- Continuing work on gdss67, 134, 163 and 282.
Absences
- James T:
- CRISTAL2 course (Wednesday - Friday)
Fabric On-Call
- Mon-Sun: James T (Fabric Mon-Thu, Primary Fri-Sun)
Advanced Warning of Requirements and Blocking issues
Services Issues
- Various requests for hardware.
- Working on various hardware requests for Services team.