RAL Tier1 weekly operations Fabric 20091116
From GridPP Wiki
Revision as of 15:50, 16 November 2009 by James thorne (Talk | contribs)
Summary of week gone
Developments
- All:
- Martin:
- Completed Disk Procurement eval
- More work on EMC array problems
- CPU ITT evaluation
- Ian:
- Work on Quest FP7 bid
- Rolling out kernel security update on quattor system
- First look at disk failure stats
- James T:
- Updated ganglia configs for Storage_LHCb and Services_Grid
- Viglen disk server problems
- TOASTER prep
- Jonathan:
- reconfigured NIS servers to allow access to shadow map from any port
- check AFS servers for contacts from compromised Manchester system
- BIOS update for sv-08-06 (to be lcgcc-s3-06)
- sorted out problems with atlasbackup for many nodes
- sorted out ntp configuration problem on t1pg0373
- Nagios configuration updates
- updated tier1-nagios-plugins to version 2.0-58
- gave talks about Nagios to Production Team etc
- James A:
- A/L
- Kash:
- Drive replacement.
- Fixing broken WNs.
- gdss262: replaced 8x1GB memory; fixed and back in production.
- gdss67: needs to run a 7-day test.
- gdss125: given back to Castor.
- gdss413: replaced 4x2GB memory (ready for deployment).
- sl4sys32-sl4sys64 replaced PSU.
- Working on 2008 disk servers and worker nodes.
- Working on gdss67, 163 and 282.
Operational Issues and Incidents
Index | Description | Start | End | Severity | Affected VO(s) |
---|---|---|---|---|---|
 | EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the Castor EMC arrays | Tuesday 6/Oct am | not in sight | Catastrophic | All |
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB
Component | Description | Start | End | Affected VO(s) | Type |
---|---|---|---|---|---|
Development priorities
- All
- Work on evacuating A1 Upper (Castor admin and LSF systems)
- Martin:
- complete CPU ITT evaluation
- testing sample hardware
- install database test boxes
- Ian:
- Further Quattor FP7 work (last week)
- Finish rolling out new kernels on Quattor-managed machines
- Kernels on SL4 batch workers
- Work on CPU procurements
- Castor Quattor tutorial
- James T:
- Viglen disk server problems
- CRISTAL2 preparation
- Catch up on helpdesk tickets and other actions
- Disk server kernel updates
- Jonathan:
- Set up regular checks of backups for home filesystem, AFS volumes and MySQL databases
- Quattor implementation for Nagios slave
- update environment for SL5 systems
- updates to farm to allow Babar functional userids to migrate home filesystem
- Nagios configuration updates
- James A:
- A/L
- Kash:
- Drive replacement.
- Fixing broken WNs.
- gdss67: rebuild from scratch and move to HPD room.
- Continue working on 2008 disk servers and worker nodes.
- Continue working on gdss67, 163 and 282.
Absences
- James A
- Annual Leave (Mon 9th - Fri 20th).
Fabric On-Call
- Mon-Sun: Ian
Advance Warning of Requirements and Blocking issues
Services Issues
- Various requests for hardware.