RAL Tier1 weekly operations Fabric 20091109
Summary of week gone
Developments
- All:
- SSC HR rollout tasks
- Martin:
- Completed Disk Procurement eval
- Work on EMC arrays problem
- Survey of CPU tender responses
- Meeting with Seagate and Viglen re 2008 disk acceptance
- Ian:
- Work on Quest FP7 bid
- Attended Quattor Workshop (and QUEST F2F)
- Making new kernels available to Quattor managed systems
- James T:
- Catch up
- Viglen Disk Server Problems
- CRISTAL2 preparation
- A/L Friday
- Jonathan:
- updated RPMS and rebooted for new kernel on many systems
- sorted out problems with atlasbackup for some nodes
- sorted out Nagios problems for some servers
- arranged for gdss411-413 to be installed as Castor disk servers prior to deployment
- migrated farm home filesystems from /home/csf to /home/tier1 for remaining users (except for bfactory functional userids)
- Nagios configuration updates
- with Kash, shut down nagger and nagiosdb to replace faulty memory in nagger
- James A:
- Quattor Workshop @ Brussels
- Kash:
- Drive replacement.
- Fixing broken WNs.
- gdss154 and 168 fixed and back in production.
- gdss383: replaced 4x2GB memory; fixed and ready for deployment.
- gdss117, 139 and 154 fixed and given back to castor
- gdss67 fixed after a long intervention (replaced with a new 24-port RAID card); data saved. Needs to be rebuilt from scratch.
- gdss411, 412 and 413 fixed and ready for deployment.
- Replaced memory in nagger with Mr. Wheeler.
- Working on 2008 disk servers and worker nodes.
- Working on gdss67, 125, 282 and 403.
Operational Issues and Incidents
Index | Description | Start | End | Severity | Affected VO(s) |
---|---|---|---|---|---|
| EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the Castor EMC arrays | Tuesday 6/Oct am | not in sight | Catastrophic | All |
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB
Component | Description | Start | End | Affected VO(s) | Type |
---|---|---|---|---|---|
Development priorities
- All
- Work on evacuating A1 Upper (Castor admin and LSF systems)
- Martin:
- Move EMC kit
- Spares and additional hardware for database arrays
- CPU ITT evaluation
- Testing sample hardware
- Ian:
- Further Quattor FP7 work (last two weeks)
- Roll out new kernels on Quattor managed machines
- Look at disk stats with Kash
- Work on CPU procurements
- James T:
- Progress meeting with Viglen
- CRISTAL2 preparation
- Disk server cover in Kash's absence
- Catch up on helpdesk tickets and meeting actions
- TOASTER preparation
- Jonathan:
- Quattor implementation for Nagios slave
- update environment for SL5 systems
- updates to farm to allow Babar functional userids to migrate home filesystem
- Nagios configuration updates
- James A:
- A/L
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Rebuild gdss67 from scratch and move it into the HPD room.
- Continue working on 2008 disk servers and worker nodes.
- Continue working on gdss67, 125, 282 and 403.
Absences
- Jonathan
- A/L on Thursday (12th November)
- James A
- Annual Leave (Mon 9th - Fri 20th).
Fabric On-Call
- Mon-Sun: Ian is Primary On-Call
Advance Warning of Requirements and Blocking issues
Services Issues
- Various requests for hardware.