RAL Tier1 weekly operations Fabric 20091221
Summary of week gone
Developments
- All:
- Tier1 Review as required
- Martin:
- Minor procurements, particularly networking
- Planning to move or decommission hosts in A1 Upper
- Ian:
- James T:
- Met with Viglen regarding disk swap out on 2008 procurement.
- Removed Nincom as a ganglia data source for Services_Monitoring.
- Installed tier1-oldprocesskiller on machines that were missing it.
- Fixed drivemap errors on disk servers.
- Disk replacements while Kash was on leave.
- Updated various boxes.
- Backed up /pool on csfsys{a,b} and copied both backups to sl4sys{32,64}.
- Preparation for disk server kernel intervention, including a script to install a workaround for null pointer dereference vulnerabilities.
- Investigated SL3 repos on disk servers (ongoing).
- Jonathan:
- Updated RPMs on AFS servers, including a new kernel.
- Disabled a user's login and removed their SSH key from root login (on some disk servers) after a laptop theft.
- Corrected atlasbackup problem on one node.
- James A:
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Decommissioning old batch systems (R 27).
- Moved srm0383 from Atlas to R89 UPS room.
- gdss105 and gdss354 given back to Castor.
- gdss171 (double disk failure, fixed) given back to Castor.
- gdss79 (fsprobe): started memtest on Friday.
- Moved pallet to Atlas with MJB.
- Working on 2008 disk servers and worker nodes.
- Working on gdss79 and gdss282.
Absences
- Jonathan: S/L Mon-Thu
- Ian: S/L Mon-Wed
- Kash: A/L Thu
Operational Issues and Incidents
| Index | Description | Start | End | Severity | Affected VO(s) |
|---|---|---|---|---|---|
| | EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the Castor EMC arrays | Tuesday 6 Oct am | UPS issues to be fixed | Catastrophic | All |
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB
| Component | Description | Start | End | Affected VO(s) | Type |
|---|---|---|---|---|---|
Development priorities
- All
- Move/decommission hosts in A1 Upper
- Martin:
- Minor procurements
- Ian:
- James T:
- Preparation for disk server intervention on Tuesday 22nd.
- Disk server intervention.
- Job plan bits and bobs.
- Tidy up for Christmas.
- Fabric/on site on call for whole of Christmas break.
- Jonathan:
- Stop export of /home/csf from csfnfs02
- Check for web servers on csfmove02
- Write and test script for active checks of database statuses on peaceful
- James A:
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Continue decommissioning old batch systems (R 27).
- Continue working on 2008 disk servers and worker nodes.
- Continue working on gdss79 and gdss282.
Absences
- All: Fri 25/12 to Fri 1/1. Back 4/1.
- Ian: Mon-Thu
- Jonathan: Tue
Fabric On-Call
- James T: primary on-call until 28th Dec.
Advanced Warning of Requirements and Blocking issues
- Unable to proceed with Atlas TAG migration to 64-bit because the arrays are being used for 3D systems while the EMC kit is flaky.
Services Issues
- Various requests for hardware.
- Working on hardware provision for Services team testbeds.