RAL Tier1 weekly operations Fabric 20100104
Summary of week gone
Developments
- All:
- Martin:
- Minor procurements
- Migrating hardware out of A1 Upper
- Rebooted AFS servers
- Ian:
- A/L
- James T:
- Disk server kernel updates on 22 December.
- Job plan updates and review.
- Primary on call over Christmas.
- Fabric on call (on site cover) for the rest of the break.
- Jonathan:
- checked web service on csfmove02 and removed (non-working) configuration for lc experiment
- stopped export of /home/csf on csfnfs02
- retrieved new host certificates for afs1, afs2, afs3, nfs1 (RT# 54098/6/7/5)
- Nagios configuration updates
- worked on an active method for checking the databases monitored by peaceful
- shut down nincom
- updated RPMs on Nagios slave servers (before they were moved to the A5Lower machine room)
- James A:
- Snow and A/L.
- Kash:
Absences
- Ian: A/L 21-24/12
- Jonathan: A/L 22/12
Operational Issues and Incidents
Index | Description | Start | End | Severity | Affected VO(s) |
---|---|---|---|---|---|
| | EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the Castor EMC arrays | Tuesday 6/Oct am | UPS issues to be fixed | Catastrophic | All |
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB
Component | Description | Start | End | Affected VO(s) | Type |
---|---|---|---|---|---|
Development priorities
- All
- Martin:
- Minor procurements
- Migrating hardware out of A1 Upper
- GridPP4 work
- UPS tests
- Ian:
- test new Quattor config for vobox with Catalin
- Plan lcgbatch01 upgrade for next week
- Assist James T with disk server deployment with Quattor
- James T:
- Catch up
- Viglen 2008 disk progress check up.
- Post-Christmas security status assessment with James A.
- Quattorisation of disk servers.
- Jonathan:
- final checks of change to restrict SSH login on disk servers
- implement active checking of database status on peaceful
- complete work on installing Nagios slave server via Quattor
- Nagios configuration updates
- James A:
- Continue working to get SINDES operational before mid-February.
- Scan of security incidents with JIT.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Catch up with James T.
- Reporting faulty parts/drives found during the Christmas/New Year holidays.
- Arranging collection of faulty parts.
- Continue decommissioning old batch systems (R 26).
- Continue working on 2008 disk servers and worker nodes.
- Continue working on gdss70, 94, 127 and 282.
Absences
- Martin: A/L Friday pm
Fabric On-Call
- Ian all week
Advance Warning of Requirements and Blocking issues
- Unable to proceed with Atlas TAG migration to 64-bit because the arrays are being used for 3D systems while the EMC kit is flaky.
Services Issues
- Various requests for hardware.
- Working on hardware provision for Services team testbeds.