RAL Tier1 weekly operations Fabric 20100301
From GridPP Wiki
Summary of week gone
Developments
- All:
- Martin:
- Minor procurements
- Castor databases disaster planning
- New hardware unpacking
- Ian:
- Worked on Quattor core castor server
- Adapted Quattor disk server to use core quattor server base
- Castor handover planning
- James T:
- eScience CA RA course
- Quattor disk servers (problem eventually fixed by Ian)
- Installed all 60 Viglen 08 disk servers with Quattor
- Admin on Duty (2 days)
- 5 x disk servers for deployment for ATLAS
- Created CASTOR_PreProd ganglia instance
- Fix for vdt_globus_data_server and GridFTP external kickstart install problems
- Jonathan:
- Sorted out atlasbackup problems for several nodes
- Rebooted lcgui0358 (user front-end) to solve mount problem
- Replaced failed drive on afs3 and despatched it to DNUK
- Nagios configuration updates
- James A:
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Decommissioning old batch systems (R 27).
- gdss211 reinstalling.
- gdss295 given back to castor.
- gdss364 replaced 16-port RAID card (borrowed from gdss338).
- lcgce07/nc21 replaced system with spare twin system. (Streamline)
- gdss128 and gdss403 given back to castor.
- Castor servers (cdbc13/cdbd03) moved into test area. (Intervention)
- Moved systems/parts to Atlas A5 lower machine room.
- gdss160 given back to castor.
- Working on gdss211 and 295.
Absences
- Jonathan on partial retirement
- medical appointment/annual leave Tuesday
- sick leave Thursday
Operational Issues and Incidents
Index | Description | Start | End | Severity | Affected VO(s) |
---|---|---|---|---|---|
 | EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the Castor EMC arrays | Tuesday 6/Oct am | UPS issues to be fixed | Catastrophic | All |
 | gdss364 disk controller sick | Friday ~20:00 | Ongoing | Severe | CMS (FarmRead) |
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB
Component | Description | Start | End | Affected VO(s) | Type |
---|---|---|---|---|---|
Development priorities
- All
- Martin:
- Castor databases planning
- Decommissioning A1 Upper
- Moving development network hardware
- Ian:
- Second installation server
- Deployment of Sindes with James A
- Further work on Quattorisation of castor servers with Chris
- James T:
- Checking over of "lean" disk server with Chris and Ian
- Tier1 Tour preparation
- Deploy drained Viglen 06 to pre-prod (with re-configured arrays)
- Helpdesk ticket blitz
- Jonathan:
- Change controls for replacement Nagios slave servers and decommissioned web site
- Implement cron job with checks to run daily test restores of home filesystem
- Nagios configuration updates
- James A:
- Kash:
- Drive replacement.
- Fixing broken WNs.
- gdss347: replace 4 x 2 GB memory.
- Clear Atlas A5 upper test area.
- Continue decommissioning old batch systems (R 27).
- Continue working on gdss211 and gdss208.
Absences
- Ian out on Thursday
Fabric On-Call
Ian: Primary on call Mon-Sun
Advanced Warning of Requirements and Blocking issues
- Unable to proceed with Atlas TAG migration to 64-bit because the arrays are being used for 3D systems while the EMC kit is flaky.
- Update (2010/03/01): new hardware is now on site and ready to be installed in the rack in R26 A5L.
Services Issues
- Various requests for hardware.
- Working on hardware provision for Services team testbeds.