RAL Tier1 weekly operations Fabric 20100111
Summary of week gone
Developments
- All:
- Martin:
- Work on IPMI networking
- Procurements
- Networking plans for capacity procurements
- UPS bypass test and related activities
- Ian:
- Catching up
- Worked on Quattor managed vobox config with Catalin
- Planning for interventions later in January
- Further updates to CIP configs in Quattor
- James T:
- Christmas catch up
- Viglen 2008 disk progress check up.
- All drives should be swapped out
- Some servers were tested over Christmas; the rest started last week.
- Quattorisation of disk servers
- Jonathan:
- compiled list of Fabric core services and servers for GRIDPP4 submission
- cleared space in root directory on lcgsql0363 by removing many kernel RPMs
- fixed problem with atlasbackup on lcgsql0363, lcgdb06, lcgvo0598
- restarted enigma using new version of password cracker
- fixed problem with exclude option of atlasbackup on lcgwms01
- Nagios configuration updates
- corrected problem with Quattor configuration of Nagios slaves
- worked on Nagios active check of database status files on peaceful (to replace existing passive checks)
- 1 day at home due to weather
- James A:
- Working from home due to weather:
- Solved ARTEMIS CA issues and got mutual authentication and (internal) certificate issuing working.
- Corrected job-slot numbers on SL5-SL06 WNs which were causing low job efficiencies. RT#54999
- Created read-only overwatch account for Tier1 dashboard. RT#54960
- Created database and handlers on thor for Building Management System SNMP trap logging; hooks for Nagios are possible.
- Started sorting and checking user contact data for new database. RT#29217
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Decommissioning old batch systems (R 27).
- gdss79 given back to castor. (Replaced 3 faulty drives)
- gdss94 and 127 given back to castor.
- Arranged collection of 30 boxes (faulty parts from Christmas/New Year) with Viglen.
- lcgdb14: re-arranged engineer visit from Streamline.
- gdss169 given back to castor.
- Working on 2008 disk servers and worker nodes.
- Working on gdss70, 282 and 364.
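The Nagios active check of database status files mentioned under Jonathan's items could look roughly like the sketch below. It is illustrative only, not the check deployed on peaceful: the status-file format (first line reads `OK` when healthy), the age thresholds, and the names `check_status_file` and `main` are all assumptions. Only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the standard Nagios plugin convention.

```python
import os
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_status_file(path, warn_age=900, crit_age=1800, now=None):
    """Return (exit_code, message) for a hypothetical database status file.

    Assumed format: the first line is the word "OK" when the database is
    healthy; any other content is treated as a failure. The age thresholds
    (in seconds) are placeholders, not site policy.
    """
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
        with open(path) as handle:
            first_line = handle.readline().strip()
    except OSError as err:
        return UNKNOWN, f"UNKNOWN - cannot read {path}: {err}"

    if first_line != "OK":
        return CRITICAL, f"CRITICAL - status is '{first_line}'"
    if age >= crit_age:
        return CRITICAL, f"CRITICAL - status file is {int(age)}s old"
    if age >= warn_age:
        return WARNING, f"WARNING - status file is {int(age)}s old"
    return OK, f"OK - status file is {int(age)}s old"


def main(argv):
    """Print the one-line plugin output and return the exit code."""
    if len(argv) != 2:
        print("UNKNOWN - usage: check_db_status.py STATUS_FILE")
        return UNKNOWN
    code, message = check_status_file(argv[1])
    print(message)
    return code
```

A wrapper script would call `sys.exit(main(sys.argv))` so that Nagios sees the exit code. Unlike a passive check, this is run on demand by Nagios, so a status file that has stopped being updated shows up as WARNING or CRITICAL rather than silently going quiet.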
Absences
- Jonathan (1 day, snow)
Operational Issues and Incidents
Index | Description | Start | End | Severity | Affected VO(s) |
---|---|---|---|---|---|
 | EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the Castor EMC arrays | Tuesday 6/Oct am | UPS issues to be fixed | Catastrophic | All |
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB
Component | Description | Start | End | Affected VO(s) | Type |
---|---|---|---|---|---|
Development priorities
- All
- Martin:
- Minor procurements
- complete IPMI networking
- planning and change control activities for pre-data-taking period
- Ian:
- Further work on vobox
- glite and repository update configuration in Quattor
- Work on Virtualisation platform for Tier1
- James T:
- Viglen 2008 disk progress check up.
- Post-Christmas security status assessment with James A. (not done last week due to James not getting in).
- Quattorisation of disk servers.
- Fix various ganglia problems
- Work on deploying WAN tuning, for Brian, where it has not yet been applied.
- Jonathan:
- final checks of change to restrict SSH login on disk servers
- implement active checking of database status on peaceful
- complete work on installing Nagios slave server via Quattor
- Nagios configuration updates
- James A:
- Try to get SINDES information dissemination functional.
- Continue work on user contact database. RT#29217
- Attempt security status assessment with James T.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- lcgdb14: replace faulty part/memory with engineer.
- Continue decommissioning old batch systems (R 27).
- Continue working on 2008 disk servers and worker nodes.
- Continue working on gdss70, 282 and 364.
Absences
- Martin: A/L Fri pm
Fabric On-Call
- Ian primary on call all week.
Advanced Warning of Requirements and Blocking issues
- Unable to proceed with Atlas TAG migration to 64-bit because the arrays are being used for 3D systems while the EMC kit is flaky.
Services Issues
- Various requests for hardware.
- Working on hardware provision for Services team testbeds.