RAL Tier1 weekly operations Fabric 20100201
Summary of week gone
Developments
- All:
- Strategy meeting
- Martin:
- Minor procurements
- Ian:
- Upgraded CIP filesystem layouts
- Upgraded batch server binaries
- Upgraded kernels on SL5 WNs
- Planning for handover of fabric management for Castor systems
- James T:
- "Mega intervention" preparation/documentation
- Mega Intervention
- First Viglen '08 disk servers out of testing.
- Ongoing quattorisation of disk servers.
- Primary on call
- Jonathan:
- added new NIS groups and create new pool accounts
- checked SSH problem on lcgdb05; removed special userids oracle, lsfadmin, stage and corresponding groups oinstall, lsfadmin, st from NIS (NIS entries sometimes take precedence over local entries whatever the setting of /etc/nsswitch.conf; this can cause system problems)
- updated RPMs on core systems and rebooted where required
- reconfigured and restarted ntpd on lcgvo0425 (updating the ntp RPM can sometimes lose the local NTP configuration)
- Nagios configuration updates
- reinstalled and reconfigured nagios04 after disk replacement
- James A:
- Networking preparations ahead of mega-intervention.
- Added snapshotting feature to cacti weather-map.
- Finished cabling IPMI ports in castor racks B&E.
- Updated certificate on t1pg0373.
- Fixed bug in check_spma for handling rotated logs.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Decommissioning old batch systems (R27).
- gdss211 running 7 days acceptance test.
- gdss70 given back to castor. (Fixed)
- gdss77 no display. (Found faulty memory) - Intervention
- gdss87 given back to castor for testing.
- nagios04 replaced drive.
- gdss170 given back to castor.
- Moved switches and cables from R27 with James A.
- Working on 2008 disk servers and worker nodes.
- Working on gdss77, 282 and 364.
Absences
- Jonathan (1/2 day, domestic reasons)
Operational Issues and Incidents
Index | Description | Start | End | Severity | Affected VO(s) |
---|---|---|---|---|---|
 | EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the Castor EMC arrays | Tuesday 6/Oct am | UPS issues to be fixed | Catastrophic | All |
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB
Component | Description | Start | End | Affected VO(s) | Type |
---|---|---|---|---|---|
Development priorities
- All
- Martin:
- Minor procurements
- Ian:
- Upgrading and reconfiguring CIPs
- Work with Catalin on Quattorising further grid services nodes
- Quattor documentation
- (Re-)Instituting steering group for Fabric automation project
- Researching Virtualisation platform options
- James T:
- Ongoing quattorisation of disk servers.
- Install first Viglen '08 disk servers.
- Writing nagios checks
- Apply WAN tuning
- Jonathan:
- implement cron job with checks to run daily test restores of home filesystem
- complete work on installing Nagios slave server via Quattor
- Nagios configuration updates
- James A:
- Two days of SINDES integration.
- Connect uplinks to CASTOR IPMI switches.
- Ensure IPMI on CASTOR boxes comes up.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- lcglb01 drive replacement. (Hot swap)
- Continuing work (memory replacement) with Cheney.
- Continuing to decommission old batch systems (R27).
- Continuing work on 2008 disk servers and worker nodes.
- Continuing work on gdss77, 282 and 364.
Absences
- Jonathan - as from week beginning 8th February, changing work pattern to 3 days per week (normally Tuesday, Wednesday, Thursday)
Fabric On-Call
Ian Primary on call
Advanced Warning of Requirements and Blocking issues
- Unable to proceed with Atlas TAG migration to 64-bit because the arrays are being used for 3D systems while the EMC kit is flaky.
Services Issues
- Various requests for hardware.
- Working on hardware provision for Services team testbeds.