RAL Tier1 weekly operations Fabric 20100125
From GridPP Wiki
Summary of week gone
Developments
- All:
- Strategy away day
- Metrics discussion
- Martin:
- Minor procurements
- Ian:
- Updated Quattor server
- Prepared for CIP upgrade
- Plans for batch01 updates
- Plans and checks for WN kernel updates
- James T:
- "Mega intervention" preparation
- CRISTAL2 support group.
- Progress meeting with Viglen.
- Updates to some systems.
- Ongoing quattorisation of disk servers.
- Jonathan:
- removed old (redundant) directories from /home/tier1
- wrote and test-ran a script that performs automatic test restores of the home filesystem
- updated RPMs on AFS servers and rebooted for new kernel
- added production disk servers gdss380-417 to NIS
- created publicly accessible directory for local Nagios plugins and updated reference at www.gridpp.ac.uk to use this directory
- Nagios configuration updates
- updated NRPE configuration to add entry for check_log plugin
- built alert2rss RPM for SL5 and installed on nagger (with dependencies) to allow notification by RSS to work
- James A:
- Solved rrdtool dependency problems on thor, causing rrdtool to jump from 1.2.x to 1.3.x
- Fixed display issues in ARTEMIS caused by rrdtool version jump.
- Created diagnostics utility boot menu for Quattorised machines.
- Ported SINDES-CA to SL5 and co-hosted it on the Quattor server, rebuilt RPMs as RAL-specific version and issued first SINDES certificate!
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Decommissioning old batch systems (R 27).
- gdss380 given back to Castor.
- gdss148: added additional RAID card (upgraded BIOS and firmware).
- gdss66: replaced RAID card memory and given back to Castor.
- Memory replacement with Castor team. (Cheney)
- afs2 replaced drive.
- lcgfts01 replaced two drives.
- gdss160 and 143 given back to castor.
- Working on 2008 disk servers and worker nodes.
- Working on gdss70, 282 and 364.
Absences
- Jonathan 1 day Sick Leave
Operational Issues and Incidents
Index | Description | Start | End | Severity | Affected VO(s) |
---|---|---|---|---|---|
EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the Castor EMC arrays | Tuesday 6/Oct am | UPS issues to be fixed | Catastrophic | All |
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB
Component | Description | Start | End | Affected VO(s) | Type |
---|---|---|---|---|---|
Development priorities
- All
- Tier1 Strategy Day
- Batch and Storage interventions
- Martin:
- Minor procurements
- Network interventions
- Ian:
- CIP upgrades
- lcgbatch01 interventions
- WN kernel upgrades
- other kernel upgrades
- HEPiX virtualisation working group
- James T:
- Document procedures for "mega intervention".
- Mega intervention
- Ongoing quattorisation of disk servers.
- Work on Nagios checks
- Jonathan:
- implement cron job with checks to run daily test restores of home filesystem
- implement change to restrict SSH login on disk servers
- complete work on installing Nagios slave server via Quattor
- update RPMs on various servers
- Nagios configuration updates
- James A:
- Networking moves ahead of mega-intervention.
- Finish last of IPMI cabling in CASTOR racks.
- Finish SINDES-CA tweaks in Quattor.
- Bring up SINDES-secured vhost for Quattor profiles.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Continue memory replacement work with Cheney.
- gdss211: run 7-day acceptance test.
- Continue decommissioning old batch systems (R 27).
- Continue working on 2008 disk servers and worker nodes.
- Continue working on gdss70, 282 and 364.
Absences
- Martin A/L Thursday, Friday pm
- Kashif (Thursday - A/L)
Fabric On-Call
James T: primary on call, Monday to Sunday
Advanced Warning of Requirements and Blocking issues
- Unable to proceed with Atlas TAG migration to 64-bit because the arrays are being used for 3D systems while the EMC kit is flaky.
Services Issues
- Various requests for hardware.
- Working on hardware provision for Services team testbeds.