RAL Tier1 weekly operations Fabric 20090817
Summary of week gone
Developments
- All
- Aircon failure recovery (twice) - ongoing
- Martin:
- Procurements
- Further setup and configuration of SAN, arrays and systems for resilient LFC/FTS/3D Oracle services
- Ian:
- Michel Jouvin's Visit
- Finalised production Quattor server
- Introduced JFW to Quattor
- Committed first fixes back to Quattor trunk
- Work on FP7 bid with Morgan Stanley
- James T:
- Checking the disk servers.
- Jonathan:
- Quattor tutorial and initiation
- obtained and installed renewed host certificates for yumit and pat
- updated Fabric alarm response spreadsheet for loggers and help desk
- updated TierOneOncallProcess page on wiki to correct text of test pager alarm
- corrected root password for several disk servers
- disabled local access to Tier1 systems for non-Tier1 users (security issue)
- fixed mail quota for user
- updated Nagios configuration on nincom and netnag
- James A:
- Implemented new thermal shutdown systems.
- Deployed two new ARTEMIS instances.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Working on disk servers and worker nodes after aircon problem.
- Working on gdss73, 95, 152, 243 and 256.
Operational Issues and Incidents
Index | Description | Start | End | Severity | Affected VO(s) |
---|---|---|---|---|---|
 | Aircon failure requiring emergency stop of Tier1 | 14:00 Monday 10th | 19:00 Monday 10th | Critical | All |
 | Aircon failure requiring emergency stop of Tier1 | 00:30 Wednesday 12th | Ongoing | Critical | All |
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB
Component | Description | Start | End | Affected VO(s) | Type |
---|---|---|---|---|---|
Development priorities
- All
- Ongoing recovery of Tier1 service after aircon failures
- Martin:
- Procurements
- Meeting with Arista (switch vendor)
- Ian:
- Work with Derek on Quattorising batch01
- Quattor deployment - further work on RAL specific components
- Further meetings re Quattor FP7 bid
- Primary on call
- James T:
- Quattor templates
- Talk for TDG on Monday 24th.
- Jonathan:
- reboot AFS servers
- update Nagios configurations as required
- release updated RPMs tier1-nagios-plugins and tier1-nrpe-config
- sort out atlasbackups after robot problems
- Quattor work for Nagios
- James A:
- Re-connect and restart batch farm.
- Focus on Quattor as much as possible.
- Implement further enhancements to thermal shutdown systems.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Continuing work on disk servers and worker nodes.
- Continuing work on gdss73, 95, 152, 243 and 256.
Absences
- All:
- Work From Home Wednesday (Except Kash, Martin)
- Jonathan:
- A/L Tuesday
- Ian:
- A/L Friday
Fabric On-Call
- Mon-Sun:
Advance Warning of Requirements and Blocking Issues
Services Issues
- RT# 44835 – non capacity HW for testing (Services)