RAL Tier1 weekly operations Fabric 20090914
Summary of week gone
Developments
- All
- Martin:
- A/L
- Ian:
- James T:
- Ongoing liaison with Viglen over disk issues
- Continued acceptance testing of Streamline nodes
- Kickstart for CASTOR on new Streamline nodes
- Quattor build of SL5/64-bit VO software server
- Jonathan:
- fixed /var filesystem full problem for system consoles
- built RPM tier1-batchinfo (v 1.0-12) without Requires: perl-DB-mysql
- investigated NIS problems
- created final backups for /pool and /pool/machines on csflnx266/270
- fixed problems for 2 users
- Nagios configuration updates
- James A:
- Focussed on Quattor as much as possible.
- Deployed all new worker nodes on SL5 64-bit with Quattor into the new batch farm.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- gdss169 fixed and given back to CASTOR.
- gdss67 moved into Test area for further intervention with the help of James A and T.
- gdss302 has been given back to CASTOR.
- lcgfts02 configured for hotswapping.
- gdss172: RAID card battery replaced (fixed); given back to CASTOR.
- gdss164: replaced two drives and added an additional RAID card with James T.
- Labelled (from front) Clustervision 2007 worker nodes.
- Created graphs of hardware/drive failures.
- Working on 2008 disk servers and worker nodes.
- Working on gdss67, 78, 85, 86, 105, 110 and 243.
Operational Issues and Incidents
Index | Description | Start | End | Severity | Affected VO(s) |
---|---|---|---|---|---|
1 | gdss67 RAID failure (RT#49145) | 18/08/09 | Ongoing | Critical | CMS |
2 | gdss164 RAID5 failure (RT#49192) | 19/08/09 | Ongoing | Critical | BaBar |
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB
Component | Description | Start | End | Affected VO(s) | Type |
---|---|---|---|---|---|
Development priorities
- All
- Martin:
- Catch up
- Procurements
- Next steps in database migration plan
- Ian:
- James T:
- Keep up to speed with Viglen on disk issues and run any tests they need
- SL5/64-bit software server build
- Quattor: disk servers and ganglia config files
- Tuning on gen disk servers at Brian's request
- Jonathan:
- create list of archived tapes on wiki
- work on migration of NIS servers to new hardware and new version of SL
- add update to /var/yp/nicknames for systems managed by Quattor
- work on plan to move home filesystem to new server
- James A:
- Migration of 75% of batch capacity to SL5.
- Migration of SL4 nodes to new scheduler.
- Provide assistance with WMS quattorisation.
- Troubleshoot deployment of software install boxes.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Continue working on 2008 disk servers and worker nodes.
- Continue working on gdss67, 78, 85, 86, 105, 110 and 243.
Absences
- Ian
- Monday
- James T:
- A/L Thu-Fri
Fabric On-Call
- Tue-Sun: Ian (Primary on-call)
Advanced Warning of Requirements and Blocking issues
Services Issues
- RT#44835 – non-capacity HW for testing (Services)