RAL Tier1 weekly operations Fabric 20090824
From GridPP Wiki
Latest revision as of 14:02, 24 August 2009 by Martin Bly
Summary of week gone
Developments
- All
- Martin:
- Procurements
- Installs for new Atlas3D databases
- Work on resilience for LFC/FTS databases
- Discussed future of gdss51 with Castor team
- Security incident overview
- Ian:
- Primary on call most days
- Work on Quattor filesystem configurations
- Deployment of production SL5 batch server
- Quattor FP7 bid planning
- James T:
- Updated disk and 3ware firmware on some of the new Viglen hardware and started testing
- Met with Viglen to get a status update from their side and to give an update from ours.
- Worked on Quattor gmond configuration
- Jonathan:
- sorted out atlasbackup problems on many systems
- replied to query sent wrongly to security@gridpp.rl.ac.uk (RT #49177)
- added sv-08 systems to NIS group csffarm
- switched AFS glite-sw directory to read-only
- obtained and installed renewed host certificate for csfnfs58 and gdss51
- restarted ntpd on gdss250/289/360
- created Tier1 team userids for new members of staff
- Nagios configuration updates
- issued new versions of RPMs tier1-nagios-plugins and tier1-nrpe-config
- manually edited nrpe.cfg on lcgbatch01 after installation by Quattor
- James A:
- Monday: Brought batch farm back up after air-con failure.
- Rest of week: Off Sick
- Kash:
- Drive replacement.
- Fixing broken WNs.
- gdss169: double disk failure (data lost). Replaced with new drives. (Intervention)
- gdss95, 151, 154, 168, 169, 214, 256, 280 and 288 have been given back to Castor.
- lcgpx0620 moved to the Test Area (R89) for further intervention.
- Working on 2008 disk servers and worker nodes.
- Working on gdss67, 73, 152, 169, 243 and 202.
Operational Issues and Incidents
Index | Description | Start | End | Severity | Affected VO(s) |
---|---|---|---|---|---|
| | Kernel security problem | 13/08/09 | Ongoing | Critical | All |
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB
Component | Description | Start | End | Affected VO(s) | Type |
---|---|---|---|---|---|
UIs, user accessed hosts | Reboots | As soon as new kernels are available | When they've all been updated | All | Down Time (short) |
Development priorities
- All
- Martin:
- Procurements
- Ian:
- Further Quattor FP7 bid planning
- Work with Michel Jouvin on QWG filesystem templates
- Preparations to deploy new hardware as SL5 batch workers
- James T:
- Talk at TDG, 11.00
- Restart acceptance testing
- Liaise with Viglen on disk issues. Next meeting 11.00 on Tuesday.
- Quattor
- Primary on call Mon - Thurs
- Jonathan:
- add public SSH keys for new members of staff
- Quattor work for Nagios
- Nagios configuration updates
- James A:
- Compile patched kernels for worker-nodes.
- Deploy additional SL5 64-bit capacity.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Add an additional RAID card to gdss87.
- Create graphs of drive failures (daily, weekly and monthly).
- Continuing work on 2008 disk servers and worker nodes.
- Working on gdss67, 73, 152, 169, 243 and 202.
Absences
- Jonathan, Kash:
- A/L Tuesday
Fabric On-Call
- Mon-Sun:
Advanced Warning of Requirements and Blocking issues
Services Issues
- RT #44835 – non-capacity HW for testing (Services)