RAL Tier1 weekly operations Fabric 20090902
From GridPP Wiki
Latest revision as of 10:37, 4 September 2009
Summary of week gone
Developments
- All
- Martin:
- Procurements
- Security incident overview, kernel updates
- Hardware asset tagging
- Ian:
- Quattor work for SL5 WNs
- James T:
- Continued testing of new Viglen machines.
- Met with Viglen to exchange status updates; progress is being made.
- Worked on Quattor disk server configuration.
- Fabric on call over the weekend.
- Jonathan:
- Updated SSH keys for the root userid on farm nodes
- Updated /etc/exports on nfs1 to remove no_root_squash for /home/farm (new home filesystem)
- Rebooted several servers after the kernel security update
- Investigated a problem with the rpm command on lcgmon01, with Richard
- Found problems with the Quattor installation of lcgbatch01
- Updated the NIS netgroup to remove old batch workers and add sv08 hosts to t1auto
- Fixed a user's mail quota
- Nagios configuration updates
- James A:
- Focussed on Quattor as much as possible.
- Built support infrastructure for SL5 x86_64 Quattorised WN deployment.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- gdss169: array is reinitializing; will verify once it completes. (Intervention)
- gdss72 and gdss126 have been returned to CASTOR.
- lcgpx0620: replaced memory and moved back into the UPS room. (Fixed)
- gdss87: added an additional RAID card. (Failed to update firmware)
- Working on 2008 disk servers and worker nodes.
- Working on gdss67, 73, 169, and 243.
Operational Issues and Incidents
Index | Description | Start | End | Severity | Affected VO(s) |
---|---|---|---|---|---|
 | Kernel security problem | 13/08/09 | Ongoing | Critical | All |
Summary of plans for week ahead
Note: the working week is three days (Weds-Fri).
Scheduled and Cancelled Down Times
Entries of type Down, At Risk or Cancelled that are in, or planned to go into, GOCDB
Component | Description | Start | End | Affected VO(s) | Type |
---|---|---|---|---|---|
UIs, user accessed hosts | Reboots | As soon as new kernels are available | When they've all been updated | All | Down Time (short) |
Development priorities
- All
- Martin:
- Kernel patching
- Procurement issues
- Ian:
- A/L
- James T:
- Liaise with Viglen on disk issues. Next meeting 13.30 on Thursday.
- Quattor work on disk servers and ganglia gmond config.
- Add new disk servers to Overwatch.
- Acceptance tests of 2008 Streamline machines.
- Jonathan:
- Reboot SL3 servers after installing the new kernel
- Check the Quattor installation for new batch workers and correct problems
- Restart work on the new Nagios server
- Restart work on the new home filesystem server
- Nagios configuration updates as required
- James A:
- Lead WN deployment.
- Ensure backups are working correctly for quattor01
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Create graphs of drive failures (daily, weekly and monthly).
- Continue working on 2008 disk servers and worker nodes.
- Working on gdss67, 152, 169, and 243.
Absences
- Martin:
- A/L 7-11 Sept
- Ian:
- A/L Weds-Fri
Fabric On-Call
- Mon-Sun:
Advance Warning of Requirements and Blocking Issues
Services Issues
- RT#44835 – non-capacity HW for testing (Services)