RAL Tier1 weekly operations Fabric 20090907
From GridPP Wiki
Summary of week gone
Developments
- All
- Martin:
- Kernel patching
- Procurement issues
- Ian:
- Annual Leave
- James T:
- Ongoing liaison with Viglen over disk issues
- Continued acceptance testing of Streamline nodes
- Kickstart for CASTOR on new Streamline nodes
- Began to put new machines into Overwatch
- Jonathan:
- updated RPMs on several servers and rebooted for new kernel
- corrected /var partition full problem on system consoles
- updated NIS netgroup to add new batch workers and new CE
- prepared local sendmail configuration RPM for SL5 and installed on touch
- cleared up lots of atlasbackup problems (failed to purge old files)
- restarted syslog daemon on a number of servers that were still logging to the old loggers
- Nagios configuration updates
- issued updated version of RPM tier1-nagios-plugins (version 2.0-54)
- 2 days laboratory holidays
- James A:
- Spent most of the time working on SL5 WN configuration in Quattor.
- Kash:
- Created spreadsheet of serial numbers etc. of drives that have failed in new Viglen kit.
- Drive replacement.
- Fixing broken WNs.
- gdss169 completed verify. (Intervention)
- gdss332: replaced 4 x 2 GB memory and given back to CASTOR.
- lcg0643 and lcg0658: replaced faulty memory (1 x 1 GB) with memory borrowed from lcg0651 and moved in HPD area. (Fixed)
- gdss81 and 151 given back to CASTOR.
- Working on 2008 disk servers and worker nodes.
- Working on gdss67, 169 and 243.
Operational Issues and Incidents
Index | Description | Start | End | Severity | Affected VO(s) |
---|---|---|---|---|---|
1 | gdss67 RAID failure (RT#49145) | 18/08/09 | Ongoing | Critical | CMS |
2 | gdss164 RAID5 failure (RT#49192) | 19/08/09 | Ongoing | Critical | BaBar |
3 | Kernel security problem | 13/08/09 | Ongoing | Critical | All |
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB
Component | Description | Start | End | Affected VO(s) | Type |
---|---|---|---|---|---|
ogma | Migration of Oracle database to a 64-bit system. A 3 hour outage has been declared, although it is expected that the database will only be unavailable for a short while at some point during this interval. | 2009-09-08, 09:00:00 | 2009-09-08, 12:00:00 | ?? | Down |
Development priorities
- All
- Martin:
- Annual Leave
- Ian:
- Quattor
- James T:
- Keep up to speed with Viglen on disk issues and run any tests they need
- Restoration of gdss164
- Admin on Duty on Wednesday
- SL5 64-bit software server build
- Quattor: disk servers and ganglia config files
- Tuning on gen disk servers at Brian's request
- Annual leave on Friday
- Jonathan:
- migrate NIS servers to new hardware and operating system
- migrate Nagios slave servers to new hardware
- Nagios configuration updates
- plan for migration of home filesystem server to new hardware
- plan for migration of mail server to new hardware
- James A:
- Restore two ARTEMIS units in Atlas, moving one to CICT subnet, one to T1 subnet.
- Network cabling for new NC twins.
- Write up power control project plan.
- Give assistance with CE08 when needed.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Create graphs of drive failures (daily, weekly and monthly).
- Continue working on 2008 disk servers and worker nodes.
- Continue working on gdss67, 169 and 243.
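As a starting point for the drive failure graphs listed above, failures can be counted per day, per ISO week and per month before plotting. This is a minimal sketch only: the function name `bucket_failures` and the sample dates are hypothetical, and it assumes the failure dates have already been extracted from the drive spreadsheet as Python `date` objects.

```python
from collections import Counter
from datetime import date


def bucket_failures(failure_dates):
    """Count drive failures per day, per ISO week and per month.

    failure_dates: iterable of datetime.date objects (assumed input format).
    Returns three Counters keyed by 'YYYY-MM-DD', 'YYYY-Www' and 'YYYY-MM'.
    """
    daily = Counter()
    weekly = Counter()
    monthly = Counter()
    for d in failure_dates:
        iso = d.isocalendar()  # (ISO year, ISO week number, ISO weekday)
        daily[d.isoformat()] += 1
        weekly["%d-W%02d" % (iso[0], iso[1])] += 1
        monthly[d.strftime("%Y-%m")] += 1
    return daily, weekly, monthly


# Hypothetical failure dates, for illustration only (not real data).
sample = [date(2009, 8, 31), date(2009, 9, 1), date(2009, 9, 1), date(2009, 9, 7)]
daily, weekly, monthly = bucket_failures(sample)

for label, counts in (("Daily", daily), ("Weekly", weekly), ("Monthly", monthly)):
    print(label)
    for key in sorted(counts):
        print("  %s: %d" % (key, counts[key]))
```

The three counters can then be passed to whichever plotting tool is preferred; ISO weeks are used so that each weekly bucket runs Monday to Sunday.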
Absences
- Martin:
- A/L all week
- James T:
- A/L Friday
- A/L 17-18 September
Fabric On-Call
- Mon-Thurs:
- James (Primary Mon - Wed)
Advanced Warning of Requirements and Blocking issues
Services Issues
- RT#44835 – non-capacity HW for testing (Services)