RAL Tier1 weekly operations Fabric 20091207
Summary of week gone
Developments
- All:
- Martin:
- Ian:
- Finalised new flex license servers for LSF
- Further Quattor tutorial for Cheney & Matt
- Provided physical hardware for CIP
- Went through various configuration options
- James T:
- Quattorisation of disk servers
- Primary on call Mon - Thurs
- Viglen disk swap out support
- Post-mortem on gdss138
- Jonathan:
- maintained NIS netgroup
- corrected atlasbackup problems for a few hosts
- Administrator on Duty (Wednesday)
- unmounted /home/csf from lcg0617/618
- Nagios configuration updates
- system tuning of nagger to try to reduce scheduling queue
- installed the mrtg RPM on nagger and added configuration to collect performance statistics from Nagios (currently at http://nagger.gridpp.rl.ac.uk/mrtg/nagios-[a-n].html)
- James A:
- Achieved a working preliminary SINDES server.
- Upgraded Cacti on thor.gridpp.rl.ac.uk
- Kash:
- Drive replacement.
- Fixing broken WNs.
- gdss138 double disk failure. (Intervention)
- gdss149, 162, 163 and 367 given back to castor.
- gdss77 kernel panic, possibly faulty memory. (Intervention)
- gdss139 given back to castor.
- Moved 3 batch systems from R27 to HPD room (CV 2005 rack) with MJB.
- Working on 2008 disk servers and worker nodes.
- Working on gdss77, 138 and 282.
Absences
- Jonathan: S/L Monday
- Jonathan: A/L Thursday am
Operational Issues and Incidents
Index | Description | Start | End | Severity | Affected VO(s) |
---|---|---|---|---|---|
EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the Castor EMC arrays | Tuesday 6/Oct am | UPS issues to be fixed | Catastrophic | All |
gdss138 double disk failure: two drives failed in quick succession (within 30 minutes) | Monday 0530-0600 | Ongoing | Severe | LHCb DST data; data loss confirmed |
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB
Component | Description | Start | End | Affected VO(s) | Type |
---|---|---|---|---|---|
Cacti (http://cacti.gridpp.rl.ac.uk) | Upgrade Cacti software. Subject to team manager's approval of plan. | Tuesday 2009-12-08 13:00 | Tuesday 2009-12-08 17:00 | none | At Risk |
Development priorities
- All
- Work on evacuating A1 Upper (Castor LSF/FlexLM triplet)
- Martin:
- Ian:
- Reconfigure physical CIP again
- Implement second CIP with Quattor (T2K)
- Start work on Quattor-managed gLite 3.2 vobox with Catalin
- Assist with new disk servers as required
- Incorporation of latest QWG template updates
- James T:
- Quattorisation of disk servers
- Remove nincom as Ganglia data source for Services_Monitoring
- Script to compare Overwatch with real CASTOR status
- TOASTER preparation
- Jonathan:
- Quattor implementation for Nagios slave
- security updates to disk servers to prevent general user logins
- Nagios configuration updates
- James A:
- Continue with SINDES.
- Upgrade Cacti on cacti.gridpp.rl.ac.uk, install plugins and apply internal patches.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- gdss339 kernel panic. (Intervention)
- gdss138 double disk failure. (Intervention)
- Decommissioning old batch systems with Production Team.
- Continuing work on 2008 disk servers and worker nodes.
- Continuing work on gdss77, 138, 282 and 339.
Absences
- Kashif: A/L Wednesday
Fabric On-Call
- Mon-Sun: Ian Primary on call
Advance Warning of Requirements and Blocking issues
- Unable to proceed with Atlas TAG migration to 64-bit because the arrays are being used for 3D systems while the EMC kit is flaky.
Services Issues
- Various requests for hardware.
- Working on various hardware requests for Services team.