RAL Tier1 weekly operations Fabric 20090928
Summary of week gone
Developments
- All
- Martin:
- Disk procurement ITT evaluation
- CPU procurement ITT drafting
- New RAC for LHCb3D, rebuilding (reformatting) Array2
- Ian:
- Work on Quest FP7 bid
- Updated glite update repositories and templates
- Fixing problem with batch server with James A
- Automation of repository updates and maintenance
- began work on procurements
- James T:
- Applied WAN tuning to gen disk servers
- Enabled extended file system attributes on gdss51 (for CASTOR checksumming)
- Fixed a problem with gmetric_ipmi where it was trying to create Ganglia RRDs with a '/' in the name.
- Drive replacements in Kash's absence.
- Viglen 2008 testing
- Swapped drives in a _good_ and a _bad_ host and restarted tests to reconfirm that the problem followed the disks
- Installed and tested new disk firmware on a rack of machines. Ongoing.
- Telecon with Viglen on Friday
- Jonathan:
- Administrator on Duty (Wednesday)
- deleted 9 databases named cms* from MySQL server
- filesystem quota updates
- with James A completed addition of new userid suprnsgm
- Nagios configuration updates
- worked on configuration of nagger (replacement Nagios master server)
- rebooted nagiosdb to use latest kernel and started gmond service
- updated RPM tier1-nrpe-config for CRL age check on SRM systems
- James A:
- Debugging problems with the scheduler.
- Extended QWG system schema to support scaling factors.
- Modularised service node templates.
- Created "sysadmin tools" template with useful tools for diagnosing node problems.
- Created repo definition for Pakiti 2.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- gdss110: battery replaced, server fixed and given back to castor.
- gdss127 has been given back to castor.
- gdss243 reinstalled and has been given back to castor.
- gdss443 and gdss448 swapped drives.
- Working on 2008 disk servers and worker nodes.
- Working on gdss67, 78, 85, 86, and 270.
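The gmetric_ipmi fix noted above (Ganglia RRDs being created with a '/' in the name) can be illustrated with a small sketch. This is not the actual gmetric_ipmi code: the function name `sanitise_metric_name` and the choice of '_' as the replacement character are assumptions made for illustration. The idea is only that a sensor label must be reduced to a single safe file name component before it is used as a metric name, because Ganglia stores each metric in an RRD file named after it.

```python
import re


def sanitise_metric_name(name: str, replacement: str = "_") -> str:
    """Return a metric name that is safe to use as one file name component.

    Hypothetical helper: runs of path separators ('/' or '\\') and
    whitespace are collapsed into a single replacement character.
    """
    cleaned = re.sub(r"[/\\\s]+", replacement, name.strip())
    # Fall back to a placeholder rather than emit an empty metric name.
    return cleaned or "unnamed"


if __name__ == "__main__":
    # Example IPMI-style sensor labels (made up for illustration).
    for label in ("Fan 1/2 RPM", "CPU0 Temp", "PS1/Status"):
        print(label, "->", sanitise_metric_name(label))
```

A sensor label such as "Fan 1/2 RPM" would otherwise be read as a path with a directory component when the RRD file is created.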
Operational Issues and Incidents
Index | Description | Start | End | Severity | Affected VO(s) |
---|---|---|---|---|---|
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB
Component | Description | Start | End | Affected VO(s) | Type |
---|---|---|---|---|---|
Development priorities
- All
- Martin:
- Disk procurement ITT evaluation
- CPU procurement ITT finalising
- Meeting with Seagate about disk problems
- Ian:
- Quest FP7 bid meeting
- Complete Automation of repository updates and maintenance
- Work on test batch server
- Further work on procurements
- James T:
- Quattor
- Disk Servers
- Ganglia
- Viglen testing
- Talk at GridPP storage meeting on our testing setup.
- Jonathan:
- complete configuration of nagger and migrate Nagios master service
- migrate home filesystem to nfs1.gridpp.rl.ac.uk (from csfnfs02.rl.ac.uk)
- work on moving Nagios slaves to new hosts managed by Quattor
- work on moving NIS servers to new hosts managed by Quattor
- Nagios configuration updates as required
- James A:
- Look at changes to Nagios database schema for v3.x
- Start looking at SINDES.
- Continue trying to diagnose CPU usage errors with a subset of nodes.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Continue working on 2008 disk servers and worker nodes.
- Continue working on gdss67, 78, 85, 86, and 270.
Absences
Fabric On-Call
- Mon-Sun: Ian (Primary On-Call)
Advanced Warning of Requirements and Blocking issues
- Kernel patching to fix CVE-2009-2692 and CVE-2009-2698 is required to be completed by the end of this week. For systems where this is not possible, please discuss the issues with me. Kernels required:
- SL3: 2.4.21-60
- SL(C)4: 2.6.9-89.0.9 or better
- SL(C)5: 2.6.18-128.7.1 or better
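A check of whether a host already runs a kernel at or above the required level could be sketched as follows. This is a minimal illustration, not a Tier1 tool: the function names `kernel_key` and `meets_minimum` are hypothetical, and comparing the leading numeric fields as integers is a simplification of full RPM version comparison. Suffixes such as "EL", "ELsmp" or "el5" are ignored.

```python
import re


def kernel_key(release: str) -> tuple:
    """Turn a kernel release string into a tuple of ints for comparison.

    Only the leading numeric fields are used; parsing stops at the first
    non-numeric field (for example 'EL', 'ELsmp' or 'el5').
    """
    key = []
    for field in re.split(r"[.-]", release):
        if not field.isdigit():
            break
        key.append(int(field))
    return tuple(key)


def meets_minimum(running: str, minimum: str) -> bool:
    """Return True if the running kernel is at least the minimum version."""
    return kernel_key(running) >= kernel_key(minimum)


if __name__ == "__main__":
    import platform

    # Minimum taken from the SL(C)4 entry in the list above.
    required = "2.6.9-89.0.9"
    running = platform.release()
    print(running, "meets", required, ":", meets_minimum(running, required))
```

The comparison is only meaningful between kernels of the same distribution series, so the minimum passed in should be the one listed for the release installed on the host.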
Services Issues
- RT# 44835 – non capacity HW for testing (Services)