RAL Tier1 weekly operations Fabric 20100531
From GridPP Wiki
Developments
- All:
- Martin:
- Ian:
- Virtualisation platform research
- Managing Performance course
- Helped Tiju set up the EGEE Nagios server with Quattor
- Tim:
- Jonathan:
- Reset AFS password for a user
- Responded to a request from the security team regarding calls from AFS servers to a commercial system
- One Nagios configuration update
- Corrected a bug in the check_spma.sh plugin
- Issued new versions of the RPMs tier1-nagios-plugins, tier1-sudo-config and tier1-nrpe-config
- Worked on Job Plan for 2010-2011
- James A:
- James T:
- Wrote benchmarking document for tender process and ran benchmarks on Viglen 2009 kit to get performance figures.
- Streamline 2009 testing
- Completed job plan
- Deployment allocations
- Started investigating mail forwarding on Quattor systems (taken up by Ian)
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Decommissioning old batch systems (R 27).
- Daily hardware failure status reports on Streamline 2009 disk servers to James T.
- gdss380 swapped drives with gdss368 (Intervention).
- gdss423 faulty backplane found and replaced by Viglen engineer (Fixed).
- gdss85 double disk failure (Intervention).
- gdss321 given back to production.
- gdss332 IPMI card probably faulty.
- gdss153 and gdss165 given back to production.
- Reported that Streamline/Areca disk servers crashed due to a single faulty drive.
- gdss272 three faulty drives (all three replaced).
- gdss213 two faulty drives (back to production).
- Job plan with MJB.
Absences
- Jonathan on partial retirement (not in on Mondays and Fridays)
- Cheney on leave all week
- Jonathan on leave, Tuesday 25th
Operational Issues and Incidents
| Index | Description | Start | End | Severity | Affected VO(s) |
|---|---|---|---|---|---|
Summary of plans for the week ahead
Scheduled and Cancelled Down Times
Entries of type Down, At Risk or Cancelled that are in, or planned to go into, GOCDB
| Component | Description | Start | End | Affected VO(s) | Type |
|---|---|---|---|---|---|
Development priorities
- All:
- Martin:
- Ian:
- Virtualisation platform development
- CRISTAL 2 preparation
- Reviewing site-wide Quattor configurations
- Complete job plan
- Tim:
- Cheney:
- Jonathan:
- Start regular check restores of the home filesystem
- Complete Job Plan
- Continue investigating setting up an AFS directory as the ATLAS software server
- Nagios configuration updates
- James T:
- Streamline 2009 testing
- Security team work
- Adaptec and LSI support for Nagios and the verify system
- CRISTAL feedback for Ian
- Job plan into SSC
- James A:
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Continue decommissioning old batch systems (R 27).
- Daily hardware failure status reports on Streamline 2009 disk servers to James T.
- gdss423: move back to the machine room.
- gdss67: replace memory.
Absences
- Jonathan on partial retirement (not in on Mondays and Fridays)
- All on Bank holiday, Monday 31st May
- Tim on leave 1st-4th June
- James T on leave 7th to 11th June
Fabric On-Call
Ian: Primary, Tuesday to Sunday