RAL Tier1 weekly operations Fabric 20100726
Developments
- All:
- Martin:
- Abortive multipath update and consequential sortout
- Disk ITT evaluation
- Yet more spend planning
- Ian:
- Got first Hyper-V VMs working
- Provided initial VMs for GST testbed
- Two days a/l
- Tim:
- Tape library microcode updated
- Tweaks to repack system (different scheduling policies, extra disks)
- DMF data removal for some second copies
- CMS non-migration investigation
- Jonathan:
- arranged disposal of redundant servers
- wrote archive tapes for several old experiment filesystems
- created new pool accounts for CMS and then fixed related NIS problem
- assisted user with AFS and reset his password
- removed Tier1 userid
- reset password for Tier1 user
- 1 Nagios update
- James A:
- Focussing on new Quattor server
- General assistance where needed
- Continued learning about BIND and DNS.
- James T:
- Configured rsyslog to log to central loggers
- Re-cabled the Streamline 2009 disk servers with James A (thanks to James A)
- Started acceptance tests on Streamline 2009 kit
- Created a CASTOR 2.1.9 disk server build in Quattor
- Helped Kash with the "shrinking" of pre-prod disk servers
- Read through some of the tender responses
- Two disk servers for Repack
- Cheney:
- swapped in replacement robot controller
- rebooted disk arrays on preprod
- reset preprod disk arrays after hard lockup
- tweaked tsbn to pick up data from changed tape servers
- wrote script to automate restore of database backups for testing
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Decommissioning old batch systems. (R 27)
- gdss78 passed acceptance test and given back to Castor team.
- 10 drives in Streamline 2009 (Test) disk servers replaced by Gareth (Streamline).
- gdss207 crashed again. (Intervention)
- gdss486 received back from Streamline. (Testing)
- gdss105 and gdss106 assigned to Tim for testing.
- gdss187 fsprobe errors. (Intervention)
- Hardware failure stats/graphs.
- Adaptec cards in gdss536 and gdss537 replaced with LSI cards. (Gareth, Streamline)
- Preparing Viglen 2006 disk servers with new RAID configuration for Castor Preprod.
- Streamline/Areca disk servers crashed due to single faulty drive. (ongoing)
Absences
- Jonathan on partial retirement (not in on Monday and Friday)
Operational Issues and Incidents
Index | Description | Start | End | Severity | Affected VO(s) |
---|---|---|---|---|---|
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB
Component | Description | Start | End | Affected VO(s) | Type |
---|---|---|---|---|---|
Development priorities
- All:
- Martin:
- Disk ITT evaluation
- Ian:
- Further development of services virtualisation testbed
- Support for Castor Quattor configuration
- Planning cernvm-fs testing
- Tim:
- Facilities Castor planning
- ADS futures planning
- Cheney:
- set up Quattor-managed Castor core servers
- Jonathan:
- On leave Tuesday - Thursday, so out all week
- James T:
- Away on Scout Camp all week
- James A:
- Finalise and test new Quattor server
- Planning of CVMFS load testing
- Learning about errata updates in Quattor
- Continue learning about BIND and DNS.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- gdss207: wrong RAID card received; reported again.
- Run 7-day acceptance test on gdss380.
- Look after Streamline 2009 disk servers testing in absence of James T.
- Continue decommissioning old batch systems. (R 27)
Absences
- Jonathan on partial retirement (not in on Monday and Friday)
- Jonathan on leave Tuesday - Thursday
- James T on special leave all week
- Kashif on annual leave on Tuesday.
Fabric On-Call
- Ian primary on-call all week