RAL Tier1 weekly operations Fabric 20100614
From GridPP Wiki
Revision as of 14:58, 14 June 2010 by Kashif hafeez
Developments
- All:
- Martin:
- Completed Disk ITT document
- SSC/Oracle training
- HEPSysMan
- Site report for Tier-1
- Disk Failure stats and network configuration
- Ian:
- CRISTAL 2
- HEPSysMan and Quattor talk
- Planning for Facilities Castor instance
- Tim:
- T10K migration work
- Modify check_tape_pools scripts
- Push 9940 migration
- Monitor air quality situation
- Jonathan:
- updated RPMs on core servers and Nagios slave servers
- worked on AFS presentation for Fabric team
- Other People's Business: SSTD Groundstation Event
- issued new versions of the RPMs tier1-nagios-plugins and tier1-nrpe-config
- updated SVN source for RPM tier1-sudo-config (changes made directly on affected servers)
- 2 Nagios configuration updates
- HEPSysMan
- James A:
- James T:
- On leave all week
- Cheney:
- tried and failed to fix SLS availability stats
- fix DMF spool out of space
- fix Samba gone haywire
- investigate Hinode website security alerts
- added some servers into Nagios
- upgrade tsbn spreadsheet
- patching of Xen dev servers
- bring up dCache Xen for testing
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Decommissioning old batch systems (R 27).
- Testing Streamline 2009 disk servers.
- gdss67: filesystem problem; James A is investigating it. (Intervention)
- gdss469, 473, 474 and 476: replaced RAID controller cards with Matt V.
- gdss474: probably a faulty backplane; arranging an engineer.
- gdss207: tried installing it twice, but it didn't work.
- gdss390: needs fan and memory replaced.
- gdss420: given back to Castor.
- Streamline/Areca disk servers crashed due to a single faulty drive (ongoing).
Absences
- James T away all week
- Jonathan on partial retirement (not in on Monday and Friday)
Operational Issues and Incidents
Index | Description | Start | End | Severity | Affected VO(s) |
--- | --- | --- | --- | --- | --- |
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Entries of type Down, At Risk or Cancelled that are in, or planned to go into, GOCDB
Component | Description | Start | End | Affected VO(s) | Type |
--- | --- | --- | --- | --- | --- |
Development priorities
- All:
- Martin:
- Prepare CPU ITT document draft
- Move Repack servers
- Staff review
- Spend plans
- Ian:
- Virtualisation platform planning
- Facilities Castor planning for Quattor
- Away day planning
- Preparation for CERN visit next week
- Tim:
- DMF single copy for BADC backup
- Plan Facilities Castor install
- Cheney:
- Xen dCache testing
- tsbn upgrade to finish off
- Jonathan:
- James T:
- Catch up after leave
- Increase size of /var/lib/ganglia (RAM disk) on ganglia01
- SL09 testing problems
- Think about quattorising /etc/services changes for SL5 disk servers.
- James A:
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Continue decommissioning old batch systems (R 27).
Absences
- Jonathan on partial retirement (not in on Monday and Friday)
- Jonathan on leave 15th (Tuesday) and 17th (Thursday) June
- Martin A/L Friday PM.
- Kashif A/L Tuesday and Thursday.
Fabric On-Call
Ian: primary on-call, Monday-Sunday