RAL Tier1 weekly operations Fabric 20100118
From GridPP Wiki
Latest revision as of 14:30, 18 January 2010, by James adams
Summary of week gone
Developments
- All:
- Martin:
- Procurements
- GridPP4
- Networking plans for capacity procurements
- Intervention on lcgdb14
- Ian:
- Work on Quattor config of vobox
- Planning for batch server upgrade and other interventions
- Planning update of Quattor server
- James T:
- Fixed two problems with Ganglia
- The data sources for the Miscellaneous cluster had been decommissioned.
- Workers_SL5 graphs were fluctuating wildly due to wrongly configured Workers_SL4.
- Quattorisation of disk servers
- fsprobe added
- puppet added
- Work on processing errata
- Various system updates
- Dry run of procedures prior to "Mega Intervention".
- Progress meeting with Viglen on disk testing. All machines are now in testing, due to complete mid-February.
- Jonathan:
- updated CSFadduser script (in /usr/local/sbin on wyatt) for new Tier1 home directory and added new userids for Castor evaluation
- corrected backup problems on several nodes
- followed up chkrootkit problem on afs2
- updated RPMs on several nodes
- investigated Callout problems on several nodes
- Nagios configuration updates
- 2 days out (home emergency)
- James A:
- Finalised plan for SINDES implementation.
- Worked on new user contact database.
- Moved castoradm2 and castoradm3 from A1 upper to A5 lower.
- Began the last of the CASTOR rack IPMI cabling.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Decommissioning old batch systems (R 27).
- lcgdb14: memory and motherboard replaced by engineer (fixed).
- gdss134 given back to castor.
- Produced graphs of hardware failures.
- gdss105 and 171 given back to castor.
- Working on 2008 disk servers and worker nodes.
- Working on gdss66, 70, 282, 364 and 380.
Absences
- Jonathan (2 days - home emergency)
Operational Issues and Incidents
Index | Description | Start | End | Severity | Affected VO(s) |
---|---|---|---|---|---|
EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the Castor EMC arrays | Tuesday 6/Oct am | UPS issues to be fixed | Catastrophic | All |
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB
Component | Description | Start | End | Affected VO(s) | Type |
---|---|---|---|---|---|
Development priorities
- All
- Martin:
- Minor procurements
- planning and change control activities for pre-data-taking period
- Ian:
- Update Quattor server
- Work with James A to deploy and test SINDES on Quattor server
- Implement CIP config update on Thursday
- Virtualisation platform planning
- James T:
- Document procedure for "mega intervention".
- Ongoing quattorisation of disk servers.
- CRISTAL2 support group.
- ATLAS WAN tuning for Brian.
- Progress meeting with Viglen.
- Updates to some systems.
- Jonathan:
- work on test restore of home filesystem subdirectory
- final checks of change to restrict SSH login on disk servers
- complete work on installing Nagios slave server via Quattor
- update RPMs on various servers
- Nagios configuration updates
- James A:
- Rolling out SINDES.
- Working on user contact database.
- Finishing IPMI cabling.
- Working on forwarding BMS alerts to Nagios.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- gdss380 given back to castor.
- afs2 drive failure.
- Continuing decommissioning of old batch systems (R 27).
- Continuing work on 2008 disk servers and worker nodes.
- Continuing work on gdss66, 70, 282, 364 and 380.
Absences
- Kashif (Thursday - A/L)
Fabric On-Call
James T: Monday-Thursday; Ian: Friday-Sunday
Advanced Warning of Requirements and Blocking issues
- Unable to proceed with ATLAS TAG migration to 64-bit because the arrays are being used for 3D systems while the EMC kit is flaky.
Services Issues
- Various requests for hardware.
- Working on hardware provision for Services team testbeds.