RAL Tier1 weekly operations Fabric 20100208
From GridPP Wiki
Revision by James Adams
Latest revision as of 14:50, 8 February 2010
Summary of week gone
Developments
- All:
- Martin:
- Minor procurements
- Ian:
- Carried out first CIP upgrade and prepared second.
- Tested WN minor upgrade to SL5.4
- Reviewed and updated some local Quattor docs
- Made first contribution to SCDB Quattor docs at LAL
- Reinstituted fabric automation steering group
- James T:
- Viglen '08 disk servers
- 5 installed with production config
- 22 finished testing over the weekend.
- Quattorisation of disk servers
- OPN routing
- SSH lockdown
- rc.local tuning
- Script to import disk servers from hardware database
- Jonathan:
- Administrator on Duty (Wednesday)
- restarted password cracker on enigma; updated iptables configuration to stop logging dropped packets
- sorted out atlasbackup problems on various nodes
- fixed ntpd process on lcgfts0423
- NIS configuration changes
- installed local NRPE and Ganglia configurations on ccse01
- fixed access to Bfactory disk servers for userid bbdatsrv
- Nagios configuration updates
- new versions of RPMs tier1-nagios-plugins, tier1-nrpe-config and tier1-sudo-config
- worked on Quattor configuration of Nagios slave servers; reinstalled new slave server (found Quattor bug)
- James A:
- Prepared and shipped equipment for integration at supplier's premises in preparation for delivery.
- Quattorising various pieces of hand-configured functionality on quattor01 so that SINDES can be integrated.
- Network cabling for CASTOR team.
- Kickstart and Quattor trouble-shooting for various people.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Decommissioning old batch systems (R27).
- gdss211 completed 7-day acceptance test.
- gdss150 and gdss226 fixed and given back to CASTOR.
- gdss77: no display, found faulty memory (intervention).
- nc21 (lcg0280): found faulty memory (intervention).
- lcglb01: replaced drive (hot swap).
- lcgvo-alice: has offline sectors; started long SMART test (offline mode).
- Moved Streamline switches and other parts to logistics (R56).
- Replaced 9 faulty drives in Viglen 2008 disk servers with Viglen engineer.
- Working on 2008 disk servers and worker nodes.
- Working on gdss77, 282 and 364.
Absences
Operational Issues and Incidents
Index | Description | Start | End | Severity | Affected VO(s) |
---|---|---|---|---|---|
EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the CASTOR EMC arrays | Tuesday 6/Oct am | UPS issues to be fixed | Catastrophic | All |
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB
Component | Description | Start | End | Affected VO(s) | Type |
---|---|---|---|---|---|
Development priorities
- All
- Martin:
- Minor procurements
- Ian:
- Further research into virtualization platforms
- Plan for rolling upgrade of WNs to SL5.4
- Further work on integration of CASTOR fabric management into Fabric team
- James T:
- Quattorisation of disk servers.
- Get CASTOR info directly from Overwatch
- Testing
- Re-install latest tranche (22) of Viglen '08 disk servers.
- Writing nagios checks
- Jonathan:
- Administrator on Duty (Wednesday)
- implement cron job with checks to run daily test restores of home filesystem
- complete work on installing Nagios slave server via Quattor
- Nagios configuration updates
- James A:
- Continue with SINDES integration where possible.
- Spend some time developing the Hardware Database with Kash.
- Prepare machine room to accept deliveries of new hardware.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- lcgce07 drive replacement. (Hot swap)
- gdss77 and gdss86: replace 4x 1 GB memory (recently bought by Martin).
- Continuing memory replacement work with Cheney.
- Viglen 2006 (8) disk servers for decommissioning/preprod (label and configure).
- Continuing to decommission old batch systems (R27).
- Continuing work on 2008 disk servers and worker nodes.
- Continuing work on gdss77, 130, 282 and 364.
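As an illustration of the "Writing nagios checks" item above, here is a minimal sketch of a check in the usual Nagios plugin style: one status line on stdout and an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The filesystem-usage check, the function names and the 80%/90% thresholds are assumptions chosen for the example; they are not taken from the tier1-nagios-plugins RPM.

```python
import shutil

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(used_pct, warn=80.0, crit=90.0):
    """Map a usage percentage to an (exit_code, status_line) pair.

    The thresholds are hypothetical defaults, not site policy.
    """
    if used_pct >= crit:
        return CRITICAL, "DISK CRITICAL - %.1f%% used" % used_pct
    if used_pct >= warn:
        return WARNING, "DISK WARNING - %.1f%% used" % used_pct
    return OK, "DISK OK - %.1f%% used" % used_pct


def check_path(path, warn=80.0, crit=90.0):
    """Check filesystem usage for the filesystem containing `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # Path missing or unreadable: report UNKNOWN rather than guessing.
        return UNKNOWN, "DISK UNKNOWN - %s" % exc
    if usage.total == 0:
        return UNKNOWN, "DISK UNKNOWN - zero-sized filesystem at %s" % path
    used_pct = 100.0 * usage.used / usage.total
    return evaluate(used_pct, warn, crit)


def main(argv):
    """Print the status line and return the exit code for Nagios/NRPE."""
    path = argv[1] if len(argv) > 1 else "/"
    code, line = check_path(path)
    print(line)
    return code
```

To use this as a plugin, a wrapper script would call `sys.exit(main(sys.argv))` so that the return value becomes the process exit code that Nagios or NRPE reads.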
Absences
- Jonathan: from this week, changing work pattern to 3 days per week (normally Tuesday, Wednesday, Thursday)
Fabric On-Call
Ian Mon-Sun
Advanced Warning of Requirements and Blocking issues
- Unable to proceed with Atlas TAG migration to 64-bit because the arrays are being used for 3D systems while the EMC kit is flaky.
Services Issues
- Various requests for hardware.
- Working on hardware provision for Services team testbeds.