RAL Tier1 weekly operations Fabric 20091012
From GridPP Wiki
Revision as of 15:04, 12 October 2009 by James adams (Talk | contribs)
Summary of week gone
Developments
- All
- Martin:
- Disk procurement ITT evaluation
- Deployment of 3D databases onto old hardware due to power feed problems making the EMC arrays unstable
- Meeting with Seagate about disk problems
- Ian:
- James T:
- Viglen testing:
- Meeting with
- Drives swapped for a different batch in 10 machines (220 drives).
- Logs captured on 2 October by Seagate showed further issues so they issued another updated firmware.
- More logs captured from timed-out drives on Thursday 8th.
- Tested racks with the functional earth removed - same problems.
- user_xattr mount option rolled out to all CASTOR disk servers.
- Created Storage_CASTOR_Gen ganglia cluster for Brian (former CASTOR team blocking issue).
- Cleaned up some fabric tickets.
- DNS request for repack server.
- HEPSYSMAN on Wednesday 7th (talked about Tier1 storage).
- Jonathan:
- configured nagios@nagger.gridpp.rl.ac.uk as PBS operator
- worked on migration of user home filesystems to new server
- updated RPMs on core servers and rebooted where required
- updated wiki documentation to reflect the change of Nagios master server to nagger
- added new users to Tier1 and AFS
- added new top directory superb for Babar (RT #52070)
- Nagios configuration updates on servers and clients
- James A:
- Lots of work on BatchWorkers in QUATTOR.
- Brought SL5 farm to 90% of KSI2K Capacity.
- Shrunk SL4 farm correspondingly.
- Made some minor progress with SINDES.
- Some changes to ARTEMIS for UPS room.
- Removed AtlasBackup from base machine template in QUATTOR
- Kash:
- Drive replacement.
- Fixing broken WNs.
- gdss354 fixed and back in production.
- gdss218: backplane cables were connected the wrong way round (fixed).
- gdss126: double disk failure; array verification completed.
- 220 Seagate drives dispatched (handed to the Seagate engineer).
- Finished adding additional RAID cards to v06 (CASTOR disk servers).
- Working on 2008 disk servers and worker nodes.
- Working on gdss67, 86, 126 and 170.
Operational Issues and Incidents
| Index | Description | Start | End | Severity | Affected VO(s) |
|---|---|---|---|---|---|
|  | EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the CASTOR EMC arrays | Tuesday am | not in sight | Catastrophic | All |
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB
Component | Description | Start | End | Affected VO(s) | Type |
---|---|---|---|---|---|
Development priorities
- All
- Martin:
- Disk procurement ITT evaluation
- CPU procurement ITT clarifications
- Ian:
- James T:
- Assign machines for deployment.
- Send out requests for people to complete CRISTAL 2 feedback forms.
- Viglen testing:
- Continue testing latest firmware.
- Prepare to hand over to someone else.
- Jonathan:
- work on migration of Tier1 home filesystem to new server
- work on installing Nagios slave servers using Quattor
- Nagios configuration updates as required
- James A:
- Continue pushing forward with SINDES.
- Take over disk issues from James T.
- Integrate BMS alerts into the ARTEMIS data stream.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Continue working on 2008 disk servers and worker nodes.
- Continue working on gdss67, 86, 126 and 170.
Absences
- James T
- James T on A/L from Thursday 15th until Monday November 2nd.
Fabric On-Call
- Mon-Fri:
Advanced Warning of Requirements and Blocking issues
Services Issues
- Various requests for hardware.