RAL Tier1 weekly operations Fabric 20100308
From GridPP Wiki
Summary of week gone
Developments
- All:
- Martin:
- C300 procurement
- Development and management network planning
- Decommissioning R26 A1 Upper
- Meetings
- Ian:
- Refined Quattor Disk server config
- Refined prototype Quattor core castor server config
- Provision of additional h/w for LFC boxes plus initial setup
- Set up second repository server
- James T:
- Draft blog post on the Viglen 08 disk procurement saga (not to be published until we've agreed wording with Viglen, et al.)
- Fixed some problems with older kickstarts
- Worked on documentation for Quattor deployment of disk servers
- Created second draft of Tier1 tour structure
- Fixed /var on gdss208
- Jonathan:
- sorted out atlasbackup problems for several servers
- wrote and added change control forms for planned changes
- added install02 to NIS netgroup and Nagios
- dropped MySQL database csf_monitor from lcgsql0363
- started work on disposals from A1 Upper
- started work on clearing out old filesystems on csfnfs58
- deleted 8K+ old backups of AFS volumes from Datastore
- Nagios configuration updates
- James A:
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Decommissioning old batch systems (R27).
- gdss211 given back to castor.
- gdss208: power distribution board replaced with Viglen engineer; fixed and given back to castor.
- Cleared A5 upper test room.
- gdss347: replaced 4x2GB memory; fixed and back in production.
- gdss135 given back to castor.
- Castor servers (cdbc13/cdbd03) still working. (Intervention)
- Replaced drive in afs1. (Fixed)
- Replaced memory in lcgftm0430. (fixed)
Absences
- Jonathan on partial retirement (not in on Monday and Friday)
Operational Issues and Incidents
Index | Description | Start | End | Severity | Affected VO(s) |
---|---|---|---|---|---|
EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the Castor EMC arrays | Tuesday 6/Oct am | UPS issues to be fixed | Catastrophic | All |
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB
Component | Description | Start | End | Affected VO(s) | Type |
---|---|---|---|---|---|
Development priorities
- All
- Martin:
- Networking
- Completing C300 procurement
- Installing kit in R26 A5L and R89 UPS room
- Ian:
- Bring second repository server into production
- Further work on Quattorised Castor servers
- Further assistance for Catalin on configuring LFC boxes
- Researching better tests of RAID hardware in Nagios
- James T:
- Quattor
- Finish deployment documentation after changes discussed with Chris
- Updates to lean disk server with Chris and Ian
- Tier1 tour planning
- Viglen '09 disk testing
- TOASTER prep
- ATLAS WAN tuning
- Ticket tidy up
- Quattor
- Jonathan:
- continue working on disposals
- continue clearing out old filesystems on csfnfs58
- implement cron job with checks to run daily test restores of home filesystem
- Nagios configuration updates
- James A:
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Continuing to decommission old batch systems (R27).
Absences
- Jonathan on partial retirement (not in on Monday and Friday)
Fabric On-Call
Ian Mon-Sun
Advanced Warning of Requirements and Blocking issues
- Unable to proceed with the ATLAS TAG migration to 64-bit because the arrays are being used for 3D systems while the EMC kit is flaky.
- Update (2010/03/01): new hardware is now on site and ready to be installed in the rack in R26 A5L.
Services Issues
- Various requests for hardware.
- Working on hardware provision for Services team testbeds.