RAL Tier1 weekly operations Fabric 20100222
From GridPP Wiki
Latest revision as of 15:57, 22 February 2010 (Martin Bly)
Summary of week gone
Developments
- All:
- Martin:
- Delivery of Viglen 09 Disk servers part 1 (26/38) and Viglen 09 CPU.
- Minor procurements
- iSCSI arrays
- Work on additional C300 switch
- Work on Services/Virtualisation/Quattor servers
- Ian:
- Work on Castor-Fabric handover
- Prototype core castor server in Quattor
- James T:
- Accepted Viglen 2008 disk.
- Assisted Viglen with the installation of 2009 disk.
- Quattor.
- Some progress on ironing out disk server installation problems.
- Fixed stage/st user/group not existing when CASTOR RPMs installed.
- Thought about 64-bit + XFS.
- Worked on tour for Tier1 open day with James Adams.
- Dealt with disk faults in Kash's absence.
- Jonathan:
- nuked lcgdb98 and removed from yumit, mimic, atlasbackup
- updated RPM tier1-nrpe-config on many farm nodes for new Nagios slave server
- sent message to user about cracked password
- issued Change Control for new Nagios server
- started preparing new versions of RPMs tier1-nrpe-config, tier1-sudo-config and tier1-nagios-plugins
- Nagios configuration updates
- worked on Quattor configuration of Nagios slave servers
- James A:
- Supported Viglen during delivery.
- Completed preparation for Streamline delivery.
- Blanked loaned hardware.
- Started work on Tier1 tours with JIT.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Decommissioning old batch systems. (R 27)
- gdss211 reinstalling.
- gdss294 replaced 4x2GB memory and given back to castor. (Fixed)
- gdss282 replaced 16-port RAID card. (Fixed)
- nc21 (lcg0280) found faulty memory. - Intervention
- Eight Viglen 2006 disk servers for decommissioning/preprod. (Labelled and configured)
- lcgce07 faulty drive. (Scheduled for Monday 22nd Feb 2010)
- gdss171 given back to castor.
- Cabling for new systems in HPD room with James A.
- Working on 2008 disk servers and worker nodes.
- Working on gdss211 and 295.
Absences
- Jonathan on partial retirement, worked Tuesday, Thursday and Friday
Operational Issues and Incidents
Index | Description | Start | End | Severity | Affected VO(s) |
---|---|---|---|---|---|
 | EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the Castor EMC arrays | Tuesday 6/Oct am | UPS issues to be fixed | Catastrophic | All |
 | gdss364 disk controller sick | Friday ~20:00 | Ongoing | Severe | CMS (FarmRead) |
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB
Component | Description | Start | End | Affected VO(s) | Type |
---|---|---|---|---|---|
Development priorities
- All
- Martin:
- Delivery of remaining Viglen 09 disk (12/38) - Wednesday
- Delivery of Streamline 09 disk (40/60) - Thursday/Friday
- Minor procurements:
- Additional C300
- Services/Virtualisation/Quattor servers
- Ian:
- Helping James T with Quattorised disk server
- Castor handover work
- Further work on Quattorised Castor servers
- Investigate options for further hardware monitoring
- James T:
- Quattor:
- Resolve disk server install problems.
- Document deployment procedure using Quattor
- Work on Tier1 tour
- Identify 5 machines to deploy for ATLAS
- ATLAS WAN tuning
- RA training Monday 11.00 to 14.00
- Admin on duty Tuesday and Wednesday
- Fabric on call Monday 22nd to Monday 1st
- Jonathan:
- complete work on installing Nagios slave servers via Quattor
- implement cron job with checks to run daily test restores of home filesystem
- Nagios configuration updates
- release new versions of RPMs
- James A:
- Focusing on last of SINDES integration.
- Annual leave on Friday.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- lcgce07 drive replacement. (Hot swap)
- Replacing memory in nc21 (lcg280).
- Continuing to decommission old batch systems. (R 27)
- Continuing work on 2008 disk servers and worker nodes.
- Continuing work on gdss211 and 295.
Absences
- Jonathan: working Tuesday, Wednesday and Thursday
- Kashif: A/L Wednesday and Thursday.
- James T: Training 11.00 - 14.00 Monday
- Martin: A/L Friday pm
Fabric On-Call
James T. Monday - Monday
Advanced Warning of Requirements and Blocking issues
- Unable to proceed with ATLAS TAG migration to 64-bit because the arrays are being used for 3D systems while the EMC kit is flaky.
Services Issues
- Various requests for hardware.
- Working on hardware provision for Services team testbeds.