RAL Tier1 weekly operations Fabric 20100215
From GridPP Wiki
Summary of week gone
Developments
- All:
- Martin:
- Minor procurements
- Preparation for deliveries
- Ian:
- Change control plan for WN OS update
- Work on Castor-Fabric integration
- Debugging filesystem problems on Nagios slave
- Learning about Quattor core development
- James T:
- All Viglen 08 disk servers finished testing & re-installed with holding configuration.
- Quattor
- Script to create hardware templates and machine profiles for disk servers based on the hardware database and overwatch.
- Added LSF accounts
- Fixed Ganglia after network intervention.
- Jonathan:
- Administrator on Duty (Wednesday)
- sorted out atlasbackup problems on several nodes
- updated /etc/mail/local-host-names on pat
- updated RPMs on Nagios slave servers and rebooted for new kernel
- updated RPMs on core servers
- Nagios configuration updates
- new versions of RPM tier1-nrpe-config
- worked on Quattor configuration of Nagios slave servers
- James A:
- Bulk of time spent preparing floor-space and networking for Viglen and Streamline deliveries.
- Added Hardware database content for Quattorised disk-servers and new deliveries.
- Some progress on SINDES integration.
- Made some fixes to the hardware database.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Decommissioning old batch systems. (R 27)
- gdss211 completed 7 days acceptance test.
- gdss93 and gdss161 given back to castor. (Fixed)
- gdss77 replaced 4x1GB memory and two drives. (Fixed)
- nc21 (lcg0280) found faulty memory. - Intervention
- lcgce07 faulty drive.
- gdss130 and gdss172 given back to castor.
- gdss86 replaced 4x1GB memory and RAID card memory.
- gdss364 replaced 16-port RAID card. (Fixed)
- gdss294 kernel panic. (faulty memory) - Intervention
- Cabling for new systems in HPD room with James A.
- Working on 2008 disk servers and worker nodes.
- Working on gdss77, 282 and 294.
Absences
- Jonathan on partial retirement, worked Tuesday, Wednesday and Thursday
Operational Issues and Incidents
Index | Description | Start | End | Severity | Affected VO(s) |
---|---|---|---|---|---|
EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the Castor EMC arrays | Tuesday 6/Oct am | UPS issues to be fixed | Catastrophic | All |
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB
Component | Description | Start | End | Affected VO(s) | Type |
---|---|---|---|---|---|
Development priorities
- All
- Martin:
- Minor procurements
- More prep for deliveries
- Ian:
- TDG talk on Quattor (done)
- Further work on integrating Castor HW in fabric workflow
- Setting up Quattor testbed
- James T:
- First Aid course on Tuesday
- Viglen 09 deliveries
- Work on Tier1 tour for open day
- Quattor
- Fix python dependency problem during fresh installs
- SL5 64-bit disk server build
- Jonathan:
- complete work on installing Nagios slave servers via Quattor
- implement cron job with checks to run daily test restores of home filesystem
- Nagios configuration updates
- James A:
- Continued preparation for deliveries.
- Liaison with suppliers during installation.
- Continue with SINDES integration.
- Continue with work on Hardware database.
- Blank and return loaned hardware.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- lcgce07 drive replacement. (Hot swap)
- Continuing work (memory replacement) with Cheney.
- Eight Viglen 2006 disk servers for decommissioning/preprod. (Label and configure)
- Continuing to decommission old batch systems. (R 27)
- Continuing work on 2008 disk servers and worker nodes.
- Continuing work on gdss77, 282 and 294.
Absences
- Jonathan working Tuesday, Thursday and Friday
- James out on a first aid course on Tuesday
Fabric On-Call
Ian Primary OnCall Mon-Thurs
Advanced Warning of Requirements and Blocking issues
- Unable to proceed with Atlas TAG migration to 64-bit because the arrays are being used for the 3D systems while the EMC kit is flaky.
Services Issues
- Various requests for hardware.
- Working on hardware provision for Services team testbeds.