RAL Tier1 weekly operations Fabric 20100816
From GridPP Wiki
Developments
- All:
- Martin:
- Ian:
- Work on virtualisation testbed
- Configuring iSCSI storage
- Tim:
- Jonathan:
- obtained new host certificate for pat, saved it on touch, and created Change Control for update of imapd certificate
- applied 2 updates to source for RPM tier1-sudo-config for CMS
- corrected problem with logrotate on lcgcts0[1-9] by correcting Quattor configuration
- AFS user assistance
- user changes
- 1 Nagios configuration update
- installed slave server for batch workers (nagios01) using Quattor, created initial configuration for nagios01 and verified it; submitted Change Control for switch to new slave
- updated source for tier1-nrpe-config RPM
- James A:
- Planning cabling for Castor racks G and H.
- General Quattoryness.
- James T:
- Disk server bits and pieces in Kash's absence.
- Streamline 2009 testing:
- 57 machines testing OK, due to finish on Wednesday 18th August.
- Two machines stopped testing. Streamline, LSI and WD are looking into why.
- One machine still at LSI until problem with the two machines has been diagnosed.
- Work on new Areca firmware for Streamline 2008 machines to fix problems with arrays going offline.
- Testing on gdss380 has run for a week without problems.
- Three machines have now shown this problem (gdss380, gdss381, gdss417) so we've escalated to Streamline.
- Started to re-configure IPMI on the disk servers to use the 10.0.0.0/8 network addresses and document access.
- Initial stab at quattorised logger (work in progress).
- Cheney:
- not around.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Decommissioning old batch systems (R 27).
- gdss110 fsprobe. (Started memtest)
- Replaced 2 drives in Streamline 2009 (Test) disk servers.
- gdss490 and gdss492 crashed during acceptance testing. (reported)
- gdss381 crashed with single drive failure. (Intervention)
- lcgfts02 replaced drive (sda).
- Hardware failure stats/graphs.
- Preparing Viglen 2006 disk servers with new raid configuration for Castor Preprod.
- Streamline/Areca disk servers crashed due to single faulty drive. (ongoing)
Absences
- Jonathan on partial retirement (not in on Monday and Friday)
Operational Issues and Incidents
| Index | Description | Start | End | Severity | Affected VO(s) |
|---|---|---|---|---|---|
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB
| Component | Description | Start | End | Affected VO(s) | Type |
|---|---|---|---|---|---|
Development priorities
- All:
- Martin:
- Ian:
- Preparation for GridPP 25
- Further work on virtualisation testbed
- Tim:
- Cheney:
- Cutover to another disk array for Castor preprod
- Jonathan:
- On leave Tuesday-Thursday, so out all week
- James T:
- A/L Monday and Tuesday
- Disk server IPMI set up
- Quattorised loggers
- James A:
- Cabling Castor racks G & H.
- Various Nagios test changes and developments.
- Planning and administration for meetings.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- gdss417 crashed again. (Intervention)
- Update daily status of Streamline 2009 disk servers testing.
- Continuing decommissioning of old batch systems (R 27).
Absences
- Jonathan on partial retirement (not in on Monday and Friday)
- Jonathan on leave Tuesday - Thursday (so out all week)
- Cheney on leave Thursday
- Tim on leave all week
Fabric On-Call
- Kashif - Monday-Sunday