RAL Tier1 weekly operations Fabric 20100823
From GridPP Wiki
Latest revision as of 14:44, 23 August 2010
Developments
- All:
- Martin:
- Ian:
- Tim:
- Jonathan:
- Out all week
- James A:
- James T:
- Cheney:
- ADS cache node ran out of disk space
- trying to locate key to unlock disk array
- move preprod database to florence disk array x 2
- investigate security alert on ads0pt02
- regenerate web stats for hinode external customer
- fix backups on buxton-kiki
- make space on dmf
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Decommissioning old batch systems. (R 27)
- gdss110 fsprobe errors. (Intervention)
- Replaced 1 drive in Streamline 2009 (Test) disk servers.
- gdss490, 492, 499, 501 and 505 crashed during acceptance testing. (Reported)
- gdss381 crashed with single drive failure. (Intervention)
- lcgfts02: replaced both drives. (Fixed)
- gdss280 fsprobe errors. (Intervention)
- Hardware failure stats/graphs.
- Preparing Viglen 2006 disk servers with a new RAID configuration for Castor Preprod.
- Streamline/Areca disk servers crashed due to a single faulty drive. (Ongoing)
Absences
- Jonathan on partial retirement (not in on Monday and Friday)
- Jonathan on leave Tuesday - Thursday so out all week
Operational Issues and Incidents
Index | Description | Start | End | Severity | Affected VO(s) |
---|---|---|---|---|---|
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Entries of type Down, At Risk or Cancelled that are in, or planned to go into, GOCDB
Component | Description | Start | End | Affected VO(s) | Type |
---|---|---|---|---|---|
Development priorities
- All:
- Martin:
- Ian:
- Tim:
- Cheney:
- Facilities castor
- Jonathan:
- update imapd certificate on pat
- start the Nagios process on the new slave server (nagios01, for batch workers) and shut down the old Nagios slave servers once it is stable
- release new versions of the RPMs tier1-nagios-plugins, tier10-sudo-config and tier1-nrpe-config; make the equivalent of the tier1-nrpe-config change in the Quattor configuration
- James T:
- James A:
- Kash:
- Drive replacement.
- Fixing broken WNs.
- gdss417 crashed. (Intervention)
- gdss468 down. (Intervention)
- Update daily status of Streamline 2009 disk servers testing.
- Continuing decommissioning of old batch systems. (R 27)
Absences
- Jonathan on partial retirement (not in on Monday and Friday)
Fabric On-Call
- Kashif Hafeez