RAL Tier1 weekly operations Fabric 20100510
Developments
- All:
- APRs
- Martin:
- Ian:
- Tim:
- Cheney:
- Sort out problem with openssl and websites
- Tweaking DB backups
- Tweaks to nagios
- Improve dmf backups
- Jonathan:
  - Shut down the restarted netnag server
- 3 Nagios configuration updates
- James A:
- Working on newish Atlas Software Server(s)
- Investigating quattorfs from MS.
- James T:
- AoD Thursday
- Drive changes for Kash
- gdss397
  - Phone conference with Streamline. All machines bar two have passed vendor testing; acceptance testing starts this week.
- APR
- Applied for new certificates for gdss87-367 and gdss478-575
- Kash:
- Drive replacement.
- Fixing broken WNs.
  - Decommissioning old batch systems (R 27).
  - Daily report to James T on hardware failures in the Streamline 2009 disk servers.
  - gdss397 crashed after a single drive failure (intervention).
- APR.
- Streamline Engineer service call.
- Boston Engineers service call.
Absences
- Jonathan on partial retirement (not in on Mondays and Fridays)
- Jonathan absent 2 days due to personal and family sickness
- Kashif Annual Leave (Wednesday and Thursday)
Operational Issues and Incidents
Index | Description | Start | End | Severity | Affected VO(s)
---|---|---|---|---|---
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Type = Down/At Risk/Cancelled; entries already in, or planned to go into, GOCDB
Component | Description | Start | End | Affected VO(s) | Type
---|---|---|---|---|---
Development priorities
- All:
- Martin:
- Ian:
- Tim:
- Cheney:
- Improvements to dmf backups
- Add in new tape servers to tsbn/sls
- Jonathan:
  - Start regular check restores of the home filesystem
  - Final checks of the new Nagios slave, then stop nagios01/05
  - Continue investigating setting up an AFS directory as the Atlas software server
  - Nagios configuration updates
- James T:
- Streamline '09 testing
  - APR/job plan
- Update certificates on disk servers where needed
- Define disk server benchmarking procedure
- Fill in for Kash on disk server maintenance
- James A:
- Duplicating software from current Atlas software server to new standby server.
- Returning Streamline Storage Node networking to original configuration so acceptance testing can be started.
- Kash:
- Drive replacement.
- Fixing broken WNs.
  - Continue decommissioning old batch systems (R 27).
  - Daily report to James T on hardware failures in the Streamline 2009 disk servers.
Absences
- Jonathan on partial retirement (not in on Mondays and Fridays)
- Jonathan 1 day annual leave (Tuesday)
- Ian @CERN
- Martin @CERN