Operations Report 12/02/2010
From GridPP Wiki
Summary of Previous Week
- Setting up test environment
- Comparing test environment with production one
- 18 crash tests to reproduce production problems
- Adding last archived logs to the Automated Recovery Script (to reduce the lag between actual data and the backup copy)
- Tests to add / remove cluster nodes
Operational Issues and Incidents
- Tier1: still some EMC/Oracle problems on PLUTO/NEPTUNE. Trying to reproduce them on the test system.
Plans for Week(s) Ahead
- CASTOR nodes Upgrade
- Resilience tests
- Node manipulation (add/remove) tests
Downtimes and At Risk
Description | Start | End | Affected VO(s) | Type |
---|---|---|---|---|
Memory upgrade on Castor nodes | start of the week | start of the week | ATLAS, LHCb | At risk |
Moving backup area from cdbc08 to spare node | start of the week | start of the week | ATLAS, LHCb | At risk |
Disable clusterware on cdbc08 node | start of the week | start of the week | ATLAS, LHCb | At risk |
Development Priorities
- Deploy CASTOR Database Monitoring
- Migrate ATLAS TAGs to 64bit systems
- Investigate ORACLE replication technique for LFC/FTS resilience
- Investigate hardware architecture, backup and recovery strategy, resilience and validation of restored backup.
Requirements and Blocking Issues
Description | Required By | Priority | Status |
---|---|---|---|
Hardware for new Neptune node | | Medium/high | Waiting |
Hardware to test LFC database replication | | Medium/high | Waiting |
OnCall
- Keir
Absences
- Rich Out Until 22nd February
- Carmine Out Until 17th February
- Keir Away on 16th February