Operations Report 07/12/2009
Summary of Previous Week
- 3D workshop
- We explained the sequence of events that caused the data loss
- Backup policy validated
- We ran a few exercises on restoring a database from backup
Developments
- Tier1: The script that prevents ASM from mounting the old disk array has been validated by Oracle
- Tier1: Ongoing work on testing backups
Operational Issues and Incidents
- FTS: One FTS session locked over 100 other FTS sessions. The locked sessions generated timeout errors.
- FTS: One of the FTS service nodes crashed last night. The FTS service was not affected because it runs on two nodes.
- 3D: ATLAS and LHCb have been hit by an Oracle bug in the streaming process. Under investigation
Plans for Week(s) Ahead
- Tier1: Continued work on backup/recovery procedures (and documentation)
- Tier1: Automate the backup restore procedure
- Tier1: Start planning the migration of Tier1 databases back to the EMC kit
- Tier1: Re-run the ASM tests against the database once it is using the EMC kit
Downtimes and At Risk
Description | Start | End | Affected VO(s) |
---|---|---|---|
Development Priorities
- Revalidate the ASM configuration when the EMC kit is back in production
- CASTOR Database Monitoring
- Migrate ATLAS TAGs to 64bit systems
- Investigate ORACLE replication technique for LFC/FTS resilience
- Investigate hardware architecture, backup and recovery strategy, resilience and validation of restored backup.
Requirements and Blocking Issues
Description | Required By | Priority | Status |
---|---|---|---|
EMC kit | At least a week before going into production | High | Waiting |
Hardware for Tag databases | | Medium | Waiting |
Hardware to test LFC database replication | | Medium/High | Waiting |
OnCall
- Carmine