Operations Report 07/12/2009

From GridPP Wiki

Summary of Previous Week

  • 3D workshop
    • We explained the sequence of events that caused the data loss
    • Backup policy validated
    • We ran a few exercises on restoring a database from backup

Developments

  • Tier1: The script that prevents ASM from mounting the old disk array has been validated by Oracle
  • Tier1: Ongoing work on testing backups

Operational Issues and Incidents

  • FTS: One FTS session locked over 100 other FTS sessions. The locked sessions generated timeout errors.
  • FTS: One of the FTS service nodes crashed last night. The FTS service was not affected because it runs on two nodes.
  • 3D: ATLAS and LHCb have been hit by an Oracle bug in the streaming process. The bug is under investigation.

Plans for Week(s) Ahead

  • Tier1: Continued work on backup/recovery procedures (and documentation)
  • Tier1: Automate the backup restore procedure
  • Tier1: Start planning the migration of Tier1 databases back to the EMC kit
  • Tier1: Re-run the ASM tests against the database once it is using the EMC kit

Downtimes and At Risk

Description Start End Affected VO(s)

Development Priorities

  • Revalidate the ASM configuration when the EMC kit is back in production
  • CASTOR Database Monitoring
  • Migrate ATLAS TAGs to 64-bit systems
  • Investigate Oracle replication techniques for LFC/FTS resilience
  • Investigate hardware architecture, backup and recovery strategy, resilience and validation of restored backup.


Requirements and Blocking Issues

Description | Required By | Priority | Status
EMC kit | At least a week before going into production | High | Waiting
Hardware for TAG databases | | Medium | Waiting
Hardware to test LFC database replication | | Medium/High | Waiting

OnCall

  • Carmine