Operations Report 16/11/2009

From GridPP Wiki

Summary of Progress on Backup and Recovery testing

The 3D backup and recovery tests are part of the wider database team activity to test backup and recovery procedures at regular intervals. Current work includes:

* updating our backup documentation
* creating a new procedure for recovery (done – on wiki)
* scripting this new procedure
* having other DBAs follow procedures written by their DBA colleagues (i.e. to check each procedure)

Richard is currently testing the recovery of a RAC database to a single-instance database, and will now start work on a RAC to RAC recovery. This will be performed on version 10g (Tier-1) and 11g (others).

Once these procedures are completed and tested, a framework will be set up to regularly test the 3D database backup/recovery. Additionally, preparations are being made for the backup/recovery tests during the 3D workshop at CERN next week.

Everything is expected to be in place by mid-December.

Summary of Previous Week

Developments

  • Castor: Oracle patch applied (Neptune, Pluto, Uranus)
  • Castor: NS trigger preparation
  • Castor: Oracle reinstalled on Vulcan
  • Tier1: Tested some backup restore scenarios
  • Tier1: Resilience testing plan for the coming week
  • 3D: Oracle patch applied (Ogma, Lugh)


Operational Issues and Incidents

  • RAC failover mechanism was slow; the problem has been fixed.


Plans for Week(s) Ahead

  • Castor: Finalise NS trigger
  • Tier1: Make ASM mount an old disk array
  • Tier1: Test the script to guard against the database connecting to an “old” ASM mirror
  • Tier1: Continued work on backup/recovery procedures (and documentation)
  • Tier1: Start to automate the backup restore procedure
  • Tier1: Updating disaster procedures


Downtimes and At Risk

Description | Start | End | Affected VO(s)

Development Priorities

  • CASTOR Database Monitoring
  • Migrate ATLAS TAGs to 64-bit systems
  • Investigate Oracle replication technique for LFC/FTS resilience
  • Investigate hardware architecture, backup and recovery strategy, resilience and validation of restored backup

Requirements and Blocking Issues

Description | Required By | Priority | Status
Hardware for Tag databases | | Medium | Waiting
Hardware to test LFC database replication | | Medium/high | Waiting

OnCall

  • Carmine