RAL Tier1 Incident 20091009 Castor data loss


Current Situation: Incident resolved. Report complete.

Error in restore of Castor Database led to Data Loss.

Site: RAL-LCG2

Incident Date: 2009-10-09

Severity: 4

Service: Storage (Castor)

Impacted: All local VOs

Incident Summary: Following the failure of both disk systems used to host the Castor databases, it was necessary to carry out a database restore. The restore completed, but the live database was subsequently found to date from an earlier point in time than intended. This resulted in data loss from Castor.

Type of Impact: Data Loss

Incident duration: N/A

Report date: 2009-10-20

Reported by: Gareth Smith, Gordon Brown, Keir Hawker

Status of this Report: Closed

Related URLs: The failures in the hardware underneath the Castor databases are dealt with in a separate incident report at: http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20091004


Description of the Castor Database Systems:

There are two Castor databases:

  • PLUTO - which contains the Castor nameserver database (for all instances) and the stager databases for the CMS and 'GEN' instances.
  • NEPTUNE - which contains the Castor stager databases for the Atlas and LHCb instances.

Since May 2009, the CASTOR Oracle databases have been hosted on a pair of high-end storage arrays. These have been configured as two disk groups (the actual Oracle terminology is FAIL GROUP), with the mirroring performed by the Oracle Automatic Storage Management (ASM) layer. These do not act as a classical master/mirror setup; they are simply two disk groups with the same data written to each. The database can be run on either disk group.
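
As an illustration of the kind of check this layout invites, the sketch below queries the standard ASM views for the state of each disk and its fail group. It is a minimal sketch only, assuming Python with the cx_Oracle driver and an account able to read the V$ASM views; the connection string, credentials and output format are placeholders rather than anything used at RAL.

  # Hedged sketch: report the state of each ASM disk and its fail group, so that
  # operators can see whether one of the two mirrored disk groups is offline.
  # Assumptions: the cx_Oracle driver is installed and the account can read the
  # V$ASM views; connection details below are placeholders.
  import cx_Oracle

  DSN = "asm-host.example:1521/+ASM"   # placeholder connection string

  def report_failgroup_status():
      conn = cx_Oracle.connect("monitor", "secret", DSN)   # placeholder credentials
      cur = conn.cursor()
      cur.execute("""
          SELECT dg.name, d.failgroup, d.mode_status, d.mount_status
          FROM   v$asm_diskgroup dg
          JOIN   v$asm_disk d ON d.group_number = dg.group_number
          ORDER  BY dg.name, d.failgroup
      """)
      problems = []
      for dg_name, failgroup, mode_status, mount_status in cur:
          line = f"{dg_name:12s} {failgroup:12s} mode={mode_status} mount={mount_status}"
          print(line)
          if mode_status != "ONLINE":
              problems.append(line)
      conn.close()
      if problems:
          raise SystemExit("WARNING: ASM disks not ONLINE:\n" + "\n".join(problems))

  if __name__ == "__main__":
      report_failgroup_status()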

Apart from the redundancy, this setup was chosen for the ability to add further disk groups without having to take the database offline. This would be essential for migrating to new disks or adding capacity to the existing hardware.

Incident Overview:

Before the failure on 4th October that led to loss of service, there had been problems on one of the disk groups. These problems had led to a previous outage of several hours on 10th September. Following that incident a patch was received from Oracle and applied (on 21st September) so that the databases would not hang in the same way when one of the disk groups failed. From this point the databases were running on both disk groups. Following another failure of the same disk group on 24th September, the system was running on only one disk group while diagnosis of the hardware fault on the other continued.

Late morning on Sunday 4th October the other disk group failed.

On Monday 5th October an initial database restore was attempted, to a point just before the errors seen on the storage system at 15:20 on Saturday 3rd October. However, the ongoing underlying hardware problems led to the database running chaotically, and the decision was taken to prepare alternative hardware ready to receive the database.

There were problems in restoring the database. A decision was made to restore both PLUTO and NEPTUNE to the last point at which they were consistent with each other, Saturday 3rd October 20:00. This was done on 6th October. After space issues with a node on the PLUTO database, the database came up during the afternoon, unfortunately on the undetected “old” disk group, which had last had data written to it on 24th September. A full database backup was then taken, "locking in" the incorrect version of the database.

Following a further incident review it was decided that data loss could be minimised by restoring PLUTO to the latest point possible, and NEPTUNE to its last available point, 04:00 on Saturday 4th October. The implications of restoring the two databases to different points had been discussed with the Castor developers. PLUTO was restored from the backup taken on 6th October; however, it was not known at the time that this backup only contained data up to 24th September.

On the morning of Friday 9th October the database restores and the migration to alternative hardware were complete. After some hours of consistency checking, Castor was enabled. The checking focused on looking for problems that might arise from what was believed to be the situation: NEPTUNE (the Atlas and LHCb stagers) restored to a slightly earlier point than PLUTO. No inconsistencies were found and the Castor services were therefore re-enabled by 17:00 that day.

During the day on Monday 12th October there were initial reports from CMS of files missing from Castor. These were not understood. It became clear by the end of Tuesday 13th that there was something wrong with the Castor database and that the restore was not as expected. The realisation that the FileIDs in Castor had jumped backwards confirmed the problem. (The FileID is a unique number generated by Oracle and is normally increasing.)
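
The FileID regression that revealed the problem lends itself to a simple automated check. The sketch below is illustrative only and was not a tool in use at the time: it assumes Python with the cx_Oracle driver, that the nameserver's highest FileID can be read from a table such as Cns_file_metadata (an assumption about the schema), and that a high-water mark from the previous run is kept in a local file.

  # Hedged sketch: detect a database "rewind" by checking that the highest FileID
  # in the nameserver never goes backwards between runs of this check.
  # Assumptions: cx_Oracle driver; the FileID is visible in a table/column named
  # Cns_file_metadata.fileid (placeholder schema names); connection details are
  # placeholders.
  import cx_Oracle

  DSN = "pluto-host.example:1521/ns"                 # placeholder
  HWM_FILE = "/var/lib/castor-checks/fileid.hwm"     # placeholder path

  def read_hwm():
      try:
          with open(HWM_FILE) as f:
              return int(f.read().strip())
      except FileNotFoundError:
          return 0

  def write_hwm(value):
      with open(HWM_FILE, "w") as f:
          f.write(str(value))

  def check_fileid_monotonic():
      conn = cx_Oracle.connect("reader", "secret", DSN)   # placeholder credentials
      cur = conn.cursor()
      cur.execute("SELECT MAX(fileid) FROM Cns_file_metadata")
      (current_max,) = cur.fetchone()
      conn.close()
      previous_max = read_hwm()
      if current_max < previous_max:
          raise SystemExit(
              f"ALARM: max FileID went backwards ({current_max} < {previous_max}); "
              "the database may have been restored to an earlier point in time."
          )
      write_hwm(current_max)
      print(f"OK: max FileID {current_max} (previous high-water mark {previous_max})")

  if __name__ == "__main__":
      check_fileid_monotonic()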

Subsequent discussions with Castor developers (at the Castor Face-to-Face meeting at CERN that week) suggested a problem with this re-use of FileIDs. Notably, there was a theoretical way in which duplicate IDs might subsequently lead to data being deleted, although this required an atypical pattern of use. Nevertheless, while the risk was believed to be small, Castor was stopped briefly between 11:05 and 11:15 on Thursday 15th October so that the FileID counter could be jumped beyond the range in which duplicates could occur.
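
The "jump" of the FileID counter can be illustrated as follows, on the assumption that the counter is backed by an Oracle sequence; the sequence name (FILEID_SEQ) and the size of the jump are hypothetical, and the actual procedure used on 15th October is not recorded in this report.

  # Hedged sketch: advance a sequence past a range that may have been re-used by
  # temporarily enlarging its increment, taking one value, and restoring the
  # increment to 1. The sequence name and gap size are hypothetical.
  import cx_Oracle

  def jump_sequence(conn, sequence_name="FILEID_SEQ", gap=10000000):
      cur = conn.cursor()
      # Make the next NEXTVAL land beyond the suspect range.
      cur.execute(f"ALTER SEQUENCE {sequence_name} INCREMENT BY {gap}")
      cur.execute(f"SELECT {sequence_name}.NEXTVAL FROM dual")
      (new_value,) = cur.fetchone()
      # Restore the normal increment of one.
      cur.execute(f"ALTER SEQUENCE {sequence_name} INCREMENT BY 1")
      return new_value

  if __name__ == "__main__":
      # Placeholder credentials and service name.
      conn = cx_Oracle.connect("castor_ns", "secret", "pluto-host.example:1521/ns")
      print("Sequence jumped to", jump_sequence(conn))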

On Friday 16th October the PLUTO database was restored on a non-production system to a point the day before the final crash (Saturday 3rd October 2009, 04:21:47 UTC). However, synchronisation (which had been running since Castor was restarted) had already deleted the corresponding files on disk. This restore therefore only enabled the experiments to be supplied with a list of files written to Castor during the window of the database rewind, and hence lost to Castor. As the restore was not quite to the point of the crash, the list of files lost in the final day was extracted from the Castor log files.
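
In effect, the list of lost files is the set difference between the namespace as restored on the non-production system (3rd October) and the live, rewound namespace (24th September). A minimal sketch of that comparison is given below, assuming each namespace has already been dumped to a text file with one path per line; the dump mechanism and file names are placeholders.

  # Hedged sketch: given two namespace dumps (one file path per line), list the
  # paths present in the restored 3rd-October copy but missing from the live,
  # rewound database, i.e. the files lost to Castor. File names are placeholders.

  def load_paths(filename):
      with open(filename) as f:
          return {line.strip() for line in f if line.strip()}

  def lost_files(restored_dump, live_dump):
      return sorted(load_paths(restored_dump) - load_paths(live_dump))

  if __name__ == "__main__":
      for path in lost_files("namespace_restored_3oct.txt", "namespace_live_rewound.txt"):
          print(path)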

Future mitigation:

Issue: Data Integrity
Response: The main goal is to have confidence that the database is always running on the most up-to-date disk group. Database Services are working with Oracle to identify best practice in re-syncing “down” disk groups and in making sure the database runs on the “latest” disk group when restarted. A way needs to be identified to ensure the database opens on the correct disk group automatically, every time.

Issue: Database Architecture
Response: A complete review of the current database ASM setup is taking place. Database Services are working with Oracle and CERN to investigate the current storage policy further and to identify appropriate alternatives (such as mirroring at the hardware level). Availability, redundancy and recoverability will be taken into account.

Issue: Database Backup and Recovery Procedures
Response: The RAL Database Services team continues to amend its policies and was already in the process of analysing and practising over 20 different recovery scenarios. The team will continue to gain knowledge and experience in these complicated areas and will work with other institutions to refine and discuss its policies and procedures. When recovering a database, the team has noted it is vital to switch off all file deletion (i.e. through RMAN's "DELETE OBSOLETE") until the database has been recovered and verified as correct.

Issue: Database Log Checking & Time Constraints
Response: More thorough checking of the restore logs, by at least two database administrators. It is important that time is taken during the restore and recovery procedures, that actions are checked carefully and fully logged, and that the timeline is written down as it happens. Procedures will be updated to clarify exactly what to check for in the backup and recovery logs (a log-scanning sketch follows this table). There is pressure on the team to restore service availability, but attention to detail must not be compromised, as the consequences of a mistake are worse than a delay.

Issue: Complexity of Operations During the Incident
Response: There were attempts to restore the databases to a number of different points in time during the recovery. Whilst each decision was made with the aim of minimising data loss, it is clear that these conflicting requests increased the complexity of the restore tasks. This was compounded by the restores to the initial disk arrays suffering from continued hardware unreliability, and by the concern to restore services. Incident procedures should be reviewed to ensure priorities are not confused and that any database restores are to points thought through well in advance.

Issue: Faulty Database Not Detected Before Castor Restarted
Response: The Castor team put a lot of effort into checking the system before re-enabling services on 9th October; nevertheless the problem remained undetected. Generic tests to verify the data in Castor need to be devised and used, for example verifying recently written files against the LFC (a verification sketch follows this table). Linked to this, a start-up procedure that includes a period in read-only mode, or with synchronisation disabled, should be investigated.
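
As a simple aid to the log double-check described in the "Database Log Checking & Time Constraints" entry above, the sketch below scans a plain-text restore/recovery log and surfaces Oracle error codes and any lines stating the recovery target, for a second administrator to confirm. The log format assumptions and file path are placeholders.

  # Hedged sketch: surface the lines of a restore/recovery log that two DBAs
  # should confirm by eye: Oracle error codes (ORA-/RMAN-) and any lines stating
  # the recovery target ("until time"). The log file path is a placeholder.
  import re
  import sys

  ERROR_RE = re.compile(r"\b(ORA|RMAN)-\d+")
  TARGET_RE = re.compile(r"until time", re.IGNORECASE)

  def review_log(path):
      errors, targets = [], []
      with open(path, errors="replace") as f:
          for lineno, line in enumerate(f, 1):
              if ERROR_RE.search(line):
                  errors.append((lineno, line.rstrip()))
              if TARGET_RE.search(line):
                  targets.append((lineno, line.rstrip()))
      print("Recovery target lines (confirm the point in time is the intended one):")
      for lineno, line in targets:
          print(f"  {lineno}: {line}")
      print("Error lines (each must be explained before the restore is accepted):")
      for lineno, line in errors:
          print(f"  {lineno}: {line}")

  if __name__ == "__main__":
      review_log(sys.argv[1] if len(sys.argv) > 1 else "restore.log")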
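
For the generic verification suggested in the last entry above, one possible direction is to confirm that files the LFC believes were recently written are still visible in the Castor namespace. This is a rough sketch only: it assumes a list of Castor paths for recently registered files has already been extracted from the LFC (that extraction is not shown), and that the standard CASTOR nsls command is available on the node running the check.

  # Hedged sketch: check that recently written files known to the catalogue are
  # still visible in the Castor namespace after a restart. Assumes an input file
  # of Castor paths (one per line) already extracted from the LFC, and that the
  # CASTOR "nsls" namespace-listing command is available on this node.
  import subprocess
  import sys

  def path_in_castor(path):
      # nsls exits non-zero if the path does not exist in the namespace.
      result = subprocess.run(["nsls", path], capture_output=True, text=True)
      return result.returncode == 0

  def verify(paths_file):
      missing = []
      with open(paths_file) as f:
          for line in f:
              path = line.strip()
              if path and not path_in_castor(path):
                  missing.append(path)
      if missing:
          print(f"{len(missing)} recently written files are missing from Castor:")
          for path in missing:
              print(" ", path)
          sys.exit(1)
      print("All recently written files are present in the Castor namespace.")

  if __name__ == "__main__":
      verify(sys.argv[1] if len(sys.argv) > 1 else "recent_lfc_paths.txt")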

Related issues: None.

Timeline

Actually started: 2009-10-09 17:00 (restart of Castor).
First realisation of a problem: 2009-10-12 (first report of missing files).
First announcement of problem: 2009-10-13 15:15 (announcement that the problem of missing files was being investigated).


Incident Details Timeline:

Date Time Who/What Entry
2009-10-05 14:00 Castor/Database Team Initial restore of NEPTUNE database to 15:20 on 3rd October.
2009-10-05 16:00 Tier1 Disaster Management Team Invoke Disaster Management Process
2009-10-05 17:50 Database Team Start restore of Neptune database to 04:00 on 4th October. (Change from initial restore following disaster management meeting).
2009-10-06 08:00 Database Team Start to Restore Neptune/Pluto to Last Consistent Moment Between Them (Sat 3-Oct 20:00).
2009-10-06 afternoon Database Team After space issues with a node on the PLUTO database, the database comes up in the afternoon, albeit on the undetected “old” disk group which last had data written to it on 24th September. Full database backup taken.
2009-10-07 Database Team Attempts are still being made to restore PLUTO to 3rd October but this is hampered by controlfile issues after the new incarnation had been created (since understood).
2009-10-08 09:00 Database Team Restore Neptune to Last Possible Point 04-Oct 04:00
2009-10-08 Database Team Restore PLUTO to Last Possible Point due to failure to restore to 3rd October. Restored to backup from 6th October – which only contained data to 24th September.
2009-10-09 17:00 Castor team Restart of Castor Services
2009-10-13 17:45 Matt Viljoen Reports that evidence has been found of FileIDs within Castor being re-used.
2009-10-14 13:21 Gareth Smith E-mail to experiments warning of data loss.
2009-10-15 12:23 Gareth Smith E-mail to experiments confirming data loss and giving details of time window.
2010-12-14 Gareth Smith Close this report. Confirm all mitigating actions done.