RAL Tier1 Incident 20080917 Castor data loss following catalogue synchronisation

Site: RAL-LCG2

Incident Date: 17/09/08

Severity: Field not defined yet

Service: CASTOR

Impacted: LHCB (ATLAS down but no lost data)

Incident Summary: The CASTOR catalogue synchronisation process accidentally deleted about 14,000 files. The cause is still not confirmed, but may be related to an ORACLE crosstalk problem. The LHCB and ATLAS CASTOR instances were down for about 17 hours (LHCB) and 12 hours (ATLAS).

Type of Impact: Data Loss & Downtime

Incident duration: 17 hours downtime for LHCB

Report date: 7/10/08  !!!(INTERIM - Updated)!!!

Reported by: Andrew Sansum

Related URLs: None

Incident details:

On 17/09/2008 at around 16:00, problems were experienced initially with the LHCB CASTOR instance and subsequently with the ATLAS instance. Other than failing transfers, no error messages were initially logged in CASTOR, but eventually ORA-1403 messages (no data found) began to appear. The database team was called out, but all databases appeared to be functioning normally with no error messages. The duty database on-call carried out a listener restart but was not convinced there was a fault in the database. The next morning, 18/09/08 at 09:40, the database was restarted and the fault was resolved.

On 22/09/2008 at 20:47 LHCB reported a missing file. An audit of all files on disk for lhcbDst was performed, which found that about 14,000 files had been lost. A list of all files was provided to Raja Nandakumar on Wednesday, and the loss was discussed with the UK LHCB representative at the regular RAL CASTOR review meeting on the same day.
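
An audit of this kind amounts to a set comparison between the files the catalogue expects and the files actually present on disk. The following is a minimal sketch in Python, not the RAL tooling: the function name `find_lost_files` and the example paths are hypothetical and chosen only for illustration.

```python
def find_lost_files(catalogue_paths, disk_paths):
    """Return catalogue entries that have no matching file on disk, sorted."""
    return sorted(set(catalogue_paths) - set(disk_paths))


# Hypothetical example data, for illustration only.
catalogue = ["/lhcb/a.dst", "/lhcb/b.dst", "/lhcb/c.dst"]
on_disk = ["/lhcb/a.dst"]

lost = find_lost_files(catalogue, on_disk)
print(lost)
```

Using sets keeps the comparison cheap even for tens of thousands of entries, since each membership check is constant time on average.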

Our current interim theory for the cause of the loss is that during the incident on 17/09, at 23:00, the CASTOR catalogue synchronisation process started. Files existing on the LHCB disk servers were looked up in the nameserver database, but possible crosstalk between the LHCB and ATLAS databases led to the lookup being carried out on the ATLAS database. When the name was not found in the database, CASTOR deleted the file from disk. Investigations to confirm that this is what actually happened are still under way.
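
The suspected failure mode can be illustrated with a small Python sketch. This is a toy model of the interim theory only, not the CASTOR implementation: the `synchronise` function, the in-memory sets standing in for the nameserver databases, and the file names are all assumptions made for illustration.

```python
def synchronise(disk_files, nameserver, delete):
    """Delete every disk file whose name is not found in the nameserver."""
    removed = []
    for name in disk_files:
        if name not in nameserver:  # lookup miss: file is treated as an orphan
            delete(name)
            removed.append(name)
    return removed


# Hypothetical catalogues standing in for each experiment's nameserver entries.
lhcb_ns = {"/lhcb/f1", "/lhcb/f2"}
atlas_ns = {"/atlas/g1"}
lhcb_disk = ["/lhcb/f1", "/lhcb/f2"]

deleted = []  # records what a real run would have removed from disk

# Correct lookup: every LHCB file is found, so nothing is removed.
ok = synchronise(lhcb_disk, lhcb_ns, deleted.append)

# Suspected crosstalk: LHCB files are checked against the ATLAS catalogue,
# every lookup misses, and every file is deleted.
bad = synchronise(lhcb_disk, atlas_ns, deleted.append)
```

In this model a lookup against the wrong catalogue is indistinguishable from a genuinely orphaned file, which is why a misdirected query would lead directly to deletion.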

Future mitigation:

Investigation is still under way and the cause is not yet identified. As the current suspicion is a problem caused by the ORACLE RAC infrastructure, the following actions were agreed at the weekly CASTOR incident review on Wednesday 24th September:

1. Switch off synchronization (Bonny)

2. Prepare for patching the ORACLE systems on Neptune and Pluto on Tuesday next week. (Carmine)

3. Follow up with ORACLE about our error and whether it still exists in our version. (Carmine)

4. Consider separating the ATLAS and LHCb database instances. (DC – discussion scheduled for Friday)

5. Consider temporary CASTOR build which is schema specific – to be discussed with CERN (DC/SDW)

6. Ask ATLAS to postpone further bulk deletions (Brian)

7. Start to generate post mortem for GRIDPP (Andrew)

Subsequent to Wednesday:

1. Synchronisation is "switched off" (actually set to occur every 2000 years).

2. Done

3. Oracle have been contacted; no production upgrades to Oracle will be made until they respond.

4. The manpower cost of further separation is too high to justify until crosstalk has been convincingly proved. It is worth pursuing the possibility of further funding for making CASTOR bomb-proof (new action DC).

5. Done. Concluded to be too error-prone.

6. Done. ATLAS bulk deletes are confirmed to be off.

CASTOR and database teams to further investigate the root cause of the LHCB lost files, considering possible alternative causes in addition to crosstalk. (BS/GB)


It has not yet been proved that crosstalk occurred in the ORACLE RACs, although circumstantial evidence suggests this is so. Logfile retrieval from backup is under way to allow the database team to confirm.

Update of 7 October 2008

Investigations into the cause of the problem (lost files on LHCb) have proved inconclusive. There is a strong hint that we have seen an Oracle bug of SQL executing in the wrong schema, but it is difficult to prove.

We cannot recreate the problem on the production database as we can’t risk losing more files. With that in mind, we will use the test cluster to try to simulate what happened (with logging and auditing switched on).

Oracle are concerned about the state of the data dictionary for the database that had the problems and have asked us to reinstall it. We are looking at doing this - and applying the patches noted from CERN and Oracle.

Oracle have said that the “SQL executing in wrong schema” bug has not been seen in the version (10.2.0.3) we are running. The duplicate Service Request is still open (SR 19633370.6) and they are helping with other errors in this database.

While we will continue to investigate this problem, I think it is unlikely that we can prove the cause or recreate the problem. I will update you if we do find anything of note.


Timeline

Event | Date | Time | Comment
Actually Started | 17/09/2008 | 16:10 | Close to this time anyway
Fault first detected | 17/09/2008 | 16:10 | Initially CASTOR admin spotted large pending LSF queue, soon followed by Nagios/SAM
First Advisory Issued | 17/09/2008 | 18:26 | Unscheduled downtime announced until 13:00
First Intervention | 17/09/2008 | | Soon after initial alarm. Handled by daytime operations
Fault Fixed | 18/09/2008 | 09:40 | Database restarted
Announced as Fixed | 18/09/2008 | 10:30 | Downtime revoked and GRIDPP-USERS
Downtime(s) Logged in GOCDB | 17/09/2008 | 17:16 UTC | To 18/09/2008 09:45 UTC. Unscheduled downtime on LHCB instance
Downtime(s) Logged in GOCDB | 17/09/2008 | 22:31 UTC | To 18/09/2008 09:30 UTC. Unscheduled downtime on ATLAS instance
Other Advisories Issued | 17/09/2008 | 23:53 | ATLAS UK Computer operations notified of downtime
Other Advisories Issued | 18/09/2008 | 07:00 | Raja Nandakumar notified by on-call that LHCB instance down
Other Advisories Issued | 18/09/2008 | 09:44 | GRIDPP Users notified by night-time on-call that ATLAS and LHCB instances are down
Other Advisories Issued | 18/09/2008 | 10:30 | CASTOR users notified that instances are back up
Other Advisories Issued | 18/09/2008 | 10:39 | GRIDPP-Users notified that instance is back up
Other Advisories Issued | 18/09/2008 | 10:39 | ATLAS UK operations notified that instance is back up
Other Advisories Issued | 23/09/2008 | 11:57 | Raja notified that we had lost unknown number of LHCB files
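
The downtime windows logged in GOCDB can be converted to durations with a few lines of Python. This is a sketch for cross-checking only: the helper name `downtime_hours` is hypothetical, and the timestamps are the GOCDB start and end times from the timeline, both taken to be UTC.

```python
from datetime import datetime

FMT = "%d/%m/%Y %H:%M"


def downtime_hours(start, end):
    """Return the length of a downtime window in hours."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 3600


# GOCDB-logged windows from the timeline.
lhcb = downtime_hours("17/09/2008 17:16", "18/09/2008 09:45")
atlas = downtime_hours("17/09/2008 22:31", "18/09/2008 09:30")

print(f"LHCB: {lhcb:.1f} h, ATLAS: {atlas:.1f} h")
```

These figures cover only the periods declared in GOCDB. The approximate durations quoted in the summary may be counted from the onset of the fault rather than from the declared start of the downtime.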