RAL Tier1 Incident 20100212 Tape problems led to data loss

From GridPP Wiki
Revision as of 15:13, 28 May 2013 by Gareth smith (Talk | contribs)

Two separate problems with tapes led to file (data) loss

Site: RAL-LCG2

Incident Date: 2010-02-12

Severity: Severe

Service: CASTOR Tape

Impacted: CMS

Incident Summary: Two separate problems were discovered on two tapes. While the issues were unconnected, both led to file (data) loss.

Type of Impact: Data Loss

Incident duration: N/A

Report date: 2010-02-15

Reported by: Gareth Smith, Tim Folkes

Related URLs: None

Incident details:

As a result of routine tape monitoring during a 'repack' operation, problems were found on two tapes that contained CMS data. The tapes were written at widely separated times, and the root causes of the two faults are not connected.

First Tape. Tape CS1472 was found to be giving hard errors on reads and writes. 102 files were lost. Tests showed that the tape media is defective.

Second Tape. Tape CS3410 was found to have problems reading files. Investigation showed that although Castor claims to have written 327 files all the way to the end of the tape, the double tape mark indicating the end of the data was after file position 222. It is not possible to tell whether data was written after this point. 105 files were lost (327 claimed minus 222 readable). The first 222 files on the tape are available.
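The CS3410 symptom can be illustrated with a small simulation. The sketch below is not Castor code and does not reflect the real tape label format: it assumes a simplified, hypothetical model in which each file is one data block followed by a single tape mark, and two consecutive tape marks indicate the end of data. The names `TAPE_MARK`, `build_tape` and `files_before_end_of_data` are invented for illustration.

```python
from typing import List, Optional

TAPE_MARK = "TM"  # hypothetical stand-in for a tape mark on the media


def build_tape(files_before: int, files_after: int) -> List[str]:
    """Simulate a tape with a premature double tape mark after `files_before` files."""
    tape: List[str] = []
    for i in range(files_before):
        tape += [f"data-{i}", TAPE_MARK]
    tape.append(TAPE_MARK)  # second consecutive mark: logical end of data
    for i in range(files_after):
        tape += [f"data-{files_before + i}", TAPE_MARK]
    tape.append(TAPE_MARK)
    return tape


def files_before_end_of_data(blocks: List[str]) -> Optional[int]:
    """Count complete files before the first double tape mark (None if no double mark)."""
    files = 0
    in_file = False
    previous_was_mark = False
    for block in blocks:
        if block == TAPE_MARK:
            if previous_was_mark:
                return files  # double tape mark reached
            if in_file:
                files += 1
                in_file = False
            previous_was_mark = True
        else:
            in_file = True
            previous_was_mark = False
    return None


catalogue_count = 327  # files the catalogue claims were written (from the report)
readable = files_before_end_of_data(build_tape(222, 105))
lost = catalogue_count - readable
```

With the figures from this incident, `readable` is 222 and `lost` is 105. A reader that stops at the first double tape mark never sees anything written after it, which is why it cannot be determined whether later data exists on the tape.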

Approximately 50 of the files were recovered from disk. However, as both tapes were in the Castor D0T1 service class, it was not expected that many disk copies would be available locally at RAL.


Future mitigation:

Further recovery efforts could be considered, such as sending tape CS1472 to a professional data recovery company for examination. However, the CS3410 fault was a software error rather than a media error, and we do not know whether such a company would be able to read past a double tape mark.

A review of proactive procedures should be undertaken to assess whether any additional measures are needed (such as 'scrubbing', or continually reading tapes, as done elsewhere). This in turn requires monitoring of our tape failure rates so that they can be compared with industry averages.

Note: Added May 2013 when closing this incident: Tape scrubbing or 'verification' (as referred to in Castor) has been enabled and is now in regular use to validate tapes.
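The general idea behind tape verification can be sketched as follows. This is a minimal illustration only, not the Castor verification tool: the catalogue layout, the `read_file` callback and the function names are all assumptions, and Adler-32 is used simply because it is available in Python's standard library.

```python
import zlib
from typing import Callable, Dict, List, Optional


def adler32_hex(data: bytes) -> str:
    """Return the Adler-32 checksum of `data` as 8 hex digits."""
    return format(zlib.adler32(data) & 0xFFFFFFFF, "08x")


def verify_tape(
    catalogue: Dict[str, str],
    read_file: Callable[[str], Optional[bytes]],
) -> List[str]:
    """Return the names of files that are unreadable or fail their checksum.

    `catalogue` maps a file name to its expected checksum (hypothetical layout).
    `read_file` returns the file contents read back from tape, or None on a read error.
    """
    failed = []
    for name, expected in catalogue.items():
        data = read_file(name)
        if data is None or adler32_hex(data) != expected:
            failed.append(name)
    return failed


# Simulated tape contents: "b" is corrupted and "c" cannot be read at all.
catalogue = {
    "a": adler32_hex(b"hello"),
    "b": adler32_hex(b"world"),
    "c": adler32_hex(b"missing"),
}
on_tape = {"a": b"hello", "b": b"w0rld"}
failed = verify_tape(catalogue, on_tape.get)
```

Here `failed` lists files "b" and "c". Running such a check periodically would flag both kinds of fault seen in this incident (unreadable media and files the catalogue lists but the tape does not return) before a user read or repack encounters them.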


Related issues:

None

Timeline

Event | Date | Time | Comment
CS1472 tape written | 2010-01-07 | | CMS data migrated to tape no:
CS3410 tape written | 2009-10-17 | | CMS data migrated to tape no:
Problem first discovered on tape CS1472 | 2010-02-11 | | During repack
Problem first discovered on tape CS3410 | 2010-02-10 (or thereabouts) | | During user read
CMS notified of issue | approx. 2010-02-11 | |
CMS provided with list of files lost | 2010-02-12 | |