RAL Tier1 Incident 20110202 Tape Data Loss LHCb

Description

Errors were seen relating to a tape holding LHCb data. On investigation it was found that part of the data on the tape had been overwritten by a faulty tape drive. A total of 78 LHCb data files were declared lost.

Impact

The loss of 78 LHCb files from the lhcbRawRdst service class. All of the lost files were RAW data.

Timeline of the Incident

When | What
18th Nov 21:53 | Tape drive event: the RVA reported a soft-write error and the tape library registered an event on the tape drive. The tape was marked as non-writeable.
Subsequently | A tpdump of the tape was taken. This did show problems, but the usual checking of such output missed them because of their unusual nature. The tape was marked as OK for read/write.
1st February 11:45 | The Tier1 tape system manager notified the rest of the team of a problem with tape CS7541 (LHCb). Some recoverable data from the tape had been copied elsewhere. Efforts continued to understand the problem and to see whether further data recovery was possible.
1st February 12:00 | Garbage collection was turned off for all LHCb disk servers to ensure that no file copies that might still be on disk were deleted.
1st February 14:45 | A GGUS ticket was created by the RAL Tier1 to inform LHCb formally of the possible data loss.
2nd February 10:00 | The data loss was confirmed to LHCb.
2nd February 12:00 | Garbage collection was turned back on for the LHCb disk servers, following the recovery of as many files from the faulty tape as possible.
3rd February | LHCb copied the lost files back from replicas (at CERN).

Incident details

On Tuesday 1st February it was noticed that tape CS7541 was giving "Incorrect or missing trailer label on tape" error messages.

The tape was repacked to rescue those files that could be read. This left 279 files that Castor still believed were on the tape but that could not be copied off. Of these, 201 were still on disk and were written out to tape again. The remaining 78 files were declared lost to LHCb.
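To make the recovery accounting concrete, the sketch below splits the files Castor still catalogued on the tape into those recoverable from disk copies and those lost. It is illustrative only; the file names and the plan_recovery helper are hypothetical, not the actual Castor tooling used during the incident.

    # Illustrative sketch only: names and file lists are hypothetical.
    def plan_recovery(on_tape_per_castor, still_on_disk):
        """Split the files Castor believes are on a bad tape into those
        that can be rewritten from disk copies and those that are lost."""
        recoverable = on_tape_per_castor & still_on_disk
        lost = on_tape_per_castor - still_on_disk
        return recoverable, lost

    # In this incident: 279 files were still catalogued on CS7541,
    # 201 of them had disk copies, leaving 78 lost.
    on_tape = {f"lhcb_raw_{i:04d}" for i in range(1, 280)}
    on_disk = {f"lhcb_raw_{i:04d}" for i in range(1, 202)}
    recoverable, lost = plan_recovery(on_tape, on_disk)
    assert (len(recoverable), len(lost)) == (201, 78)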

Analysis

The corruption of the data on tape CS7541 was discovered on Tuesday 1st February, when errors were reported while trying to write further files to the tape. Initial attempts were made to recover files from the tape, along with files that were still on the staging disks. Garbage collection was turned off for the LHCb disk servers in order to ensure that no further files were deleted from the staging area.

An analysis of the tape showed that the first 7 files were OK, but that the trailer label of the 7th file had been overwritten by file 513. Files 513-518 were also OK. The upshot is that files 8-512 were lost. Some of these had already been deleted, which left the 279 files to recover.
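The readable-file pattern amounts to a gap search over the tape's file sequence numbers. The following is a minimal sketch, assuming a tpdump-style listing has already been reduced to the list of sequence numbers that could be read; missing_ranges is a hypothetical helper, not part of any Castor tool.

    # Sketch: find which file sequence numbers are missing from a tape,
    # given the numbers that a tpdump-style listing shows as readable.
    def missing_ranges(seen, expected_max):
        """Return (first, last) ranges of sequence numbers absent from
        'seen', i.e. files overwritten or otherwise unreadable."""
        seen = set(seen)
        gaps, start = [], None
        for n in range(1, expected_max + 1):
            if n not in seen and start is None:
                start = n
            elif n in seen and start is not None:
                gaps.append((start, n - 1))
                start = None
        if start is not None:
            gaps.append((start, expected_max))
        return gaps

    # On CS7541 only files 1-7 and 513-518 were readable:
    print(missing_ranges(list(range(1, 8)) + list(range(513, 519)), 518))
    # -> [(8, 512)]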

The overwrite occurred on 18th November. At 00:27 that day the RVA logs show a "soft error"; the tape library logs show a "drive error"; and the tape server log shows files 512-514 being written at the time:

    Nov 18 00:29:57 rtcpd[17850,3]: CPDSKTP ! I/O ERROR WRITING ON TAPE: wrttpmrk: TP042 - PATH17850 : ioctl error : No sense
    (BLOCK # 2) CPDSKTP ! TAPE IS NOW INCORRECTLY TERMINATED
    Nov 18 00:29:58 rtcpd[17850]: request failed
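The follow-up actions below propose regular checks of the logs for this type of event. As an illustration only, such a check could match the rtcpd error signatures quoted above; the scan_tape_log helper, the log path, and the assumption that these message formats are stable across Castor releases are all hypothetical.

    import re

    # Error signatures taken from the rtcpd messages quoted above.
    SIGNATURES = [
        re.compile(r"I/O ERROR WRITING ON TAPE"),
        re.compile(r"TAPE IS NOW INCORRECTLY TERMINATED"),
    ]

    def scan_tape_log(path):
        """Return log lines matching any known write-error signature,
        so they can be flagged for operator attention."""
        hits = []
        with open(path, errors="replace") as log:
            for line in log:
                if any(sig.search(line) for sig in SIGNATURES):
                    hits.append(line.rstrip())
        return hits

    # Hypothetical path; the real log location depends on the deployment.
    # for line in scan_tape_log("/var/log/castor/rtcpd.log"):
    #     print(line)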

Subsequently the drive on this server (castor201) was replaced.

It is noted that, had the tape been repacked at the time, or had some tape verification (scrubbing) process been in place, the bad files would have been identified earlier.

The discovery of this problem in the tape activity logs was delayed because a separate, unrelated problem with migrating LHCb files to tape was generating many error messages at the same time.

It should be noted that no incidents like this were seen when all CMS data was migrated from T10KA to T10KB tapes. The remaining data (for all other VOs) will be migrated from the T10KA tapes starting in 2011, and this will pick up any other instances of this problem. The planned upgrade to Castor 2.1.10 will add the capability to do tape 'scrubbing', a verification of tape contents.
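Scrubbing in this sense means re-reading each file on a tape and comparing a freshly computed checksum with the value held in the catalogue. The sketch below illustrates the idea only; it is not the Castor 2.1.10 implementation, and the choice of adler32 and the form of the catalogue value are assumptions.

    import zlib

    def verify_file(local_copy_path, catalogue_adler32):
        """Re-read a file recalled from tape and compare its adler32
        checksum against the catalogue value; a mismatch marks the
        tape as suspect. The checksum type is an assumption here."""
        checksum = 1  # adler32 seed value
        with open(local_copy_path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                checksum = zlib.adler32(chunk, checksum)
        return format(checksum & 0xFFFFFFFF, "08x") == catalogue_adler32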

Follow Up

Issue | Response | Done
Root cause | Probably a hardware problem in a tape drive that has since been replaced, as it showed other errors. | Yes
Could there be other tapes with this problem, and what can we do to ensure it does not happen again? | The tape library logs have been checked and no evidence of any other similar event(s) was found. A regular check of this log for this type of event should be instigated. | No
Could there be other tapes with this problem, and what can we do to ensure it does not happen again? | A regular check through the Castor logs of tape access for this type of problem should be instigated. | No

Related issues

Although this is the first time we have seen one of these overwrites, the following Post Mortem refers to another type of data loss from tape. Notably, it also proposed using tape scrubbing (referred to above) to validate tape contents.

https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20100212_Tape_problems_led_to_data_loss


Reported by: Tim Folkes & Gareth Smith on 4th February 2011

Summary Table

Start Date | 1st February 2011
Impact | Loss of data
Duration of Outage | N/A
Status | Open
Root Cause | Hardware
Data Loss | Yes