RAL Tier1 Incident 20101216 Broken Tape Data Loss Alice

From GridPP Wiki

RAL Tier1 Incident 16th December 2010: Broken Tape Leads to Data Loss for Alice.

Description:

During normal operations, tape CS2806 broke, resulting in data loss for the Alice VO.

Impact

At the time of the break, tape CS2806 contained a total of 11347 files. Of these, 3478 were recovered from staging disk, leaving 7869 files lost. The list of lost files was provided to Alice, who responded that all the files had replicas elsewhere.

Timeline of the Incident

When What
13th Dec 2010 Errors detected on drive. Problem reported to Sun/Oracle (new helpdesk)
14th Dec 2010 Sun/Oracle responded with a request for more information. Engineer assigned.
15th Dec 2010 Engineer on site. Reported tape broken. New drive ordered. Started building list of lost files/those still on disk
16th Dec 2010 Drive replaced. Some files recovered via repack scheme.
17th Dec 2010 Recover files still on staging disk by reading out and writing back in. Left over weekend recovering files
20th Dec 2010 16:58 E-mail to ALICE informing them of the data loss and providing the list of lost files.
21st Dec 2010 09:36 Response from ALICE. Checking files; in principle all have replicas.

Incident details

Tape CS2806 broke during normal read/write (and rewind) operations. The initial symptom was that the tape could not be removed from the drive. The maintenance engineer was called and discovered the tape broken inside the drive. The engineer removed the tape and replaced the tape drive.

The tape contained 11347 files. Once the breakage was known, as many of the files as possible were recovered from the staging disk (a total of 3478) and put back into Castor (and on to tape). Alice were then informed of the data loss and provided with a list of the 7869 lost files. These files were then deleted from Castor by Alice. During January all of the lost files were deleted, and the tape was then deleted from the system.
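The split between recoverable and lost files described above amounts to comparing the tape's file list against the files still present on staging disk. The following is a minimal sketch of that comparison, not the procedure actually used at RAL: the function name `split_tape_files` and the placeholder file names are assumptions made for illustration, and only the counts (11347, 3478, 7869) come from this report.

```python
def split_tape_files(tape_files, staged_files):
    """Partition the files on a broken tape into recoverable and lost.

    A file counts as recoverable if a copy is still on staging disk;
    everything else on the tape is treated as lost.
    """
    tape = set(tape_files)
    staged = set(staged_files)
    recoverable = sorted(tape & staged)  # on tape and still on disk
    lost = sorted(tape - staged)         # on tape only
    return recoverable, lost


# Placeholder names (hypothetical); only the counts match the incident.
tape_files = [f"file_{i:05d}" for i in range(11347)]
staged_files = tape_files[:3478]

recoverable, lost = split_tape_files(tape_files, staged_files)
print(len(recoverable), len(lost))  # 3478 7869
```

Both lists are available as soon as the tape and disk catalogues have been read, so a provisional list of files at risk can be produced before any recovery work starts.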

Analysis

Tape breakage is a rare occurrence. Why this particular tape broke is not known. However, it is noted that the tape contained a lot of small files. Writing and reading these requires many more tape stops and starts than an equivalent amount of data in large files, with a commensurate increase in the likelihood of a problem. The automatic tape monitoring did not show errors for this tape before it failed. This is the first tape breakage since the automatic monitoring of tape error rates was introduced approximately one year earlier.

The RAL Tier1 was prepared to send the tape for specialised data recovery had there been files that the VO (Alice) regarded as particularly important. This was not the case. However, part of the tape was destroyed during the breakage, so data recovery would not have been possible for all files.

There were some delays in notifying Alice of the tape breakage. Notably, the recovery of files still on the staging disk took place over a weekend and the notification was sent only once that was complete. Whilst there was no certainty about the list of recoverable (and therefore the list of lost) files until this operation was complete, a list of all the files on the tape could have been provided earlier. In some cases a VO could still have files on staging disks elsewhere. Giving a quick warning of possible data loss might enable files to be recovered from other sites.

Follow Up

Issue | Response | Done
There is a risk that the tape drive is faulty and could break more tapes. | It has been confirmed that the engineer replaced the tape drive. | Yes
Storing crucial data on a single tape, with an additional risk if there are many small files, may lead to unrealistic expectations from users. | Ensure users are aware of the risks of data loss and the implications of storing small files on tape. Note added following review on 28/06/11: This item has come up at a number of meetings etc. The users should already be aware of the limitations of tape storage. | N/A
Costs and time delays for sending a tape for specialist recovery are not known. | Obtain costings and timings for data recovery from broken or damaged tapes. | No

Reported by: Gareth Smith 23rd December 2010

Summary Table

Start Date 13th December 2010
Impact Data Loss
Duration of Outage N/A
Status Open
Root Cause Hardware
Data Loss yes