RAL Tier1 Incident 20101225 CMS Disk Server GDSS283 Data Loss

RAL Tier1 Incident 25th December 2010: CMS Disk Server GDSS283 Failed with Data Loss

Description:

Disk server GDSS283 (CMSFarmRead), part of a Disk0Tape1 service class, failed on Christmas Day. Investigations carried out once RAL reopened after the Christmas holiday found that all three Castor partitions were corrupt. 30 files that had not been migrated to tape were lost.

Impact

30 CMS files from the Winter10 MC re-digitization/re-reconstruction campaign were lost. They were invalidated and announced as lost to the CMS community on 14th January.

Timeline of the Incident

When | What
25th December 20:34 | Monitoring reports a read-only file system.
26th December 22:47 | Memory test started.
4th January | Lab re-opens after the Christmas holiday. Investigations continue.
4th January 10:59 | No errors found during the memory test.
Friday 7th January | fsck run; errors seen on one of the Castor partitions.
Tuesday 11th January 14:20 | Concluded that one of the Castor partitions is corrupt.
Friday 14th January 10:20 | After checks on individual files, concluded that files are bad on all partitions. 30 files declared lost.

Incident details

Disk server GDSS283 (CMSFarmRead), part of a Disk0Tape1 service class, failed on Christmas Day. Memory tests were started the following day and ran until 4th January without showing any errors. Further testing resumed on 4th January. A file system check (fsck) reported errors on one of the three Castor partitions on the RAID array, and initially it was thought that the 13 un-migrated files on that partition were all that had been lost. However, on checking it was found that files on all three Castor partitions were corrupt. All 30 files that had not been migrated to tape were lost.
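
The decisive step was checking the individual un-migrated files rather than relying on fsck alone. The sketch below illustrates one way such a check could be scripted, assuming a hypothetical list of expected Adler-32 checksums exported from the file catalogue; the partition mount points and the expected_checksums.txt format are illustrative assumptions, not the actual tooling used at RAL.

 #!/usr/bin/env python3
 # Sketch: verify un-migrated files on each Castor partition against
 # catalogue checksums. Mount points, the checksum-list format and the
 # use of Adler-32 are assumptions for illustration only.
 import os
 import sys
 import zlib

 # Hypothetical Castor partition mount points on the disk server.
 PARTITIONS = ["/exportstage/castor1", "/exportstage/castor2", "/exportstage/castor3"]

 def adler32_of(path, chunk=1 << 20):
     """Stream a file and return its Adler-32 checksum as 8 hex digits."""
     value = 1
     with open(path, "rb") as f:
         while True:
             block = f.read(chunk)
             if not block:
                 break
             value = zlib.adler32(block, value)
     return format(value & 0xFFFFFFFF, "08x")

 def load_expected(listing):
     """Read '<relative path> <adler32 hex>' pairs (assumed export format)."""
     expected = {}
     with open(listing) as f:
         for line in f:
             if not line.strip():
                 continue
             rel, checksum = line.split()
             expected[rel] = checksum.lower()
     return expected

 def main(listing="expected_checksums.txt"):
     expected = load_expected(listing)
     bad = []
     for rel, want in sorted(expected.items()):
         for part in PARTITIONS:
             path = os.path.join(part, rel)
             if os.path.exists(path):
                 got = adler32_of(path)
                 if got != want:
                     bad.append((path, want, got))
                 break
         else:
             # File not present on any partition.
             bad.append((rel, want, "missing"))
     for path, want, got in bad:
         print(f"BAD {path}: expected {want}, got {got}")
     print(f"{len(bad)} of {len(expected)} un-migrated files failed verification")
     return 1 if bad else 0

 if __name__ == "__main__":
     sys.exit(main(*sys.argv[1:]))

In this incident it was per-file checks of this kind, rather than fsck results, that revealed the corruption of files on all three partitions and not just the one flagged by fsck.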

Analysis

The failure of GDSS283 occurred on Christmas Day. Although memory tests were run from the following day, investigations into the system did not start in earnest until RAL reopened after the holiday on 4th January. Once investigations were under way there was no statement of the importance of the data that was potentially lost, and a corresponding lack of urgency in the investigations. Problems were initially reported on one of the three data partitions on the disk system, but there was a significant delay in checking the state of files on the other partitions. Although fsck reported no errors on these other partitions, all the un-migrated files on them were found to be corrupt; fsck verifies file system metadata consistency rather than file contents, so a clean fsck does not guarantee that the data within the files is intact.

The root cause of the failure is not yet known and further investigations should take place to try and understand this.

Follow Up

Issue | Response | Done
Root cause of the failure is not known. | Understand and analyse the failure with the aim of preventing further occurrences. Note added following review on 28/06/11: no further understanding is likely to be forthcoming at this stage. | N/A
Delays in investigating the problems on the disk server, and hence in announcing the data loss. | The disk server intervention procedure to be modified so that an assessment is made of the importance of any potential loss and an appropriate priority is assigned to the ticket/task. Note added following review on 28/06/11: procedures now in use do assess priorities. | Yes

Related issues

Post mortem reviews of other disk server failures have, for other reasons, also suggested reviews of the disk server intervention procedures. See:

https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20101108_Atlas_Disk_Server_GDSS398_Data_Loss

Reported by: Gareth Smith. 18th January 2011.

Summary Table

Start Date | 25 December 2010
Impact | Loss of 30 files.
Duration of Outage | 1 month
Status | Closed
Root Cause | Unknown
Data Loss | Yes