RAL Tier1 Incident 20101108 Atlas Disk Server GDSS398 Data Loss


RAL Tier1 Incident 8th November 2010: Atlas Disk Server GDSS398 Failed with Data Loss

Description

FSProbe reported errors on a disk server. On investigation, the partitions containing Castor data were found to be corrupt. It was possible to recover some files, but most of the data on the server was declared lost to Atlas.

Impact

The disk server contained 48026 files. Of these, some 13037 were unique, i.e. not replicated at any other site. It was subsequently possible to recover 595 files from the server. However, all of the remaining files were declared lost. By chance the server had been full and held very few unique files containing recent data, which reduced the operational impact for Atlas.

Timeline of the Incident

When What
7th November 09:04 FSProbe first reports problem.
8th November 08:21 Alarm that server gdss398 is down. Callout triggered.
8th November 09:00 Intervention started.
8th November 09:39 Machine restarted and automatically starts fsck of CASTOR partitions.
8th November 10:05 Machine exits fsck and presents a rescue shell. fsck started manually on the CASTOR partitions.
8th November 10:44 ATLAS ELOG submitted.
8th November 15:00 Discussion concludes that we'll leave fsck running overnight and start memtest in the morning.
9th November 09:38 Machine still running fsck from the previous day and many errors found in file system. Machine rebooted into memtest.
9th November 09:50 Informed Graeme Stewart of the severity of the problem, Savannah ticket setup to deal with file loss. List of files given to ATLAS.
9th November 14:15 The 34989 files that are available elsewhere are declared lost and the ATLAS deletion service starts copying them back to RAL.
9th November 14:28 Memtest found no errors in memory and no errors in any log files. Machine brought up with read-only file systems and handed to Brian Davies to recover as many critical files as possible.
10th November 10:25 The 4437 files on /exportstage/castor2 are declared lost as that partition is found to be empty.
10th November 16:05 Brian has managed to recover 595 files which have been put back into production. The remaining 8005 files on the disk server are declared lost.
15th November 77 files remain on the disk server. After consultation with ATLAS these are found to be dark data (which is why they weren't cleaned up) and are removed, leaving the disk server completely empty.

Incident details

Disk Server GDSS398, part of the AtlasDataDisk space token (D1T0), initially gave 'FSProbe' errors on the Sunday morning. This was not picked up until the Monday morning, when the system was removed from production. Investigations showed the file system had become corrupt, with one of the three Castor partitions being empty after the 'fsck' check had run. Atlas were informed promptly of the problems, but most of the files on the server had to be declared as lost.

Analysis

The initial failure of the server occurred over the weekend. However, there was no call-out on that failure (FSProbe) and the server remained in production for around 24 hours before being taken out of service.

This failure has been identified as one of a series that have occurred within the same batch of disk servers. The RAL Tier1 has around 60 of these servers, which use RAID 6 for the data array. The current analysis is that a failure (or other problem) of a single disk drive, something RAID 6 should tolerate transparently, is not handled correctly by the RAID controller. This in turn leads to corruption of the file systems. The corruption is severe and the file systems cannot be recovered (for example by fsck).

The operational response to the incident and liaison with the Atlas experiment worked well. However, analysis after the event has raised the question of whether running 'fsck' on a file system that is already known to be corrupt may reduce the number of files that can subsequently be recovered.
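One way to make that assessment without risking further damage is to run the file system check in a non-destructive mode and to salvage readable files from a read-only mount before any repair is attempted. The sketch below illustrates that ordering under the assumption of an ext-family file system; the device, mount point and destination paths are invented for illustration and are not taken from this incident.

#!/usr/bin/env python3
"""Illustrative sketch: assess a suspect file system without modifying it,
recover what can still be read, and only then consider a repairing fsck.
Device, mount point and destination names are assumptions."""

import subprocess

DEVICE = "/dev/sdb1"                # assumed Castor data partition
MOUNT_POINT = "/mnt/recovery"       # assumed (existing) temporary mount point
RECOVERY_TARGET = "/srv/recovered"  # assumed destination for salvaged files

# 1. Non-destructive check: '-n' opens the file system read-only and answers
#    'no' to every repair question, so nothing on disk is changed.
report = subprocess.run(["e2fsck", "-n", DEVICE],
                        capture_output=True, text=True)
print(report.stdout)

# 2. Mount read-only and copy out whatever is still readable before any
#    repair is attempted; a repairing fsck may discard damaged files.
subprocess.run(["mount", "-o", "ro", DEVICE, MOUNT_POINT], check=True)
subprocess.run(["rsync", "-a", "--ignore-errors",
                f"{MOUNT_POINT}/", RECOVERY_TARGET], check=False)
subprocess.run(["umount", MOUNT_POINT], check=True)

# 3. Only after the salvage step would a repairing run (e.g. 'e2fsck -y')
#    be considered, if the file system is to be returned to service.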

Follow Up

Issue: No callout on initial FSProbe error.
Response: Instigate a call-out on this error. Consider whether further automatic actions, such as setting the file systems read-only, should be triggered by this condition.
Done: Yes

Issue: Interaction with the Atlas experiment dependent on one or two experts.
Response: Document and test procedures for dealing with data issues for Atlas and ensure these can be used by those members of staff providing on-call cover.
Done: Yes

Issue: Following a failure leading to the corruption of a file system, running 'fsck' may exacerbate the problem.
Response: Review the handling of disk servers that fail with FSProbe or similar errors, with consideration given as to whether running a file system check (fsck) is appropriate or whether another procedure should be adopted.
Done: Yes

Issue: This server failed in a catastrophic manner leading to data loss; a set of failures of disk servers from the same batch, also resulting in data loss, has been identified.
Response: All servers in this batch have been removed from production so that analysis and investigation of the root cause can be carried out without affecting services.
Done: Yes

Related issues

Some months earlier there had also been data loss following the failure of an Atlas disk server (GDSS417), and several actions were generated as a result. This time the interaction with the Atlas experiment was more rapid, resulting in an earlier declaration of data as lost and thus enabling quicker recovery. However, this failure was also from the same batch of disk servers. See:

https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20100801_Disk_Server_Data_Loss_Atlas

Around two weeks after this server failure another one (GDSS391) from the same batch of servers also failed with data loss. See:

https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20101121_Atlas_Disk_Server_GDSS391_Data_Loss

Reported by: Gareth Smith, 12th November 2010

Summary Table

Start Date: 7 November 2010
Impact: >80% of the files on the server were lost.
Duration of Outage: 4 days
Status: Closed
Root Cause: Hardware
Data Loss: Yes