
RAL Tier1 Incident 21st November 2010: Atlas Disk Server GDSS391 Failed with Data Loss

Description

FSProbe reported errors on a disk server. On investigation, the file systems were found to be corrupt.

Impact

The disk server contained 27339 files, of which some 14930 were unique. It was subsequently possible to recover 337 files from the server; all remaining files were declared lost. This disk server loss occurred towards the end of the ATLAS autumn re-processing run, and without a rapid response RAL would have delayed the end of the entire campaign.
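
The loss figures above can be cross-checked with some simple arithmetic. The snippet below is a minimal Python sketch using only the counts quoted in this report; the variable names are illustrative and do not come from any RAL tooling.

 # File counts quoted in this report for GDSS391.
 total_files = 27339    # files resident on the disk server
 unique_files = 14930   # files for which this server held the only copy
 recovered = 337        # files later read back successfully (all from the unique set, per the timeline below)

 lost_unique = unique_files - recovered
 print(f"Unique files lost: {lost_unique}")  # 14593
 print(f"Fraction of unique files lost: {lost_unique / unique_files:.1%}")  # ~97.7%
 print(f"Fraction of resident files declared lost: {(total_files - recovered) / total_files:.1%}")  # ~98.8%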

Timeline of the Incident

When What
21st November 03:08 FSProbe first reports problem.
21st November 10:00 Alastair (who is monitoring ATLAS re-processing) notices failures on the ATLAS dashboard; the disk server has gone read-only. Contacts Ian, who is on call, and the disk server is taken out of production. List of affected files is generated.
21st November 11:49 ATLAS informed via ELOG about disk server being removed from production.
21st November 17:00 ATLAS submit Savannah tickets saying that the re-processing campaign is being affected by the loss of the disk server. Re-processing needs to be finished by 26th November!
22nd November 09:20 Machine found to have file system errors. Hardware investigations started; no immediately obvious hardware errors found. Machine came up and started an fsck due to unclean file systems.
22nd November 11:58 Memtest started
22nd November 13:55 Memtest passed. Manual fsck started as the automatic fsck had failed; this also did not fix the file system.
22nd November 14:19 All non-unique files are declared lost. 14930 unique files left on the disk server.
22nd November 18:30 Alastair contacts ADC experts informing them about problems with all SL08 disk servers.
22nd November 22:43 ATLAS identify 830 high priority files that need recovering in order to complete the re-processing.
23rd November 08:00 Alastair + Andrew S join ADC morning meeting. Decide to give RAL until after lunch to recover as many files as possible before declaring the rest lost and re-running them.
23rd November 15:06 Only 337 of the 830 high priority files were recoverable; the rest had bad checksums (see the checksum sketch after this timeline). The remainder of the files are declared lost.
 ?? November Decision taken to remove entire batch of servers from production.
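
Deciding which of the 830 high priority files were recoverable meant re-reading each candidate file and comparing its checksum on disk against a known-good reference (for ATLAS data this is typically an adler32 value stored in the file catalogue); files whose checksums did not match were treated as corrupt. The snippet below is a hypothetical Python sketch of that kind of check, assuming adler32 checksums and a simple in-memory dict standing in for the catalogue; the paths and catalogue structure are illustrative only, not the actual RAL tooling.

 import zlib

 def adler32_of_file(path, chunk_size=1024 * 1024):
     """Compute the adler32 checksum of a file, reading it in chunks."""
     value = 1  # adler32 initial value
     with open(path, "rb") as f:
         for chunk in iter(lambda: f.read(chunk_size), b""):
             value = zlib.adler32(chunk, value)
     return value & 0xFFFFFFFF

 # Hypothetical catalogue: file path -> expected adler32 (8-digit hex).
 catalogue = {
     "/atlas/datadisk/example.root": "0a1b2c3d",
 }

 recoverable, corrupt = [], []
 for path, expected in catalogue.items():
     try:
         actual = f"{adler32_of_file(path):08x}"
     except OSError:
         corrupt.append(path)  # unreadable files count as lost
         continue
     (recoverable if actual == expected else corrupt).append(path)

 print(f"Recoverable: {len(recoverable)}  Corrupt/lost: {len(corrupt)}")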

Incident details

Disk server GDSS391, part of the AtlasDataDisk space token (D1T0), reported problems in the early hours of Sunday morning, 21st November. The lack of a call-out meant that the problems on the server were not noticed for some hours, at which point the server was taken out of production.

The server did not crash but on investigation showed file system errors. File corruption was found and the data loss was reported to the VO (Atlas).

This server is part of the same batch of disk servers as GDSS398, which failed recently (see the linked post mortem below).

Analysis

This is another failure from the same batch of disk servers as in the linked post mortem. Detailed analysis and mitigation are included there. See:

https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20101108_Atlas_Disk_Server_GDSS398_Data_Loss

Follow Up

As this incident is very similar to the failure of GDSS398, refer to that post mortem for follow-up actions. See:

https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20101108_Atlas_Disk_Server_GDSS398_Data_Loss

Related issues

Some months earlier there had also been data loss following the failure of an Atlas disk server, and several actions were generated as a result. This time the interaction with the Atlas experiment was more rapid, resulting in an earlier declaration of the data as lost and thus enabling quicker recovery. See:

https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20100801_Disk_Server_Data_Loss_Atlas

There had also been another incident from this same batch only a couple of weeks earlier. See:

https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20101108_Atlas_Disk_Server_GDSS398_Data_Loss


Reported by: Gareth Smith. 12th November 2010

Summary Table

Start Date: 21 November 2010
Impact: >80% of files on server lost.
Duration of Outage: 2 days
Status: Closed
Root Cause: Unknown
Data Loss: Yes