RAL Tier1 Incident 20110106 CMS Disk Server GDSS496 Data Loss


RAL Tier1 Incident 6th January 2011: CMS Disk Server GDSS496 Failed with Data Loss

Description:

Disk server GDSS496 (CMSFarmRead), part of a Disk0Tape1 service class, failed on 6th January 2011. Two files that had not been migrated to tape at the time of the failure were lost.

Impact

Two unmigrated CMS files were lost. One was from MC rereco, the other was from the Dec22 re-reconstruction pass of the Run2010A EG primary dataset.

Timeline of the Incident

When                  What
6th January 12:31     Errors noted in logs.
6th January 14:25     Removed from production.
6th January 16:17     Started memory tests after no errors were found in system event logs, log files or RAID controller logs.
12th January 16:35    Checksums of the unmigrated files found to be incorrect.
13th January 09:21    Unmigrated files announced as lost to CMS.
13th January 09:36    Confirmation given to the Fabric Team that the machine could be wiped and investigated thoroughly.
17th January 09:47    Logs sent to Streamline for analysis.
1st February 17:17    Vendor finds the reason for the array failure in the logs: faulty drives, and auto-rebuild not on by default.
1st February 17:18    Work begins on fixing the array.

Incident details

The disk server showed SCSI errors in the operating system logs but no errors on the RAID card. Files were corrupted by a problem that the vendor later identified as two drive failures plus a bad sector on a third drive.

Analysis

Two drives were "missing" from the RAID6 array. RAID6 stores two parity blocks per stripe and can therefore tolerate at most two unreadable members, so when a bad sector/block was encountered on a third drive the controller could not re-create the data from parity and data was lost. It was also found that the controller had not rebuilt onto the hot spare when a drive had failed. This was due to a misconfiguration of the RAID controller, which is being corrected on all machines of this generation.
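
To make the arithmetic behind the failure concrete, the minimal Python sketch below expresses the RAID6 recovery condition. The drive counts are illustrative and the function is hypothetical; it is not the server's actual array geometry or the controller's logic.

    # Minimal sketch of RAID6 fault tolerance; numbers are illustrative only.

    RAID6_PARITY_DEVICES = 2  # RAID6 keeps two independent parity blocks per stripe


    def stripe_is_recoverable(missing_drives: int, bad_sectors_on_other_drives: int) -> bool:
        """A stripe can be rebuilt only while the number of unreadable
        members does not exceed the number of parity blocks."""
        return (missing_drives + bad_sectors_on_other_drives) <= RAID6_PARITY_DEVICES


    # Two drives already missing from the array: still recoverable, but with no margin.
    print(stripe_is_recoverable(missing_drives=2, bad_sectors_on_other_drives=0))  # True
    # A single bad sector on a third drive pushes the stripe past recovery.
    print(stripe_is_recoverable(missing_drives=2, bad_sectors_on_other_drives=1))  # False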

Despite the above, the Nagios monitoring on the server should have picked up the disk failures, and staff should have noticed them, allowing the disks to be replaced in a timely manner. The server had only been in production for several weeks, and a check of the Nagios monitoring did not show any previous disk failures on it. However, disk failures on other servers from this batch have been detected by the monitoring system.
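
For illustration, a monitoring check for this class of failure could look roughly like the sketch below. The raid_status command and its output format are hypothetical placeholders, not the tooling actually deployed at RAL; only the standard Nagios plugin exit-code convention (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN) is assumed.

    #!/usr/bin/env python3
    """Sketch of a Nagios-style RAID health check (hypothetical raid_status CLI)."""
    import subprocess
    import sys

    OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


    def main() -> int:
        try:
            # Hypothetical vendor CLI printing one line per array, e.g. "array0 DEGRADED 2 failed".
            output = subprocess.run(["raid_status", "--summary"],
                                    capture_output=True, text=True, check=True).stdout
        except (OSError, subprocess.CalledProcessError) as exc:
            print(f"UNKNOWN: could not query RAID controller: {exc}")
            return UNKNOWN

        degraded = [line for line in output.splitlines()
                    if "DEGRADED" in line or "FAILED" in line]
        if degraded:
            print("CRITICAL: degraded RAID array(s): " + "; ".join(degraded))
            return CRITICAL

        print("OK: all RAID arrays optimal")
        return OK


    if __name__ == "__main__":
        sys.exit(main())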

Follow Up

Issue: Root cause of failure.
Response: The root cause is understood.
Done: Yes

Issue: Machine did not auto-rebuild onto the hot spare.
Response: Correct the RAID controller configuration on this generation of machines to fix this problem.
Done: Yes

Issue: Machine was allowed to remain in a degraded state.
Response: Understand when the disk failures occurred and why they were not picked up and acted on by the monitoring system.
Done: Yes

Issue: Machine was allowed to get into a degraded state.
Response: Check all other machines of this generation to ensure they do not currently have any degraded RAID arrays (see the sketch after this list).
Done: Yes
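
The last follow-up action could, for example, be scripted along the lines of the sketch below. The hostnames and the raid_status/raid_get_config commands are hypothetical placeholders rather than the actual tooling used on this generation of machines.

    #!/usr/bin/env python3
    """Sketch of a fleet-wide check: no degraded arrays, auto-rebuild enabled."""
    import subprocess

    # Placeholder hostnames; a real run would use the host list for this batch.
    HOSTS = ["gdss-example-1", "gdss-example-2"]


    def run_on(host: str, command: str) -> str:
        """Run a command over SSH and return its output (empty string on failure)."""
        result = subprocess.run(["ssh", host, command], capture_output=True, text=True)
        return result.stdout if result.returncode == 0 else ""


    for host in HOSTS:
        arrays = run_on(host, "raid_status --summary")           # hypothetical CLI
        rebuild = run_on(host, "raid_get_config auto_rebuild")   # hypothetical CLI
        degraded = any(word in arrays for word in ("DEGRADED", "FAILED"))
        auto_rebuild_on = "enabled" in rebuild.lower()
        print(f"{host}: degraded={degraded}, auto_rebuild_enabled={auto_rebuild_on}")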

Related issues

Another disk server failure, also resulting in data loss for CMS, had occurred only a couple of weeks earlier:

https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20101225_CMS_Disk_Server_GDSS283_Data_Loss


Reported by: James Thorne on 8th February 2011

Summary Table

Start Date:           6 January 2011
Impact:               Data Loss
Duration of Outage:   N/A
Status:               Closed
Root Cause:           Hardware / Configuration Error
Data Loss:            Yes