RAL Tier1 Incident 20110106 CMS Disk Server GDSS496 Data Loss


RAL Tier1 Incident 6th January 2011: CMS Disk Server GDSS496 Failed with Data Loss

Description:

Disk server GDSS496 (CMSFarmRead), part of a Disk0Tape1 service class, failed on 6th January 2011. Two files that had not been migrated to tape at the time of the failure were lost.

Impact

Two unmigrated CMS files were lost. One was from MC rereco, the other was from the Dec22 re-reconstruction pass of the Run2010A EG primary dataset.

Timeline of the Incident

When                  What
6th January 12:31     Errors noted in logs.
6th January 14:25     Removed from production.
6th January 16:17     Started memory tests after no errors were found in system event logs, log files or RAID controller logs.
12th January 16:35    Checksums of the unmigrated files found to be incorrect.
13th January 09:21    Unmigrated files announced as lost to CMS.
13th January 09:36    Confirmation given to the Fabric Team that the machine could be wiped and investigated thoroughly.
17th January 09:47    Logs sent to Streamline for analysis.
1st February 17:17    Vendor finds the reason for the array failure in the logs: faulty drives, and auto-rebuild not on by default.
1st February 17:18    Work begins on fixing the array.

Incident details

The disk server showed SCSI errors in the operating system logs but no errors on the RAID card. Files were corrupted by a problem that the vendor later identified as two drive failures plus a bad sector on a third drive.

Analysis

Two drives were "missing" from the RAID6 array. RAID6 stores two parity blocks per stripe and can therefore tolerate at most two unreadable members, so when a bad sector/block was encountered on a third drive the controller could not re-create the data from parity and data was lost. It was also found that the controller had not rebuilt onto the hot spare when a drive had failed. This was due to a misconfiguration of the RAID controller, which is being corrected on all machines of this generation.
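
To make the arithmetic behind the failure concrete, the minimal Python sketch below expresses the RAID6 recovery condition. The drive counts are illustrative and the function is hypothetical; it is not the server's actual array geometry or the controller's logic.

    # Minimal sketch of RAID6 fault tolerance; numbers are illustrative only.

    RAID6_PARITY_DEVICES = 2  # RAID6 keeps two independent parity blocks per stripe


    def stripe_is_recoverable(missing_drives: int, bad_sectors_on_other_drives: int) -> bool:
        """A stripe can be rebuilt only while the number of unreadable
        members does not exceed the number of parity blocks."""
        return (missing_drives + bad_sectors_on_other_drives) <= RAID6_PARITY_DEVICES


    # Two drives already missing from the array: still recoverable, but with no margin.
    print(stripe_is_recoverable(missing_drives=2, bad_sectors_on_other_drives=0))  # True
    # A single bad sector on a third drive pushes the stripe past recovery.
    print(stripe_is_recoverable(missing_drives=2, bad_sectors_on_other_drives=1))  # False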

Despite the above, the Nagios monitoring on the server should have picked up the disk failures, and staff should have noticed them, allowing the disks to be replaced in a timely manner. The server had only been in production for several weeks, and a check of the Nagios monitoring did not show any previous disk failures on it. However, disk failures on other servers from this batch have been detected by the monitoring system.
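
For illustration, a monitoring check for this class of failure could look roughly like the sketch below. The raid_status command and its output format are hypothetical placeholders, not the tooling actually deployed at RAL; only the standard Nagios plugin exit-code convention (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN) is assumed.

    #!/usr/bin/env python3
    """Sketch of a Nagios-style RAID health check (hypothetical raid_status CLI)."""
    import subprocess
    import sys

    OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


    def main() -> int:
        try:
            # Hypothetical vendor CLI printing one line per array, e.g. "array0 DEGRADED 2 failed".
            output = subprocess.run(["raid_status", "--summary"],
                                    capture_output=True, text=True, check=True).stdout
        except (OSError, subprocess.CalledProcessError) as exc:
            print(f"UNKNOWN: could not query RAID controller: {exc}")
            return UNKNOWN

        degraded = [line for line in output.splitlines()
                    if "DEGRADED" in line or "FAILED" in line]
        if degraded:
            print("CRITICAL: degraded RAID array(s): " + "; ".join(degraded))
            return CRITICAL

        print("OK: all RAID arrays optimal")
        return OK


    if __name__ == "__main__":
        sys.exit(main())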

Follow Up

Issue: Root cause of failure.
Response: The root cause is understood.
Done: Yes

Issue: Machine did not auto-rebuild onto the hot spare.
Response: Correct the RAID controller configuration on this generation of machines to fix this problem.
Done: Yes

Issue: Machine was allowed to remain in a degraded state.
Response: Understand when the disk failures occurred and why they were not picked up and acted on by the monitoring system.
Done: Yes

Issue: Machine was allowed to get into a degraded state.
Response: Check all other machines of this generation to ensure they do not currently have any degraded RAID arrays (see the sketch after this list).
Done: Yes
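
The last follow-up action could, for example, be scripted along the lines of the sketch below. The hostnames and the raid_status/raid_get_config commands are hypothetical placeholders rather than the actual tooling used on this generation of machines.

    #!/usr/bin/env python3
    """Sketch of a fleet-wide check: no degraded arrays, auto-rebuild enabled."""
    import subprocess

    # Placeholder hostnames; a real run would use the host list for this batch.
    HOSTS = ["gdss-example-1", "gdss-example-2"]


    def run_on(host: str, command: str) -> str:
        """Run a command over SSH and return its output (empty string on failure)."""
        result = subprocess.run(["ssh", host, command], capture_output=True, text=True)
        return result.stdout if result.returncode == 0 else ""


    for host in HOSTS:
        arrays = run_on(host, "raid_status --summary")           # hypothetical CLI
        rebuild = run_on(host, "raid_get_config auto_rebuild")   # hypothetical CLI
        degraded = any(word in arrays for word in ("DEGRADED", "FAILED"))
        auto_rebuild_on = "enabled" in rebuild.lower()
        print(f"{host}: degraded={degraded}, auto_rebuild_enabled={auto_rebuild_on}")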

Related issues

Another disk server failure, also resulting in data loss for CMS, had occurred only a couple of weeks earlier:

https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20101225_CMS_Disk_Server_GDSS283_Data_Loss


Reported by: James Thorne on 8th February 2011

Summary Table

Start Date:           6 January 2011
Impact:               Data Loss
Duration of Outage:   N/A
Status:               Closed
Root Cause:           Hardware / Configuration Error
Data Loss:            Yes