RAL Tier1 Incident 20110330 Disk Server GDSS502 Data Loss T2K


RAL Tier1 Incident 30th March 2011: Disk Server GDSS502 Failed with Data Loss for T2K


The failure of a disk server from a D0T1 service class resulted in the loss of a T2K data file.


The failure of a disk server in a tape backed service class led to the loss of a single file for T2K. The remaining un-migrated files on the server were successfully copied to tape.

Timeline of the Incident

When What
3rd February 2011 First became aware of failed stripes in the RAID array.
16th February 2011 Failed stripes still present. A verify of the RAID array attempted.
21st February 2011 Verify does not resolve the failed stripes in the RAID array.
28th March 2011 Errors reported on disk device.
28th March 2011 09:30 System removed from production. (767 un-migrated files on it)
28th March 2011 10:20 System rebooted. Put into draining mode.
31st March 2011 Draining complete. All possible files migrated off.
31st March 2011 16:12 Loss of single file recorded in Tier1 ELOG.
31st March 2011 16:22 E-mail sent to T2K informing them of the loss of the file.
1st April 2011 15:00 Re-creation of RAID arrays.
5th April 2011 Finished initializing RAID arrays. Bad blocks found, indicating some faulty drives.

Incident details

Disk server GDSS502 is in the GENTape D0T1 service class in Castor and contains files belonging to T2K and other non-LHC experiments.

On 28th March 2011 the server reported errors on its disk subsystem. Following a short intervention (a reboot) the server was no longer showing errors. Nevertheless the server was drained, with the remaining un-migrated files copied to tape. However, one file could not be copied off the disk, and it was reported as lost to T2K on 31st March.

The raid card in the system had been reporting failed stripes.


Following a check, made for another reason (see related issue below), of the configuration of the RAID controller cards in a batch of disk servers (including GDSS502), it was noted that four servers were reporting failed stripes in their RAID subsystems.

Following a reported fault in GDSS502 the decision was taken to remove the server from production in order to rebuild the RAID array and resolve the failed stripes anomaly. This operation was carried out and 766 of the 767 un-migrated files on the server were successfully copied to tape. However, one file could not be recovered from the disk server and was declared lost to T2K.

It is not known whether the failed stripes error in the RAID controller was the main factor in the data loss, but this is likely. The same warning was present in three other disk servers. One of these systems had already been rebuilt; following this incident the remaining two were drained and their RAID arrays rebuilt before being put back into production.

A power failure that occurred some months earlier was thought to be a likely trigger for the RAID errors ("failed stripes"). However, the four machines are connected as follows:

Server   PDU  Circuit  Phase
gdss481  F2   21       C
gdss488  F3   21       B
gdss496  F3   21       B
gdss502  F3   21       A

This indicates that they were not all affected by the power outage. The four machines are however located in the same rack triplet, but this is not thought to be a contributory factor.

Investigations afterwards showed that the battery back-up for the RAID card cache was not enabled for this batch of disk servers. The hardware checks required before disk servers are deployed are not documented.

The "failed stripes" error is indicated by the Adaptec RAID controllers. The Adaptec management software shows systems with this error (amongst others) and we are now confident that it is not present on other servers. However, the error is not reported by the Nagios tests used to check the status of the Adaptec RAID cards. Other manufacturers' RAID cards report errors differently, and the locally written Nagios test that is run on those systems reports all error conditions.
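A Nagios test for this condition might look like the minimal sketch below. It assumes (this is an illustration, not the actual RAL plugin) that the Adaptec management utility's logical-device report contains a line of the form "Failed stripes : Yes/No", and follows the usual Nagios convention of exit code 0 for OK and 2 for CRITICAL:

```shell
# check_failed_stripes: a hypothetical Nagios-style check. Reads an
# Adaptec-controller logical-device report on stdin and returns 2
# (Nagios CRITICAL) if any "Failed stripes : Yes" line is present,
# otherwise 0 (OK). The report format is an assumption.
check_failed_stripes() {
    if grep -qi 'Failed stripes *: *Yes'; then
        echo "CRITICAL: failed stripes reported on RAID logical device"
        return 2
    fi
    echo "OK: no failed stripes reported"
    return 0
}
```

In production such a check would pipe in the output of the controller's management utility (e.g. something like `arcconf getconfig 1 ld`) rather than a saved report.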

RAID array verifications are automatically run regularly on disk servers to flush out consistency errors in the RAID arrays. However, these did not play a part in this incident.
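Such regular verify runs could be driven by an ordinary cron entry; the utility path and task arguments below are illustrative assumptions, not the actual RAL configuration:

```shell
# Hypothetical crontab fragment: start a verify of logical drive 0 on
# controller 1 every Sunday at 03:00 (utility name and arguments are
# assumptions for illustration only).
# m h dom mon dow  command
0 3 * * 0  /usr/sbin/arcconf task start 1 logicaldrive 0 verify noprompt
```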

Follow Up

Issue Response Done
Root cause Although we cannot be certain, it is most likely that the cause of the problem was the RAID inconsistency reported as the 'failed stripes' error. This in turn was most likely triggered by the power failure, compounded by the battery backup of the RAID cache memory not being enabled. Yes
Ensure these errors are acted upon quickly. Modify procedures such that any server showing a "failed stripes" error is quickly drained. No
Check for 'failed stripes' in any other disk servers. The Adaptec software has been used to validate the servers with Adaptec controllers. Errors on disk servers with other controller cards are picked up by the Nagios tests. Yes
Introduce an automated check for the 'failed stripes' error on the Adaptec RAID cards. Modify the Nagios tests used for disk servers with these RAID controllers. No
Reduce effect of power failure on the disk servers. It is planned that disk servers will be powered by the UPS. This should be implemented. No
Could other systems be mis-configured with the battery backup for the RAID cache disabled? A check has been made on existing systems. Documentation should be set up to include this check as part of the disk server deployment process. No

Related issues

A previous problem of a mis-configured RAID card setting has been found and resolved on this batch of disk servers. See: https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20110106_CMS_Disk_Server_GDSS496_Data_Loss

Reported by: Gareth Smith on 7th April 2011

Summary Table

Start Date 3rd February 2011
Impact Single File Data Loss
Duration of Outage N/A
Status Open
Root Cause Power Failure & Configuration Error
Data Loss Yes