RAL Tier1 Incident 20130219 Disk Server Failure File Loss

RAL Tier1 Incident 19th February 2013: Disk Server Failure File Loss.

Description

Multiple disk failures on a server led to a loss of data. A total of 68 files were lost, all of which belonged to T2K. The server was part of a tape-backed service class and these 68 files were those that had not been migrated to tape at the time of the failure.

Impact

The disk server forms part of the GenTape service class, which is tape-backed (D0T1). Most of the files on the server are available from tape. However, any files copied onto the server that had not yet been copied to tape by the time the server went down were not available to the VO.

During the first outage, following multiple disk failures, the server was unavailable for 2 days. There were a number of unmigrated files on the server at the time of the failure. These files would not have been available to the VO while the server was down, although the VO concerned deleted them during the server outage.

Following the second failure a total of 68 files were unavailable for 7 days (12 - 19 February) before finally being reported as unrecoverable. The VO reported that, of the 68 files lost, 36 were replicated elsewhere. The remainder are stages of Monte Carlo (MC) output for which the jobs can easily be rerun.

Timeline of the Incident

When What
August 2011 GDSS594 deployed into GenTape Service Class
7th September 2012 Checksum checker reported a single corrupt file on the server. This was a second copy of a file elsewhere in Castor and the garbage collector was left to clean it up. (RT#102530)
22nd January 2013 11:00 KH: Found multiple disk failures on GDSS594 through mimic and Adaptec Storage Manager and requested the Castor team to take it out of production (ports 7, 23 and 5).
22nd January 2013 11:06 GDSS594 taken out of service to fix the multi-disk failures.
22nd January 2013 13:16 KH: Replaced the faulty drive in port 7, which triggered an automatic rebuild onto the hotspare drive.
23rd January 2013 09:21 Rebuild completed successfully on port 7 but the controller was still showing port 7 as a spare drive. Updated the firmware on the RAID controller card.
23rd January 2013 11:48 DZ: Replaced 2nd faulty drive in port 23.
23rd January 2013 12:26 KH: Added port 23 as a hotspare drive and rebuild started automatically on port 23.
24th January 2013 07:48 Rebuild completed on port 23.
24th January 2013 08:50 KH: Replaced the 3rd and last faulty drive in port 5 and added it as a hotspare drive. Also informed the Production and Castor teams that the RAID array was stable and the system was ready to go back into production.
24th January 2013 11:00 Rob: Put gdss594 back into production after running a checksum check.
12th February 2013 16:40 Adaptec Storage Manager logs: "Logical device is suboptimal: controller 1, logical device 1 ("Device 1")."
12th February 2013 17:00 Kernel logging hardware problem: "sdb: Current: sense key: Hardware Error"
12th February 2013 17:03 Kernel starts logging filesystem problems. Eg: "Filesystem sdb1: xfs_log_force: error 5 returned."
12th February 2013 17:24 Callout for readonly filesystem on host gdss594
12th February 2013 18:19 Primary On-Call removed the server from production and updated its status.
12th February 2013 18:48 Fabric On-Call found multiple drive failures on host gdss594 during investigation (ports 11, 13 and 17).
12th February 2013 19:01 Fabric On-Call noted that host gdss594 had recently had multiple drive failures - ports 7, 23 and 5 (RT #107058). At that time the RAID array was recovered successfully.
12th February 2013 19:11 Fabric On-Call informed the Primary On-Call about the situation and discussed the plan for the intervention.
13th February 2013 10:00 DZ: In error, the drive in port 11 was replaced with a drive from gdss593 port 23, instead of from gdss611 as stated in the wiki. Rebuild does not start.
13th February 2013 10:10 DZ puts the drive from gdss594 port 11 back into gdss593 port 23.
13th February 2013 10:15 DZ removes the drive from gdss611 port 18 and installs it in gdss594 port 11. Rebuild does not start.
13th February 2013 10:20 DZ realises the wiki is wrong and that gdss611 is a production machine. The drive in gdss594 port 11 is reinserted back into gdss611 port 18. The rebuild in gdss611 starts.
13th February 2013 10:25 DZ attempts again with the drive from gdss593 port 23 in gdss594 port 11. Rebuild does not start.
13th February 2013 10:45 DZ reboots gdss594. Rebuild does not start. The controller reports the array as failed due to three missing members. (GDSS593 is from a different batch of disk servers, but the disk drives are identical.)
13th February 2013 11:00 DZ re-adds the original failed drive to port 11 of gdss594. The drive is now seen as present but cannot be brought online.
13th February 2013 13:30 DZ moves all the original RAID 6 drives from gdss594 to gdss607 and forces the array online via arcconf setstate 1 logicaldrive 1 OPTIMAL (see the command sketch after this timeline). Rebuilding starts.
14th February 2013 09:30 DZ checks the partition table of the array, which seems intact.
15th February 2013 13:00 DZ mounts the partitions and then immediately unmounts them. xfs_check is unable to run on sdb1, finds errors in sdb2, and locks up the machine when accessing sdb3.
15th February 2013 13:30 DZ reboots the system and rebuilding of the array resumes. DZ becomes aware of the long history of problems associated with gdss607.
17th February 2013 23:00 DZ: The array finished rebuilding at some point on Saturday. xfs_check freezes gdss607 while accessing sdb2. After a reboot 3 drives are missing and the array is failed.
18th February 2013 10:00 DZ: During a fabric discussion it is pointed out by Martin Bly that adding possibly problematic drives from gdss594 to a system with a long history of problems such as gdss607 has introduced unstable elements into the recovery attempt. Any further attempt using another disk server risks damaging an otherwise working disk server. To avoid that risk, and to contain the damage to gdss594, it is decided that further recovery efforts using other disk servers should stop.
19th February 2013 11:45 T2K informed files not recoverable.
20th February 2013 12:34 E-mail response from T2K indicating impact of file loss to them.
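
The commands below are a minimal sketch of the kind of arcconf invocations used during the recovery. The setstate command is the one quoted in the timeline; the getconfig calls are illustrative and assume controller 1 and logical device 1, as on gdss594/gdss607.

 # Inspect logical and physical device status on controller 1
 arcconf getconfig 1 ld      # logical device status (Optimal / Suboptimal / Failed)
 arcconf getconfig 1 pd      # per-port physical drive state

 # Force logical device 1 back online once its member drives are present
 # (as used on gdss607; this only clears the failed state, it does not repair data)
 arcconf setstate 1 logicaldrive 1 OPTIMAL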

Incident details

The disk server suffered two problems around three weeks apart.

The first problem was seen as multiple disk failures in ports 7, 23 and 5. Although three disks were found to have failed, it was possible to rebuild the RAID array. As part of the recovery process the disk controller firmware was updated. It is unclear whether any of the disk drives had actually failed. The logs show the Adaptec disk controller reporting errors ("adapter reset request"; "Bad Block discovered") at least a day before the server was taken out of production.
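
As an illustration only, messages of this kind can be searched for in the system logs ahead of a failure. The log path and exact strings below are assumptions based on the messages quoted in this report, not the site's actual monitoring configuration.

 # Controller errors reported before the first intervention
 grep -E "adapter reset request|Bad Block discovered" /var/log/messages

 # Kernel-level symptoms seen in the second failure
 grep -E "sense key: Hardware Error|xfs_log_force: error" /var/log/messages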

Around three weeks later, early in the evening of Tuesday 12th February, the server failed with a read-only file system being reported. The server was taken out of production by the On-Call team as per usual practice. It was noted that there had again been three disk failures in the server (ports 11, 13 and 17). The Fabric team investigated the problems on the server the following day.

That day one of the failed disks was replaced with a disk taken from another system (GDSS593). This failed to trigger a rebuild. A second disk was then taken from another server (GDSS611) and inserted into the failed array. However, a documentation error meant that this disk had been taken from another production system, and it was removed again and returned to its original server. In neither case (i.e. with the disk from GDSS593 or with the disk from GDSS611) did a RAID rebuild start.

In order to progress the recovery, and believing that the system (GDSS594) itself might be faulty, all the drives were inserted into a spare server of the same batch, GDSS607. GDSS607 was being used in a test environment (Castor "preprod") which provides both a testbed and a holding place for spare servers. A rebuild was triggered on the array in GDSS607. After the rebuild reported completion the server was rebooted; however, the RAID array then reported three failed drives. Unknown to the person making the change, GDSS607 had itself had a series of problems and, whilst working up to the time of this intervention, may not have provided a good alternative host for the disks in the RAID array.

It was found that while the RAID array was reported as rebuilding in GDSS607 the server was unstable. An attempt to check the status of the filesystems during the rebuild led to the server locking up. It was therefore necessary to wait for the rebuild to complete before validating the filesystems. As these were then found to be corrupt, there was a delay in announcing to the VO that the files were unrecoverable.
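
A minimal sketch of the filesystem validation described above, run against the unmounted data partitions named in the timeline. xfs_check is the tool referred to in the timeline; xfs_repair -n is the equivalent no-modify check provided by newer versions of xfsprogs.

 # Check each data partition without modifying it (partitions must be unmounted)
 xfs_check /dev/sdb1
 xfs_check /dev/sdb2
 xfs_check /dev/sdb3

 # Equivalent no-modify check with newer xfsprogs
 xfs_repair -n /dev/sdb2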

The VO (T2K) reported that, of the 68 files lost, 36 were replicated elsewhere. The remainder are stages of Monte Carlo (MC) output for which the jobs can easily be rerun.

Analysis

The server is one of a batch of 18. The data is held on a RAID 6 array made up of 21 disks plus a hot spare. The total capacity of the data array is 36 TB. There is also a mirrored pair of disks for the operating system.
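
As a rough consistency check on these figures (RAID 6 gives up two disks' worth of capacity to parity; the per-drive size is not stated in this report, so the figure below is an inference):

 usable capacity = (21 - 2) x per-drive size
 36 TB / 19 ≈ 1.9 TB per drive, i.e. nominally 2 TB drives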

On 22nd January the Nagios tests reported a single failed disk drive. On investigation it was found that the RAID controller card was reporting three failed disks. The server was taken out of service and it was possible to recover the RAID array. Analysis of the logs shows the Adaptec RAID controller was reporting problems for at least a day before the server was taken out of production.

Once the above problem was resolved the server ran successfully for around three weeks until it failed with a read-only file system on 12th February. The Nagios check for this condition alerted the on-call team and the server was taken out of service. An analysis of the Nagios and system logs for the hours before the failure has been made. Relevant points from these logs have been included in the timeline above. The logs confirm that the system was not reporting failed disk(s) or other filesystem / RAID problems until immediately before the system failed.
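
For illustration, a minimal read-only filesystem check of the kind that raised this alert might look as follows. This is not the Tier1's actual Nagios plugin; it simply reports CRITICAL, using the standard Nagios exit codes, if any local filesystem is mounted read-only.

 #!/bin/bash
 # Report CRITICAL (exit 2) if any locally mounted filesystem is read-only.
 ro=$(awk '$4 ~ /(^|,)ro(,|$)/ && $3 !~ /^(proc|sysfs|tmpfs|devpts|iso9660|squashfs)$/ {print $2}' /proc/mounts)
 if [ -n "$ro" ]; then
     echo "CRITICAL: read-only filesystem(s): $ro"
     exit 2
 fi
 echo "OK: no read-only filesystems"
 exit 0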

Owing to an operator error, one of the failed disks was replaced twice. In neither case did the server attempt to rebuild the array, and this is not considered to have affected the final outcome of the incident. At this point the disk controller was reporting three failed drives.

Efforts to recover the RAID array continued by moving all the disks into a spare identical server (GDSS607). The system did then start a rebuild of the array. The size of the array (36 TB) leads to a very long rebuild time. Attempts to check the status of the disk partitions during the rebuild led to the server becoming unstable. For this reason the rebuild was allowed to complete before further checks were made (and before further updating the VO about the status of the files). However, on completion of the rebuild the filesystems were found to be corrupt.

Server GDSS607 was running as part of a Castor test instance which also forms a pool of hot spares. However, unknown to the member of staff carrying out the work until part way through the process, GDSS607 itself had a history of problems. Use of this potentially faulty spare system may have compounded the problems of the RAID array. During subsequent analysis it was noted that previous attempts to recover an array by moving the disks to a replacement chassis have rarely been successful. Replacing components such as the disk controller card has generally given better results.

It is quite possible that the RAID 6 disk array, with three failed drives, could not have been recovered whatever actions were taken. However, it is possible that the use of GDSS607, which itself may have had problems, contributed to the failure to recover the system and to the data loss.

It is noted that in both the January & February incidents only odd-numbered disks were reported as having failed. This may indicate a backplane or chassis fault. However, it has not yet been possible to isolate such a fault. Alternatively, the errors found in the logs for the first event suggest there may have been a fault with the disk controller - possibly not fully resolved after the first intervention on 22nd January.

A check on the operational history of this batch of 18 servers does not show an undue rate of error. Only two servers from the batch have shown significant faults: GDSS594 and GDSS607, both of which were involved in this incident.

Follow Up

Issue: No call-outs for failed disks (or even for multiple failed disks) occurred.
Response: Although an analysis of the logs for the incident on 12th February shows that there were no failed disks ahead of the server failure in this case, a validation that the tests on disk failures for this batch of servers are correctly configured - both to detect disk failures and to call out as appropriate - should be made. (A minimal illustrative check is sketched below this table.)
Done: No

Issue: Important information was unknown to the member of staff carrying out the work.
Response: Check the appropriateness and timeliness of the documentation on disk server states and spares. Make awareness of this information part of the induction for new members of the Fabric Team.
Done: No

Issue: No spare replacement server from the same batch was available at the time.
Response: Validate the spares for this batch of disk servers. Ensure there are both spare disks and a spare working server. The vendor for this batch of servers (Streamline) is no longer in business and acquiring spares is more difficult than for other batches of servers.
Done: No
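
As a starting point for the first follow-up action above, a minimal illustrative check (not the production test) that counts failed physical drives reported by the Adaptec controller and calls out via the standard Nagios exit codes. The exact "State ... : Failed" string matched below is an assumption about the arcconf output format.

 #!/bin/bash
 # Report CRITICAL (exit 2) if the controller reports any failed physical drive.
 failed=$(arcconf getconfig 1 pd | grep -c "State.*: Failed")
 if [ "$failed" -gt 0 ]; then
     echo "CRITICAL: $failed failed drive(s) reported by controller 1"
     exit 2
 fi
 echo "OK: no failed drives reported"
 exit 0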

Reported by: Gareth Smith, 19th March 2013

Summary Table

Start Date: 12th February 2013
Impact: >80%
Duration of Outage: 7 days
Status: Open
Root Cause: Hardware
Data Loss: Yes