RAL Tier1 Incident 20100801 Disk Server Data Loss Atlas

Failure of Atlas Disk Server Resulted in Data Loss

Report In Preparation

Description:

The failure of a single disk within the RAID 6 disk server (GDSS417) led to the server failing. When initially checked, it appeared that the RAID array had been lost. Subsequent investigations, and additional information received, enabled the RAID array to be recovered. However, the disk server was found to be unstable and could not be returned to production.

Atlas provided a list of important files on the server. While the server was out of production, a total of 4000 files, including most of those Atlas had identified as important, were manually copied off; however, doing this took time and required significant manual effort. As a result, it was agreed with Atlas that the remaining files, which were replicated at other sites, could be discarded and re-transferred. The remaining 43000 files were declared as lost.
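
The manual copy was in essence a scripted loop over the priority list, pulling each file off the degraded server and noting any failures. A minimal sketch of such a copy is shown below; the hostname, destination directory and list format are illustrative assumptions, not the actual tooling used during the incident.

 #!/usr/bin/env python3
 # Minimal sketch: copy a prioritised list of files off a degraded disk
 # server with scp. The hostname, destination directory and list format
 # are illustrative assumptions, not the tooling used during the incident.
 import os
 import subprocess
 import sys

 SOURCE_HOST = "gdss417.example.ac.uk"   # hypothetical address of the failed server
 DEST_DIR = "/data/recovered/gdss417"    # hypothetical staging area

 def copy_files(list_path):
     """Copy each file named in list_path (one absolute path per line)."""
     failed = []
     with open(list_path) as fh:
         for line in fh:
             path = line.strip()
             if not path or path.startswith("#"):
                 continue
             target = os.path.join(DEST_DIR, path.lstrip("/"))
             os.makedirs(os.path.dirname(target), exist_ok=True)
             # Record non-zero exit codes rather than aborting, so one
             # unreadable file does not stop the whole run.
             rc = subprocess.call(["scp", "-p", "%s:%s" % (SOURCE_HOST, path), target])
             if rc != 0:
                 failed.append(path)
     return failed

 if __name__ == "__main__":
     for path in copy_files(sys.argv[1]):
         print("FAILED: %s" % path)

Files that fail to copy are simply reported, so that they can be retried or, ultimately, declared lost.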

Impact

The disk server was unavailable for an extended period (10 days). The UK was blacklisted from Atlas production from the 9th to the 12th of August. The important files identified by Atlas were successfully recovered from the server, while the remaining files were marked as lost. The disk server (gdss417) was part of the Atlas MCDisk space token, a D1T0 service class.

Timeline of the Incident

When What
Sunday 1st August - 06:26 Primary On-Call received notification of server down.
Monday 2nd August - 09:01 Kashif checked the disk server and found kernel panic messages on the screen.
Monday 2nd August - 09:54 A single drive (port 11) was found to be missing from the RAID array.
Monday 2nd August - 09:58 Could not boot the system as an error occurred during the file system check.
Monday 2nd August - 10:18 Booted the system with the file system unmounted and without fsck, but the controller was unable to see the data array.
Monday 2nd August - 14:35 Reported the fault to vendor.
Tuesday 3rd August - 10:46 Replaced drive in port 11, which came up as a new drive instead of rebuilding.
Tuesday 3rd August - 15:20 All data declared lost to ATLAS.
Tuesday 3rd August - 20:30 RAL production queue set offline while ATLAS cleaned up the data.
Wednesday 4th August - 09:40 John Bland (Liverpool) gets in touch, having seen a similar problem with Areca 1280 RAID cards.
Wednesday 4th August - 11:44 James Thorne managed to see the RAID array by re-activating it. Rebuild started on port 11.
Wednesday 4th August - 12:15 ATLAS informed that the data is not lost and that the disk server should be back in production by Friday. ATLAS told it might be possible to get access to some high-priority files sooner.
Thursday 5th August - 14:15 ATLAS provide a list of high-priority files. Alastair and Brian copy the first few back into production by hand. Brian also starts to copy files seen in the ATLAS Dashboard as failing due to "Locality is Unavailable".
Friday 6th August - 12:00 gdss417 returned to production but immediately crashes. ATLAS informed that files will not be available until the middle of the following week. Brian continues creating a list of files from the dashboard and manually moving files via the "scp to UI" method.
Monday 9th August - 14:45 Alastair gives presentation to ADC. UK blacklisted for ATLAS production jobs. Won't be put back into production until disk server problems resolved.
Tuesday 10th August - 12:00 gdss417 put into draining mode in Castor. After some 4000 files are copied off back into production, the disk server crashes again. Remaining files declared lost to ATLAS (see the file-list sketch after this timeline).
Wednesday 11th August - 10:00 Alastair requests that RAL is put back into production. Test jobs submitted, they succeed. RAL and rest of UK is removed from blacklist the following day.
Tuesday 17th August - 10:00 Still receiving GGUS tickets regarding files which are "Lost" and "User Error".
Monday 13th September - 15:28 Castor team (Chris) confirms that any data left on disk server can be destroyed. System is ready for intervention.
Friday 17th September - 16:05 Updated firmware as per the Vendor's suggestion. Re-created the RAID array and started initializing drives.
Monday 20th September - 09:32 Re-installed the file system and partitions. Started acceptance testing.
Friday 22nd October - 14:37 Disk server is still in testing (34 days). One drive was removed manually but the system did not crash; it would probably still crash on a genuine drive failure.
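
Producing the final list of files to declare lost (the remaining files referred to in the 10th August entry) amounts to subtracting the set of files already copied back into production from the full listing of files resident on the server. A small sketch of that set difference is given below; the input and output file names and their one-path-per-line format are assumptions for illustration.

 #!/usr/bin/env python3
 # Sketch: derive the "declare lost" list as everything recorded as resident
 # on gdss417 minus everything already copied back into production.
 # The file names and one-path-per-line format are assumptions.

 def read_paths(list_path):
     """Return the set of non-empty, non-comment lines in list_path."""
     paths = set()
     with open(list_path) as fh:
         for line in fh:
             entry = line.strip()
             if entry and not entry.startswith("#"):
                 paths.add(entry)
     return paths

 if __name__ == "__main__":
     resident = read_paths("gdss417_namespace_listing.txt")    # all files on the server
     recovered = read_paths("files_copied_to_production.txt")  # the ~4000 recovered files
     lost = sorted(resident - recovered)
     with open("files_to_declare_lost.txt", "w") as out:
         out.write("\n".join(lost) + "\n")
     print("%d resident, %d recovered, %d to declare lost"
           % (len(resident), len(recovered), len(lost)))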

Incident details

The drive in port 11 of gdss417 failed and the machine kernel panicked for a reason that was unclear. On reboot, the machine could not see the data array and all data on the server was declared as lost to Atlas. However, additional information received (we note our thanks to John Bland at Liverpool for this pointer) enabled the array to be recovered by "reactivating" it in the RAID controller command line interface. On rebooting, the data was present but the server was found to be unstable. After a number of files had been read, the machine crashed again; we suspect that the file system had become damaged during the initial crash. Tier1 staff recovered all "critical" files (as determined by Atlas) from the machine; the remainder were declared lost.
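
Because the file system was suspected to be damaged, copies read off the unstable server were only trustworthy if they could be checked against known checksums. The sketch below compares the adler32 checksum of each recovered copy against a value exported from the catalogue; the "path,hex-checksum" dump format is an assumption for illustration, although adler32 is the checksum routinely stored for grid files.

 #!/usr/bin/env python3
 # Sketch: verify recovered copies against catalogue adler32 checksums
 # before trusting them. The catalogue dump format (one "path,hex-checksum"
 # entry per line) is an assumption for illustration.
 import zlib

 def adler32_of(path, blocksize=1024 * 1024):
     """Compute the adler32 checksum of a file, reading it in blocks."""
     value = 1  # adler32 seed value
     with open(path, "rb") as fh:
         while True:
             block = fh.read(blocksize)
             if not block:
                 break
             value = zlib.adler32(block, value)
     return value & 0xFFFFFFFF

 def verify(catalogue_dump):
     """Yield (path, ok) for each entry in the dump file."""
     with open(catalogue_dump) as fh:
         for line in fh:
             entry = line.strip()
             if not entry or entry.startswith("#"):
                 continue
             path, expected = entry.split(",")
             yield path, adler32_of(path) == int(expected, 16)

 if __name__ == "__main__":
     for path, ok in verify("recovered_files_checksums.txt"):
         print("%s %s" % ("OK " if ok else "BAD", path))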

Analysis

The reason for gdss417 losing a RAID 6 data volume when a single drive failed is not yet fully understood. Both Areca (the RAID card manufacturer) and the Vendor suggest that it is a firmware problem and that the RAID controller firmware should be updated to the latest release, although we have not yet had a definitive statement from them that this will indeed prevent the problem. We are actively pursuing a solution with the Vendor and have asked them to treat this issue with high priority, particularly as we have now seen this behaviour on two other machines at the Tier1.

Following the initial failure all data on disk server GDSS417 was declared to the VO (Atlas) as lost. However, it was subsequently learnt that the data could be recovered, and Atlas were informed of the new situation. This resulted in conflicting information being supplied to Atlas. Furthermore, the server was then found to be unstable, and recovering the data became a lengthy, labour-intensive process. The result was a protracted outage with confusing information being supplied to Atlas. Communications with Atlas were good and enabled the critical files to be copied elsewhere; nevertheless, the protracted length of the problem negated much of the benefit of recovering some of the files.

Follow Up

Issue: Failure of a single disk drive caused the server to fail. This may be a particular failure of the hardware in GDSS417; however, we need to assess whether there is a general problem with the interaction between the disks and the RAID controllers in this batch of disk servers that causes this.
Response: Follow up with the vendor to obtain/test/implement any necessary changes (e.g. firmware). Subsequent failures of disk servers have indicated a problem with a particular batch of disk servers that is being tracked. As of the end of November 2010 the problem is being pursued with the vendor, and the entire batch of servers is being taken out of production while a fix is awaited.
Done: Yes

Issue: Data recovery became possible on receipt of further guidance.
Response: Modify procedures to include re-activating the RAID array as an additional method of trying to recover data.
Done: yes/no

Issue: A protracted exchange of information with Atlas led to confusion. The point at which data is declared lost depends, in part, on agreement between the RAL Tier1 and the VO concerned. The RAL Tier1 has been very cautious regarding data preservation, although this may not always be what is wanted by the VO (Atlas in this case).
Response: The RAL Tier1 to modify its approach to VOs (e.g. Atlas) so as to be prepared to consider declaring data lost earlier in the intervention procedure.
Done: Yes

Related issues

The following Post Mortem also considers changes to the Disk Intervention Procedures:

https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20100916_Second_Failure_of_Disk_Server-CMS_data_loss

There were two subsequent disk server failures from the same batch of disk servers, both of which resulted in data loss for Atlas. The Post Mortems can be seen at:

https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20101108_Atlas_Disk_Server_GDSS398_Data_Loss and

https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20101121_Atlas_Disk_Server_GDSS391_Data_Loss


Reported by: Gareth Smith 15th October 2010

Summary Table

Start Date 1 August 2010
Impact 100% unavailability of this server.
Duration of Outage 10 days.
Status Open
Root Cause Hardware
Data Loss Yes