RAL Tier1 Incident 20091130 RAID5 double disk failure
Double Disk Failure on Server Led to Data Loss
Site: RAL-LCG2
Incident Date: 2009-11-30
Severity: Severe
Service: CASTOR, LHCb
Impacted: LHCb
Incident Summary: Two disks failed in gdss138 (lhcbDst, disk 1 tape 0) within 30 minutes of each other, rendering the RAID5 data array inoperable. After attempts to recover the array failed, the data was declared lost; see the timeline for full details.
Type of Impact: Data Loss
Incident duration: 1 day
Report date: 2009-12-03
Reported by: James Thorne
Related URLs: https://helpdesk.gridpp.rl.ac.uk/Ticket/Display.html?id=53925 (Tier1 helpdesk account needed)
Incident details:
gdss138 is a disk 1 tape 0 disk server in the LHCb CASTOR instance at RAL. It is a Viglen 2006 machine, and the data array comprises 13 500GB drives configured as RAID5 plus a hot spare. The RAID controller in the machine is not capable of RAID6. The machine had last verified its arrays on 16 November under the standard RAL array verify schedule. It was also running the fsprobe corruption detection tool, which is likewise standard across the Tier1 disk farm.
At about 05:30 on 30 November 2009, a drive (port 13) failed in the data array. Unfortunately, this did not trigger a rebuild because the controller did not eject the drive from the array. Eventually the drive timed out and a degraded array event was logged. Before the card could start a rebuild, a second drive failed, destroying the parity that would have been needed to rebuild the array. I do not believe that we missed any reasonable fix for the array. The timeline summarises the attempts to recover the array.
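The mechanics above can be illustrated with a minimal sketch (not the actual 3ware controller logic): RAID5 keeps one XOR parity stripe per data stripe, so exactly one missing drive can be reconstructed from the survivors; once a second drive is gone there are two unknowns and only one parity equation, and the array is lost. The byte values below are arbitrary illustration data.

```python
# Minimal sketch of RAID5 XOR parity: why one lost drive is recoverable
# and a second concurrent failure is not. Illustration only.
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    return bytes(reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks))

# Hypothetical 4-byte stripes on three data drives.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(data)  # the parity drive's stripe

# One drive fails: XOR of the surviving stripes and parity recovers it.
lost = data[1]
recovered = xor_blocks([data[0], data[2], parity])
assert recovered == lost

# With two drives gone, the single XOR relation cannot determine both
# missing stripes, which is what made gdss138's array inoperable.
```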
Future mitigation:
Double disk failures seem to be increasingly likely on the RAID5 machines as they get older. The Tier1 is considering how we can migrate disk 1 tape 0 data from RAID5 systems to RAID6 systems as RAID6 can rebuild an array after two drive failures.
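The trade-off in that migration can be put in rough numbers, assuming the same drive layout as gdss138 (13 data-array drives of 500GB; the hot spare is excluded here): RAID6 gives up one additional drive's worth of capacity in exchange for surviving a second concurrent drive failure.

```python
# Back-of-envelope RAID5 vs RAID6 comparison for a gdss138-like array.
# Drive count and size are assumptions taken from the report above.
drives = 13     # drives in the data array (hot spare excluded)
size_gb = 500

raid5_usable_gb = (drives - 1) * size_gb  # one drive's worth of parity
raid6_usable_gb = (drives - 2) * size_gb  # two independent parity stripes

raid5_failures_survived = 1  # a second failure during rebuild loses the array
raid6_failures_survived = 2  # can still rebuild after a double failure
```

So the move from RAID5 to RAID6 on this hardware class would cost roughly 500GB of usable space per server (6000GB down to 5500GB) to remove exactly the failure mode seen in this incident.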
We have sent logs back to Viglen to see if they, 3ware, or Western Digital have any suggestions that may help us reduce the risk of this happening again. There are also ongoing investigations into disk failure rates, led by Martin Bly and Ian Collier.
gdss138 has had the two failed drives (and one other that failed subsequently) replaced and the array re-initialized. It will re-run the Tier1 acceptance tests for seven days to weed out any further faulty drives before being put back into use.
Related issues:
The machine had recently run an array verify and was running fsprobe.
Timeline
Event | Date | Time | Comment |
---|---|---|---|
First drive shows problems | 2009-11-30 | 05:29:52 | Drive in port 13 encounters "device error" but does not trigger a rebuild. |
Fault first detected | 2009-11-30 | 05:29:52 | Nagios |
First drive times out | 2009-11-30 | 05:40:43 | Drive in port 13 timed out. Controller logged a degraded array but still did not start a rebuild. |
Second drive fails | 2009-11-30 | 06:07:32 | Drive in port 3 fails into state "not present". Array is now marked "inoperable". |
First Intervention | 2009-11-30 | 08:57:08 | Kashif Hafeez logs ticket in helpdesk and starts investigating. Kashif's notes from the ticket: |
Advisory Issued of Server Unavailable | 2009-11-30 | 09:24 | By Tiju Idiculla (Admin On Duty) to Raja Nandakumar via email. |
Advisory Issued of Probable Data Loss | 2009-11-30 | 12:34 | By Gareth Smith to Raja Nandakumar via email. |
Inform WLCG Daily Meeting | 2009-11-30 | 14:00 | Gareth Smith |
Advisory Confirming Data Loss | 2009-11-30 | 16:57 | By Gareth Smith to Raja Nandakumar via email. |
Machine re-certification started | 2009-12-03 | 11:47 | All faulty drives replaced and array re-initialization started. |
Fault Fixed | 2009-12-08 | 14:28 | Machine has had drives replaced, array rebuilt from scratch and 7 days of testing. |
Announced as Fixed | 2009-12-08 | 14:28 | Announced to CASTOR team at RAL. Machine will be redeployed in non-disk-only service class. Machine now entering deployment process. |