RAL Tier1 Incident 20091130 RAID5 double disk failure

Double Disk Failure on Server Led to Data Loss

Site: RAL-LCG2

Incident Date: 2009-11-30

Severity: Severe

Service: CASTOR, LHCb

Impacted: LHCb

Incident Summary: Two disks failed in gdss138 (lhcbDst, disk 1 tape 0) within 30 minutes of each other, rendering the RAID5 data array inoperable. After attempts to recover the array failed, the data was declared lost; see the timeline for full details.

Type of Impact: Data Loss

Incident duration: 1 day

Report date: 2009-12-03

Reported by: James Thorne

Related URLs: https://helpdesk.gridpp.rl.ac.uk/Ticket/Display.html?id=53925 (Tier1 helpdesk account needed)

Incident details:

gdss138 is a disk 1 tape 0 disk server in the LHCb CASTOR instance at RAL. It is a Viglen 2006 machine and its data array comprises 13 x 500 GB drives configured as RAID5 plus a hot spare. The RAID controller in the machine is not capable of RAID6. The machine had last verified its arrays on 16 November as part of the standard RAL array verify schedule, and it was running the fsprobe corruption detection tool, which is standard across the Tier1 disk farm.
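
The array verify schedule and fsprobe are RAL's own tooling and are not reproduced here, but the kind of controller status check involved can be sketched. The following is a minimal, illustrative Python sketch (not the Tier1's actual monitoring code), assuming the 3ware tw_cli utility is installed, the controller is /c0, and unit lines in the 'show' output carry the status in their third column:

  import subprocess

  def check_3ware_units(controller="/c0"):
      """Poll a 3ware controller with tw_cli and report units that look unhealthy."""
      out = subprocess.run(["tw_cli", controller, "show"],
                           capture_output=True, text=True, check=True).stdout
      # Statuses treated as acceptable here are examples, not an exhaustive list.
      acceptable = {"OK", "VERIFYING", "INITIALIZING"}
      problems = []
      for line in out.splitlines():
          fields = line.split()
          # Unit lines look like "u0  RAID-5  OK  ...": 'u' plus the unit number.
          if fields and fields[0].startswith("u") and fields[0][1:].isdigit():
              unit, unit_type, status = fields[0], fields[1], fields[2]
              if status not in acceptable:
                  problems.append((unit, unit_type, status))
      return problems

  if __name__ == "__main__":
      for unit, unit_type, status in check_3ware_units():
          print(f"ATTENTION: unit {unit} ({unit_type}) is {status}")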

At about 05:30 on 30 November 2009, a drive (port 13) failed in the data array. Unfortunately, this did not trigger a rebuild because the controller did not drop the drive from the array. Eventually the drive timed out and a degraded array event was logged. Before the controller could start a rebuild, a second drive failed, leaving too little redundancy to reconstruct the array. I do not believe that we missed any reasonable fix for the array. The timeline summarises the attempts to recover it.
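
To make the failure mode concrete: RAID5 keeps one parity block per stripe, computed as the XOR of the data blocks, so any single missing block can be reconstructed, but once two blocks of a stripe are gone the remaining XOR cannot be separated into either of them. A toy Python illustration (not tied to the controller's actual on-disk layout):

  # Toy RAID5 stripe: three data blocks plus one XOR parity block.
  data = [b"\x10\x20", b"\x03\x04", b"\xaa\x55"]

  def xor_blocks(blocks):
      """Byte-wise XOR of equal-length blocks."""
      out = bytearray(len(blocks[0]))
      for block in blocks:
          for i, byte in enumerate(block):
              out[i] ^= byte
      return bytes(out)

  parity = xor_blocks(data)

  # One block lost: XOR of the surviving blocks and the parity recovers it.
  recovered = xor_blocks([data[0], data[2], parity])
  assert recovered == data[1]

  # Two blocks lost: only data[0] and the parity survive.  Their XOR equals
  # the XOR of the two missing blocks, which cannot be split back into either.
  assert xor_blocks([data[0], parity]) == xor_blocks([data[1], data[2]])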

Future mitigation:

Double disk failures appear to be increasingly likely on the RAID5 machines as they get older. The Tier1 is considering how we can migrate disk 1 tape 0 data from RAID5 systems to RAID6 systems, since RAID6 retains enough redundancy to rebuild an array after two concurrent drive failures.
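
For illustration of the difference (the figures below are invented for the example, not Tier1 measurements): once one drive has failed, a RAID5 array is lost if any further drive fails before the rebuild completes, whereas RAID6 can absorb one more failure. A small Python sketch of that arithmetic, assuming independent drive failures:

  from math import comb

  def p_loss_during_rebuild(remaining_drives, p_fail, tolerated):
      """Probability that more than `tolerated` of the remaining drives fail
      during the rebuild window, assuming independent failures (a
      simplification: drives from one batch often fail in correlated ways)."""
      p_ok = sum(comb(remaining_drives, k)
                 * p_fail ** k * (1 - p_fail) ** (remaining_drives - k)
                 for k in range(tolerated + 1))
      return 1 - p_ok

  # Hypothetical figures: 12 surviving data drives, each with a 1% chance of
  # failing before the rebuild completes.
  n, p = 12, 0.01
  print("RAID5, no further failure tolerated: ", p_loss_during_rebuild(n, p, 0))
  print("RAID6, one further failure tolerated:", p_loss_during_rebuild(n, p, 1))

With these example figures the chance of losing the array during the rebuild window drops from roughly 11% to well under 1%.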

We have sent logs back to Viglen to see whether they, 3ware, or Western Digital have any suggestions that might help us reduce the risk of this happening again. There are also ongoing investigations into disk failure rates, led by Martin Bly and Ian Collier.

gdss138 has had the two failed drives (and one other that failed subsequently) replaced and the array re-initialized. It will re-run the Tier1 acceptance tests for seven days to weed out any further faulty drives before being put back into use.
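
The Tier1 acceptance tests themselves are internal, but the sort of health check they repeat during the soak period can be sketched. A minimal, illustrative Python loop, assuming smartctl can reach the drives behind the 3ware controller via its /dev/twa0 device node and that the port numbering below matches the 16-port card (both assumptions, not the real test harness):

  import subprocess
  import time

  CONTROLLER_DEV = "/dev/twa0"   # assumed 3ware device node on this machine
  DRIVE_PORTS = range(16)        # assumed port layout of the 16-port card
  CHECK_INTERVAL = 3600          # seconds between sweeps
  SOAK_DAYS = 7

  def smart_health(port):
      """Return True if smartctl reports the drive on this 3ware port as healthy."""
      result = subprocess.run(
          ["smartctl", "-H", "-d", f"3ware,{port}", CONTROLLER_DEV],
          capture_output=True, text=True)
      return "PASSED" in result.stdout

  end = time.time() + SOAK_DAYS * 86400
  while time.time() < end:
      for port in DRIVE_PORTS:
          if not smart_health(port):
              print(f"Drive on port {port} failed its SMART health check")
      time.sleep(CHECK_INTERVAL)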

Related issues:

The machine had recently run an array verify and was running fsprobe.
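
fsprobe's approach is to write known patterns to the monitored filesystem and read them back later, so that silent corruption is reported rather than going unnoticed. A minimal sketch of that idea (not the actual fsprobe implementation; the probe path and pattern size are arbitrary), noting that the real tool takes care to defeat the page cache so the read genuinely exercises the disk:

  import os

  PROBE_FILE = "/mnt/array/probe.dat"   # hypothetical path on the monitored array
  PATTERN = bytes(range(256)) * 4096    # 1 MiB repeating pattern

  def write_probe():
      """Write the known pattern and force it out to disk."""
      with open(PROBE_FILE, "wb") as f:
          f.write(PATTERN)
          f.flush()
          os.fsync(f.fileno())

  def check_probe():
      """Re-read the probe file and report any silent corruption."""
      # A real checker would bypass the page cache (e.g. O_DIRECT) so this
      # read actually comes back from the disks rather than from memory.
      with open(PROBE_FILE, "rb") as f:
          data = f.read()
      if data != PATTERN:
          bad = sum(1 for a, b in zip(data, PATTERN) if a != b)
          print(f"Corruption detected: {bad} mismatched bytes "
                f"(length {len(data)} vs expected {len(PATTERN)})")
      else:
          print("Probe file read back cleanly")

  if __name__ == "__main__":
      write_probe()
      check_probe()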

Timeline

Event Date Time Comment
First drive shows problems 2009-11-30 05:29:52 Drive in port 13 encounters "device error" but does not trigger a rebuild.
Fault first detected 2009-11-30 05:29:52 Nagios
First drive times out 2009-11-30 05:40:43 Drive in port 13 timed out. Controller logged a degraded array but still did not start a rebuild.
Second drive fails 2009-11-30 06:07:32 Drive in port 3 fails into state "not present". Array is now marked "inoperable".
First Intervention 2009-11-30 08:57:08 Kashif Hafeez logs ticket in helpdesk and starts investigating.

Kashif's notes from the ticket:

  1. Powered off the system from the command line and pulled the power cables out. (9:15)
  2. Waited a few minutes, put the power cables back in and turned the system on. (9:17)
  3. System didn't boot and went into file system recovery mode. (9:20)
  4. Asked James T for help. (9:30)
  5. James T and I went to the HPD room and found no sign of the port 3 failure in the logs. (9:43)
  6. Powered off the system. (9:44)
  7. Pulled out the drive in port 3 and put it back. (9:45)
  8. Booted in single user mode. (9:50)
  9. System didn't boot and again went into file system recovery mode. (9:51)
  10. Commented out /dev/sd6* in /etc/fstab. (9:53)
  11. Booted in single user mode. (9:55)
  12. Grabbed the dmesg errors to a file (3ware-dmesg-errors-20091130). (10:00)
  13. Started networking and ssh. (10:01)
  14. Informed CASTOR on duty and Production on duty.
  15. Reseated and checked cables inside the system. (didn't work)
  16. Replaced the 16-port RAID card with the one from gdss87. (didn't work)
  17. Swapped all hard drives, including system drives, with gdss87. (didn't work)
  18. Attached the drive from port 3 to the drive tester.
  19. The drive is clunking and the system can't find it.
  20. The drive is probably dead.
  21. Added a new drive in port 3. (didn't work)
  22. The system was showing all drives as new.
  23. Informed the Production team.
Advisory Issued of Server Unavailable 2009-11-30 09:24 By Tiju Idiculla (Admin On Duty) to Raja Nandakumar via email.
Advisory Issued of Probable Data Loss 2009-11-30 12:34 By Gareth Smith to Raja Nandakumar via email.
Inform WLCG Daily Meeting 2009-11-30 14:00 Gareth Smith
Advisory Confirming Data Loss 2009-11-30 16:57 By Gareth Smith to Raja Nandakumar via email.
Machine re-certification started 2009-12-03 11:47 All faulty drives replaced and array re-initialization started.
Fault Fixed 2009-12-08 14:28 Machine has had its drives replaced, the array rebuilt from scratch, and seven days of testing completed.
Announced as Fixed 2009-12-08 14:28 Announced to CASTOR team at RAL. Machine will be redeployed in non-disk-only service class. Machine now entering deployment process.