RAL Tier1 Incident 20081027 Data loss after multiple disk failure
Site: RAL-LCG2
Incident Date: 2008-10-27
Severity: Field not defined yet
Service: CASTOR
Impacted: ATLAS simStrip
Incident Summary: Disk server gdss154, configured as RAID5 plus hotspare, lost two drives. Replacement drives were ordered, but no one noticed that they had not arrived and been installed. A week later, a third drive failed and rendered the array inoperable.
Type of Impact: Data Loss
Incident duration: 11 days
Report date: 2008-11-07
Reported by: James Thorne, Tier1 Fabric Team
Related URLs: Ganglia plots for gdss154, RAL Tier1 Incident 20081102
Incident details:
Detailed timeline of events:
Date | Time | Who/What | Entry |
---|---|---|---|
17/10/2008 | 22:16:16 | syslog | Drive 14 failed. |
17/10/2008 | 22:28:53 | Nagios | Issued soft alarm. |
17/10/2008 | 22:35:26 | Nagios | Issued hard alarm. |
20/10/2008 | | James Thorne/James Adams | Nagios ndo2db daemon restarted twice (some alarms may have been lost). |
21/10/2008 | 15:50:00 | James Adams | Noticed that both drives 13 and 14 had failed. No log messages for drive 13. |
21/10/2008 | 16:22:05 | syslog, James Adams | Drives removed from ports 13 and 14. |
21/10/2008 | 17:00:00 | James Adams | Reported drives 13 and 14 had failed to Viglen and requested replacements. This was in a mail with other reported faulty drives. |
23/10/2008 | 08:36:00 | Jonathan Wheeler | Acknowledged Nagios alarm |
27/10/2008 | 04:37:56 | syslog | Drive in port 5 failed; first device errors in syslog. These errors continued until the machine was taken out of service. |
27/10/2008 | 05:04:48 | Nagios | Issued alarm for fsprobe errors. |
28/10/2008 | 09:59:56 | Brian Davies | Created helpdesk ticket 36373 |
28/10/2008 | 10:01:00 | Brian Davies | Report of ATLAS problems on gdss154 reaches CASTOR-OP mailing list |
28/10/2008 | 13:41:16 | Tim Folkes | Merged ticket 36379 into 36373 |
28/10/2008 | 13:54:39 | James Thorne | Merged tickets 36374 and 36375 into 36373 |
28/10/2008 | 14:00:00 | James Thorne | Intervention started |
28/10/2008 | 14:00:00 | Gareth Smith | Initial report to WLCG daily operations meeting. |
28/10/2008 | 14:11:00 | Martin Bly | Contacted Viglen to chase up drives |
28/10/2008 | 15:54:54 | James Thorne | Machine put into Nagios downtime |
28/10/2008 | 16:13:21 | Martin Bly | Tagged machine as faulty in the Tier1 Mimic |
29/10/2008 | 09:45:00 | James Adams, James Thorne | Collected new drives from Stores |
29/10/2008 | 13:01:00 | Gareth Smith | Sent email to atlas-uk-comp-operations@cern.ch to highlight problem |
29/10/2008 | 14:50:00 | James Adams | Attempted to recover array but disk from port 5 just made a clunking noise when powered up. |
29/10/2008 | 16:30:00 | James Adams | Reports that there is no chance of recovering data |
29/10/2008 | 16:45:00 | James Adams, Kashif Hafeez | Failed drives replaced |
29/10/2008 | 16:51:21 | James Adams, Kashif Hafeez | Array initialize started |
29/10/2008 | 17:01:00 | Gareth Smith | Sent email to atlas-uk-comp-operations@cern.ch confirming that there is no chance of recovering the data. |
30/10/2008 | 00:12:38 | 3ware log | Array initialize finished |
30/10/2008 | 14:50:00 | James Thorne | Repartitioned array and started acceptance testing to run for 7 days |
30/10/2008 | 15:45:08 | James Thorne | Nagios downtime extended to cover 7-day testing period |
04/11/2008 | 17:56:27 | 3ware log | Drive in port 12 thrown out. |
05/11/2008 | 11:00:00 | James Thorne | New drive ordered from Viglen |
06/11/2008 | 14:45:00 | James Thorne | Stopped testing |
07/11/2008 | 14:12:07 | Kashif Hafeez | Replacement drive inserted in port 12. |
The ganglia plots show that the host was still working after the first two disk failures (end of week 42). The third disk failed at the start of week 44:
File:Gdss154-load-graph.png File:Gdss154-net-graph.png File:Gdss154-cpu-graph.png File:Gdss154-mem-graph.png
Future mitigation:
The fabric team have implemented several changes as a result of this incident:
- A new Nagios test has been deployed with output that is easier to interpret, making double drive failures easier to spot. The two recent double disk failure events were hard to spot because no new alarm was raised for the second disk while the existing alarm for the server was still CRITICAL.
- Similar Nagios tests are being developed for Areca and other non-3ware systems.
- There is now a "Fabric Dashboard" on the Tier1 helpdesk showing hardware tickets that have not been updated for 2 days, making it easy to spot and chase up tickets that have stalled because components (e.g. disks) have not arrived. This is likely to be an interim solution, as there is some discussion of making the helpdesk actively escalate hardware problems.
- There have been several changes to the Fabric Team's written incident response procedure:
- If replacement disks have not been received within 2 working days, the issue is escalated as urgent with the appropriate vendor.
- If there is a second disk failure during a rebuild, a disk from a "spare" system should be inserted as a hotspare and the vendor notified of the serial number for their records.
- Team members will not acknowledge a disk failure alarm until the failure has been fixed.
- The fabric team will ask vendors if we can have some spare disks to hold on site rather than using drives from spare systems in an emergency.
- Double disk failures will be added to the alarms for call out.
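The key property of the new drive check described above is that every bad port is named in a single status line, so a second or third failure changes the alarm text even while the check is already CRITICAL. A minimal sketch of that idea is shown below; it assumes a 3ware `tw_cli`-style port listing, and the sample output, port numbers and serials are hypothetical, not taken from the actual deployed test:

```python
import re

# Hypothetical controller output in the style of tw_cli's port listing
# (layout assumed; not a capture from gdss154).
SAMPLE = """\
Port   Status           Unit   Size        Blocks        Serial
p0     OK               u0     465.76 GB   976773168     WD-AAAA
p5     DEVICE-ERROR     u0     465.76 GB   976773168     WD-BBBB
p13    NOT-PRESENT      -      -           -             -
p14    NOT-PRESENT      -      -           -             -
"""

def failed_ports(tw_cli_output):
    """Return (port, status) pairs for every drive not reporting OK."""
    failed = []
    for line in tw_cli_output.splitlines():
        m = re.match(r"(p\d+)\s+(\S+)", line)
        if m and m.group(2) != "OK":
            failed.append((m.group(1), m.group(2)))
    return failed

def nagios_check(tw_cli_output):
    """Nagios-style (exit code, message). Naming every bad port means the
    alarm text changes when a further drive fails, so a double failure is
    visible even while the check is already CRITICAL."""
    failed = failed_ports(tw_cli_output)
    if not failed:
        return (0, "OK: all drives OK")
    detail = ", ".join(f"{port}={status}" for port, status in failed)
    return (2, f"CRITICAL: {len(failed)} drive(s) not OK ({detail})")

if __name__ == "__main__":
    code, message = nagios_check(SAMPLE)
    print(message)
```

With the sample data above the check reports all three bad ports in one message, rather than a single unchanging CRITICAL state.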
Related issues:
There was a second double disk failure in a machine of the same generation (RAL Tier1 Incident 20081102). It is not thought to be linked to this failure but see the notes under related issues in the incident report.
Timeline
Event | Date | Time | Comment |
---|---|---|---|
Actually Started | 2008-10-17 | 22:16 | Disk in port 14 failed. Nagios alarm for port 14. Systems administrator noticed that there was a failed drive in port 13 but no logs or alarm. Replacement drives requested on 2008-10-21. |
Fault first detected | 2008-10-28 | 10:01 | Brian Davies noticed ATLAS transfers to gdss154 were failing and alerted CASTOR team. |
First Advisory Issued | 2008-10-28 | 14:00 | Gareth Smith initially announced the problem at the WLCG daily operations meeting. |
First Intervention | 2008-10-28 | 11:00 | Host removed from CASTOR service. |
Fault Fixed | 2008-11-07 | 14:12 | Array rebuilt, tested and ready for production. Machine will be redeployed to a different service class. |
Announced as Fixed | How, to who | ||
Downtime(s) Logged in GOCDB | n/a | n/a | None |
Other Advisories Issued | 2008-10-29 | 13:01 | Email to atlas-uk-comp-operations from Gareth Smith announced the problem. |
Other Advisories Issued | 2008-10-29 | 17:01 | Email to atlas-uk-comp-operations from Gareth Smith confirmed that we cannot recover the data. |