RAL Tier1 Incident 20081027 Data loss after multiple disk failure

From GridPP Wiki

Site: RAL-LCG2

Incident Date: 2008-10-27

Severity: Field not defined yet

Service: CASTOR

Impacted: ATLAS simStrip

Incident Summary: Disk server gdss154, configured as RAID5 plus a hotspare, lost two drives. Replacement drives were ordered, but nobody noticed that they had not arrived and been installed. A week later, a third drive failed and rendered the array inoperable.

Type of Impact: Data Loss

Incident duration: 11 Days

Report date: 2008-11-07

Reported by: James Thorne, Tier1 Fabric Team

Related URLs: Ganglia plots for gdss154, RAL Tier1 Incident 20081102

Incident details:

Detailed timeline of events:

Date Time Who/What Entry
17/10/2008 22:16:16 syslog Drive 14 failed.

kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0002): Degraded unit:unit=1, port=14.
kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=14.

17/10/2008 22:28:53 Nagios Issued soft alarm:

SERVICE ALERT: gdss154;DMESG_ALL;CRITICAL;SOFT;1;Error - dmesg contains 2 lines: last 2 read: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=14.: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0002): Degraded unit:unit=1, port=14.:

17/10/2008 22:35:26 Nagios Issued hard alarm:

SERVICE ALERT: gdss154;DMESG_ALL;CRITICAL;HARD;2;Error - dmesg contains 2 lines: last 2 read: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=14.: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0002): Degraded unit:unit=1, port=14.:

20/10/2008 James Thorne/James Adams Nagios ndo2db daemon restarted twice (some alarms may have been lost)
21/10/2008 15:50:00 James Adams Noticed that both drives 13 and 14 had failed. No log messages for drive 13.
21/10/2008 16:22:05 syslog, James Adams Drives removed from ports 13 and 14:

kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0019): Drive removed:port=14.

21/10/2008 17:00:00 James Adams Reported to Viglen that drives 13 and 14 had failed and requested replacements. This was in a mail with other reported faulty drives.
23/10/2008 08:36:00 Jonathan Wheeler Acknowledged Nagios alarm
27/10/2008 04:37:56 syslog Drive in port 5 failed, first device errors in syslog:

kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=5.
kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x000A): Drive error detected:unit=1, port=-2147483643.
kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x005E): Cache synchronization completed:unit=0.
kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x004F): Cache synchronization skipped:unit=1.
kernel: Device sdb not ready.
kernel: end_request: I/O error, dev sdb, sector 3917322240
kernel: Buffer I/O error on device sdb2, logical block 1394432
kernel: lost page write due to I/O error on sdb2

These errors continued until the machine was taken out of service.

27/10/2008 05:04:48 Nagios Issued alarm for fsprobe errors:

SERVICE ALERT: gdss154;PROCS_EXIST_FSPROBE;CRITICAL;SOFT;1;PROCS CRITICAL: 0 processes with command name 'fsprobe'

28/10/2008 09:59:56 Brian Davies Created helpdesk ticket 36373
28/10/2008 10:01:00 Brian Davies Report of ATLAS problems on gdss154 reaches CASTOR-OP mailing list
28/10/2008 13:41:16 Tim Folkes Merged ticket 36379 into 36373
28/10/2008 13:54:39 James Thorne Merged tickets 36374 and 36375 into 36373
28/10/2008 14:00:00 James Thorne Intervention started
28/10/2008 14:00:00 Gareth Smith Initial report to WLCG daily operations meeting.
28/10/2008 14:11:00 Martin Bly Contact Viglen to chase up drives
28/10/2008 15:54:54 James Thorne Machine put into Nagios downtime
28/10/2008 16:13:21 Martin Bly Tags machine as faulty in the Tier1 Mimic
29/10/2008 09:45:00 James Adams, James Thorne Collect new drives from Stores
29/10/2008 13:01:00 Gareth Smith Sent email to atlas-uk-comp-operations@cern.ch to highlight problem
29/10/2008 14:50:00 James Adams Attempted to recover array but disk from port 5 just made a clunking noise when powered up.
29/10/2008 16:30:00 James Adams Reports that there is no chance of recovering data
29/10/2008 16:45:00 James Adams, Kashif Hafeez Failed drives replaced
29/10/2008 16:51:21 James Adams, Kashif Hafeez Array initialize started
29/10/2008 17:01:00 Gareth Smith Sent email to atlas-uk-comp-operations@cern.ch confirming that there is no chance of recovering the data.
30/10/2008 00:12:38 3ware log Array initialize finished
30/10/2008 14:50:00 James Thorne Repartitioned array and started acceptance testing to run for 7 days
30/10/2008 15:45:08 James Thorne Nagios downtime extended to cover 7-day testing period
04/11/2008 17:56:27 3ware log Drive in port 12 thrown out:

ERROR Drive timeout detected: port=12
Degraded unit: unit=1, port=12

05/11/2008 11:00:00 James Thorne New drive ordered from Viglen
06/11/2008 14:45:00 James Thorne Stopped testing
07/11/2008 14:12:07 Kashif Hafeez Replacement drive inserted in port 12.

The ganglia plots show that the host was still working after the first two disk failures (end of week 42). The third disk failed at the start of week 44:

File:Gdss154-load-graph.png File:Gdss154-net-graph.png File:Gdss154-cpu-graph.png File:Gdss154-mem-graph.png
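To illustrate how the 3w-9xxx AEN messages in the timeline above can be collapsed into per-port failure state, here is a minimal Python sketch of a log check. This is a hypothetical illustration, not the RAL Nagios plugin; the message format is taken from the syslog excerpts above. Note that it would still have missed drive 13, which failed without writing anything to the log.

```python
import re

# Matches the 3ware AEN failure lines seen in the syslog excerpts, e.g.
# "3w-9xxx: scsi0: AEN: ERROR (0x04:0x0002): Degraded unit:unit=1, port=14."
AEN_RE = re.compile(
    r"3w-9xxx: scsi\d+: AEN: (?:ERROR|WARNING) \(0x04:0x[0-9A-Fa-f]+\): "
    r"(?:Degraded unit|Drive removed|Drive timeout detected).*?port=(\d+)"
)

def failed_ports(log_lines):
    """Return the set of controller ports that have reported a failure event."""
    ports = set()
    for line in log_lines:
        m = AEN_RE.search(line)
        if m:
            ports.add(int(m.group(1)))
    return ports

def check(log_lines):
    """Nagios-style verdict: two or more failed ports on a RAID5 unit is CRITICAL."""
    ports = sorted(failed_ports(log_lines))
    if len(ports) >= 2:
        return f"CRITICAL - multiple drive failures on ports {ports}"
    if ports:
        return f"WARNING - drive failure on port {ports[0]}"
    return "OK"
```

Because the verdict is derived from the accumulated set of failed ports rather than from the latest log line, a second drive failure changes the check output even while an earlier alarm is still standing.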

Future mitigation:

The fabric team have implemented several changes as a result of this incident:

  • A new Nagios test with easier-to-interpret output has been deployed, making double drive failures easier to spot. The two recent double disk failure events were hard to spot because the second failed disk did not raise a new alarm while the server's existing alarm was still CRITICAL.
  • Similar Nagios tests are being developed for Areca and other non-3ware systems.
  • There is now a "Fabric Dashboard" on the Tier1 helpdesk showing hardware tickets that have not been updated for 2 days, so tickets that have stalled waiting on components (e.g. disks) can be spotted easily and chased up. This is likely to be an interim solution, as there is some discussion of making the helpdesk actively escalate hardware problems.
  • There have been several changes to the Fabric Team's written incident response procedure:
    • If replacement disks have not been received after 2 working days then the issue is escalated to urgent with the appropriate vendor.
    • If there is a second disk failure during a rebuild, a disk from a "spare" system should be inserted as a hotspare and the vendor notified of the serial number for their records.
    • Team members will not acknowledge a disk failure alarm until it has been fixed.
  • The fabric team will ask vendors if we can have some spare disks to hold on site rather than using drives from spare systems in an emergency.
  • Double disk failures will be added to the alarms for call out.
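The dashboard's stale-ticket rule could be sketched as follows. The ticket fields and function name here are hypothetical (not the real Tier1 helpdesk schema), and the 2-working-day threshold from the revised procedure is simplified to 2 calendar days.

```python
from datetime import datetime, timedelta

# Per the revised procedure: chase tickets with no update for 2 days.
# Simplification: calendar days rather than working days.
STALE_AFTER = timedelta(days=2)

def stale_hardware_tickets(tickets, now=None):
    """Return hardware tickets whose last update is older than the threshold.

    `tickets` is assumed to be a list of dicts with 'id', 'category' and
    'last_updated' (datetime) fields - a hypothetical shape for illustration.
    """
    now = now or datetime.now()
    return [
        t for t in tickets
        if t["category"] == "hardware" and now - t["last_updated"] > STALE_AFTER
    ]
```

A dashboard built on a query like this surfaces exactly the failure mode of this incident: a hardware ticket that has quietly stopped moving while replacement parts are outstanding.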

Related issues:

There was a second double disk failure in a machine of the same generation (RAL Tier1 Incident 20081102). It is not thought to be linked to this failure, but see the notes under related issues in that incident report.


Milestone Date Time Comment
Actually Started 2008-10-17 22:16 Disk in port 14 failed. Nagios alarm for port 14. Systems administrator noticed that there was a failed drive in port 13 but no logs or alarm. Replacement drives requested on 2008-10-21.
Fault first detected 2008-10-28 10:01 Brian Davies noticed ATLAS transfers to gdss154 were failing and alerted CASTOR team.
First Advisory Issued 2008-10-28 14:00 Gareth Smith initially announced the problem at the WLCG daily operations meeting.
First Intervention 2008-10-28 11:00 Host removed from CASTOR service.
Fault Fixed 2008-11-07 14:12 Array rebuilt, tested and ready for production. Machine will be redeployed to a different service class.
Announced as Fixed How, to who
Downtime(s) Logged in GOCDB n/a None
Other Advisories Issued 2008-10-29 13:01 Email to atlas-uk-comp-operations from Gareth Smith announced the problem.
Other Advisories Issued 2008-10-29 17:01 Email to atlas-uk-comp-operations from Gareth Smith confirmed that we cannot recover the data.