RAL Tier1 Incident 20081027 Data loss after multiple disk failure

Site: RAL-LCG2

Incident Date: 2008-10-27

Severity: Field not defined yet

Service: CASTOR

Impacted: ATLAS simStrip

Incident Summary: Disk server gdss154, configured as RAID5 plus a hot spare, lost two drives. Replacement drives were ordered, but no one noticed that they had not arrived and been installed. A week later a third drive failed, rendering the array inoperable.

Type of Impact: Data Loss

Incident duration: 11 Days

Report date: 2008-11-07

Reported by: James Thorne, Tier1 Fabric Team

Related URLs: Ganglia plots for gdss154, RAL Tier1 Incident 20081102

Incident details:

Detailed timeline of events:

Date Time Who/What Entry
17/10/2008 22:16:16 syslog Drive 14 failed.

kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0002): Degraded unit:unit=1, port=14.
kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=14.

17/10/2008 22:28:53 Nagios Issued soft alarm:

SERVICE ALERT: gdss154;DMESG_ALL;CRITICAL;SOFT;1;Error - dmesg contains 2 lines: last 2 read: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=14.: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0002): Degraded unit:unit=1, port=14.:

17/10/2008 22:35:26 Nagios Issued hard alarm:

SERVICE ALERT: gdss154;DMESG_ALL;CRITICAL;HARD;2;Error - dmesg contains 2 lines: last 2 read: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=14.: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0002): Degraded unit:unit=1, port=14.:

20/10/2008 James Thorne/James Adams Nagios ndo2db daemon restarted twice (some alarms may have been lost)
21/10/2008 15:50:00 James Adams Noticed that both drives 13 and 14 had failed. No log messages for drive 13.
21/10/2008 16:22:05 syslog, James Adams Drives removed from ports 13 and 14:

kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0019): Drive removed:port=14.

21/10/2008 17:00:00 James Adams Reported to Viglen that drives 13 and 14 had failed and requested replacements. This was in an email that also reported other faulty drives.
23/10/2008 08:36:00 Jonathan Wheeler Acknowledged Nagios alarm
27/10/2008 04:37:56 syslog Drive in port 5 failed, first device errors in syslog:

kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=5.
kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x000A): Drive error detected:unit=1, port=-2147483643.
kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x005E): Cache synchronization completed:unit=0.
kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x004F): Cache synchronization skipped:unit=1.
kernel: Device sdb not ready.
kernel: end_request: I/O error, dev sdb, sector 3917322240
kernel: Buffer I/O error on device sdb2, logical block 1394432
kernel: lost page write due to I/O error on sdb2

These errors continued until the machine was taken out of service.

27/10/2008 05:04:48 Nagios Issued alarm for fsprobe errors:

SERVICE ALERT: gdss154;PROCS_EXIST_FSPROBE;CRITICAL;SOFT;1;PROCS CRITICAL: 0 processes with command name 'fsprobe'

28/10/2008 09:59:56 Brian Davies Created helpdesk ticket 36373
28/10/2008 10:01:00 Brian Davies Report of ATLAS problems on gdss154 reaches CASTOR-OP mailing list
28/10/2008 13:41:16 Tim Folkes Merged ticket 36379 into 36373
28/10/2008 13:54:39 James Thorne Merged tickets 36374 and 36375 into 36373
28/10/2008 14:00:00 James Thorne Intervention started
28/10/2008 14:00:00 Gareth Smith Initial report to WLCG daily operations meeting.
28/10/2008 14:11:00 Martin Bly Contact Viglen to chase up drives
28/10/2008 15:54:54 James Thorne Machine put into Nagios downtime
28/10/2008 16:13:21 Martin Bly Tags machine as faulty in the Tier1 Mimic
29/10/2008 09:45:00 James Adams, James Thorne Collect new drives from Stores
29/10/2008 13:01:00 Gareth Smith Sent email to atlas-uk-comp-operations@cern.ch to highlight problem
29/10/2008 14:50:00 James Adams Attempted to recover array but disk from port 5 just made a clunking noise when powered up.
29/10/2008 16:30:00 James Adams Reports that there is no chance of recovering data
29/10/2008 16:45:00 James Adams, Kashif Hafeez Failed drives replaced
29/10/2008 16:51:21 James Adams, Kashif Hafeez Array initialize started
29/10/2008 17:01:00 Gareth Smith Sent email to atlas-uk-comp-operations@cern.ch confirming that there is no chance of recovering the data.
30/10/2008 00:12:38 3ware log Array initialize finished
30/10/2008 14:50:00 James Thorne Repartitioned array and started acceptance testing, to run for 7 days
30/10/2008 15:45:08 James Thorne Nagios downtime extended to cover 7-day testing period
04/11/2008 17:56:27 3ware log Drive in port 12 thrown out:

ERROR Drive timeout detected: port=12
Degraded unit: unit=1, port=12

05/11/2008 11:00:00 James Thorne New drive ordered from Viglen
06/11/2008 14:45:00 James Thorne Stopped testing
07/11/2008 14:12:07 Kashif Hafeez Replacement drive inserted in port 12.

The ganglia plots show that the host was still working after the first two disk failures (end of week 42). The third disk failed at the start of week 44:

[Ganglia plots: load, network, CPU and memory graphs for gdss154]
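
The arithmetic behind the loss is worth spelling out: RAID5 keeps one parity block per stripe, so the unit can tolerate exactly one missing member, and a hot spare only restores that tolerance once a rebuild onto it has completed. The toy model below (a sketch under those assumptions, not a description of the 3ware firmware's actual behaviour) shows the state progression; the plots imply the unit was still only one member short after the first two drive failures, so the port 5 failure on 27/10 took it past what single parity can reconstruct.

def raid5_state(missing_members):
    """Toy model of a single-parity RAID5 unit.

    missing_members counts drives currently absent from the unit
    (failed members minus any fully rebuilt hot spares). Single
    parity can reconstruct at most one missing drive per stripe.
    """
    if missing_members == 0:
        return "optimal"
    if missing_members == 1:
        return "degraded (readable; rebuild possible)"
    return "failed (stripes unrecoverable: data loss)"

for n in (0, 1, 2):
    print("%d missing member(s): %s" % (n, raid5_state(n)))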

Future mitigation:

The fabric team have implemented several changes as a result of this incident:

  • A new Nagios test has been deployed whose output is easier to interpret, making double drive failures easier to spot. The two recent double disk failure events were hard to spot because the second failed disk did not raise a new alarm for the server while the previous alarm was still CRITICAL (a minimal sketch of such a check follows this list).
  • Similar Nagios tests are being developed for Areca and other non-3ware systems.
  • There is now a "Fabric Dashboard" on the Tier1 helpdesk showing hardware tickets that have not been updated for 2 days, so tickets that have stalled because components (e.g. disks) have not shown up can be spotted easily and chased up (a second sketch after this list illustrates the idea). This is likely to be an interim solution, as there is some discussion of making the helpdesk actively escalate hardware problems.
  • There have been several changes to the Fabric Team's written incident response procedure:
    • If replacement disks have not been received after 2 working days then the issue is escalated to urgent with the appropriate vendor.
    • If there is a second disk failure during a rebuild, a disk from a "spare" system should be inserted as a hotspare and the vendor notified of the serial number for their records.
    • Team members will not acknowledge a disk failure alarm until it has been fixed.
  • The fabric team will ask vendors if we can have some spare disks to hold on site rather than using drives from spare systems in an emergency.
  • Double disk failures will be added to the alarms for call out.
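
As an illustration of the first item above, a minimal Nagios-style check could parse the 3ware CLI output and name every non-OK port in its status line, so that a second drive failure changes the alarm text rather than hiding behind an already-CRITICAL service. The sketch below is a hypothetical example in Python, not the test the Fabric Team actually deployed; the tw_cli path and output format are assumptions and vary between controller models and firmware versions.

#!/usr/bin/env python
# Hypothetical sketch of a Nagios check for 3ware RAID port status.
# Assumes tw_cli is installed at /usr/sbin/tw_cli and that
# `tw_cli /c0 show` prints one line per port, "p<N> <STATUS> ...".
import subprocess
import sys

OK, CRITICAL, UNKNOWN = 0, 2, 3

def failed_ports(controller="/c0"):
    out = subprocess.Popen(["/usr/sbin/tw_cli", controller, "show"],
                           stdout=subprocess.PIPE).communicate()[0]
    bad = []
    for line in out.decode("ascii", "replace").splitlines():
        fields = line.split()
        # Port lines look like: "p14  DEGRADED  u1  ..."
        if (len(fields) >= 2 and fields[0].startswith("p")
                and fields[0][1:].isdigit() and fields[1] != "OK"):
            bad.append("%s=%s" % (fields[0], fields[1]))
    return bad

try:
    bad = failed_ports()
except OSError as err:
    print("UNKNOWN: cannot run tw_cli: %s" % err)
    sys.exit(UNKNOWN)
if bad:
    # Naming each failed port means a second failure changes this
    # text, producing a fresh notification instead of silence.
    print("CRITICAL: %d port(s) not OK: %s" % (len(bad), ", ".join(bad)))
    sys.exit(CRITICAL)
print("OK: all 3ware ports report OK")
sys.exit(OK)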

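Similarly, the "Fabric Dashboard" item amounts to a stale-ticket query. A minimal sketch, assuming a hypothetical ticket record with category, status and last-update fields (the real helpdesk schema is not described in this report):

from datetime import datetime, timedelta

STALE_AFTER = timedelta(days=2)

def stale_hardware_tickets(tickets, now=None):
    """Return open hardware tickets not updated for STALE_AFTER.

    Assumes each ticket is a dict with 'id', 'category', 'status'
    and 'last_updated' (a datetime); these field names are invented
    for the example.
    """
    now = now or datetime.now()
    return [t for t in tickets
            if t["category"] == "hardware"
            and t["status"] != "closed"
            and now - t["last_updated"] > STALE_AFTER]

# A ticket waiting on replacement disks since 21/10 would appear on
# the dashboard from 23/10 onwards.
example = [{"id": 12345, "category": "hardware", "status": "open",
            "last_updated": datetime(2008, 10, 21, 17, 0)}]
print(stale_hardware_tickets(example, now=datetime(2008, 10, 28, 10, 0)))
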
Related issues:

There was a second double disk failure in a machine of the same generation (RAL Tier1 Incident 20081102). It is not thought to be linked to this failure, but see the notes under Related issues in that incident report.

Timeline

Date Time Comment
Actually Started 2008-10-17 22:16 Disk in port 14 failed. Nagios alarm for port 14. A systems administrator noticed that there was a failed drive in port 13 but no logs or alarm. Replacement drives requested on 2008-10-21.
Fault first detected 2008-10-28 10:01 Brian Davies noticed ATLAS transfers to gdss154 were failing and alerted CASTOR team.
First Advisory Issued 2008-10-28 14:00 Gareth Smith initially announced the problem at the WLCG daily operations meeting.
First Intervention 2008-10-28 11:00 Host removed from CASTOR service.
Fault Fixed 2008-11-07 14:12 Array rebuilt, tested and ready for production. Machine will be redeployed to a different service class.
Announced as Fixed How, to who
Downtime(s) Logged in GOCDB n/a n/a None
Other Advisories Issued 2008-10-29 13:01 Email to atlas-uk-comp-operations from Gareth Smith announced the problem.
Other Advisories Issued 2008-10-29 17:01 Email to atlas-uk-comp-operations from Gareth Smith confirmed that we cannot recover the data.