RAL Tier1 Incident 20081102 RAID5 double disk failure

From GridPP Wiki
Revision as of 15:37, 7 November 2008 by James thorne (Talk | contribs)


Site: RAL-LCG2

Incident Date: 2008-11-02

Severity: Field not defined yet

Service: CASTOR

Impacted: ATLAS simStrip

Incident Summary: A double disk failure in the RAID5 array on gdss156 rendered the array inoperable. The first disk failed early on Saturday morning and the second early on Sunday morning. The second failure appears to have occurred before the rebuild had finished, and the failed drive was a data disk rather than a hot spare, hence the inoperable array.

Type of Impact: Data Loss

Incident duration:

Report date: 2008-11-07

Reported by: James Thorne, Tier1 Fabric Team

Related URLs: RAL Tier1 Incident 20081027, GGUS ticket 43111

Incident details:

Date Time Who/What Entry
2008-11-01 01:32:42 syslog Drive in port 3 fails early on a Saturday morning:

kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0002): Degraded unit:unit=1, port=3.
kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=3.

2008-11-01 01:43:57 Nagios Nagios issues first alarm:

SERVICE ALERT: gdss156;DMESG_ALL;CRITICAL;SOFT;1;Error - dmesg contains 3 lines: last 3 read: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=3.: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0002): Degraded unit:unit=1, port=3.: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=3.:

2008-11-02 03:13:54 syslog Drive in port 13 fails early the following day and the array is no longer operable:

kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=13.
kernel: Aborting journal on device sdb1.
kernel: Aborting journal on device sdb2.
kernel: Buffer I/O error on device sdb1, logical block 116641
kernel: Device sdb not ready.
kernel: end_request: I/O error, dev sdb, sector 34

2008-11-02 04:02:57 Nagios Nagios issues an alarm for fsprobe, which is no longer running because it cannot write to the filesystem after the second drive failure:

SERVICE ALERT: gdss156;PROCS_EXIST_FSPROBE;CRITICAL;SOFT;1;PROCS CRITICAL: 0 processes with command name fsprobe

2008-11-02 12:46:00 Alessandro Di Girolamo (ATLAS) Raised GGUS ticket 43111
2008-11-02 19:21:55 Catalin Condurache (on call) Created a Tier1 helpdesk ticket, after seeing the GGUS ticket, requesting that the machine be taken out of CASTOR.
2008-11-02 20:45:14 Chris Kruk Removed gdss156 from production.
2008-11-03 14:18:00 James Adams Reported problem to Viglen, along with the output of Viglen's diagnostic tool. Waiting for Viglen's feedback.
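
The timestamps in the table above give the key intervals directly. As a minimal sketch (the event labels are shorthand introduced here for illustration, not names taken from any log), the gap between the two drive failures and the time from the second failure to removal from production can be computed as:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps copied from the incident details table; the dictionary keys
# are illustrative labels, not identifiers from any log.
events = {
    "first_drive_failed": "2008-11-01 01:32:42",
    "second_drive_failed": "2008-11-02 03:13:54",
    "removed_from_production": "2008-11-02 20:45:14",
}
times = {name: datetime.strptime(stamp, FMT) for name, stamp in events.items()}


def gap(start, end):
    """Return the elapsed time between two labelled events."""
    return times[end] - times[start]


between_failures = gap("first_drive_failed", "second_drive_failed")
until_removed = gap("second_drive_failed", "removed_from_production")

print(between_failures)  # 1 day, 1:41:12
print(until_removed)     # 17:31:20
```

The array therefore ran degraded for a little under 26 hours before the second failure, and the server stayed in production for about 17.5 hours after it.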

Future mitigation:

For the measures taken regarding double disk failures, see the future mitigation section of RAL Tier1 Incident 20081027.

Related issues:

An erroneous message was noted in the logs of both recent double disk failures. In this incident, the messages file contained an obviously incorrect message after the failure of port 13, the last drive to fail:

Nov  2 03:13:54 gdss156 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=13.
Nov  2 03:13:54 gdss156 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x000A): Drive error detected:unit=1, port=-2147483635.

In RAL Tier1 Incident 20081027, there is a similar message after the failure of port 5, again the last drive to fail:

Oct 27 04:37:56 gdss154 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=5.
Oct 27 04:37:56 gdss154 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x000A): Drive error detected:unit=1, port=-2147483643.
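
Lines of this kind can be picked out of a messages file mechanically. The following is a minimal sketch only: the regular expression is derived solely from the AEN lines quoted in this report, not from any 3ware documentation, and the function name `implausible_ports` is hypothetical.

```python
import re

# Pattern for the 3w-9xxx AEN lines quoted in this report. It is an
# assumption based only on those examples, not on a 3ware specification.
AEN = re.compile(
    r"AEN: ERROR \((0x[0-9A-Fa-f]+):(0x[0-9A-Fa-f]+)\): "
    r"(?P<text>[^:]+):(?:unit=(?P<unit>\d+), )?port=(?P<port>-?\d+)\."
)


def implausible_ports(lines):
    """Return (line, port) pairs whose reported port number is negative."""
    flagged = []
    for line in lines:
        match = AEN.search(line)
        if match and int(match.group("port")) < 0:
            flagged.append((line, int(match.group("port"))))
    return flagged


sample = [
    "Nov  2 03:13:54 gdss156 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=13.",
    "Nov  2 03:13:54 gdss156 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x000A): Drive error detected:unit=1, port=-2147483635.",
]
for _, port in implausible_ports(sample):
    print(port)  # -2147483635
```
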

The negative port values look like an integer wraparound in the controller firmware: in both cases, subtracting the incorrectly reported port number from the correct one gives 2147483648 (2^31):

         13 - -2147483635  =  2147483648
          5 - -2147483643  =  2147483648

This has been reported to Viglen and 3ware.
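
One reading consistent with both log lines, offered as an assumption rather than a confirmed diagnosis, is that the correct port number has bit 31 set and is then printed as a signed 32-bit integer. A minimal sketch of that interpretation (both function names are hypothetical):

```python
def as_signed_32(value):
    """Reinterpret the low 32 bits of value as a two's-complement integer."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value >= 2**31 else value


def with_sign_bit(port):
    """Assumed corruption: set bit 31 of the port number, then read it as signed."""
    return as_signed_32(port | 0x80000000)


print(with_sign_bit(13))  # -2147483635
print(with_sign_bit(5))   # -2147483643
```

Both results match the port values logged on gdss156 and gdss154, which is why the difference from the true port is exactly 2147483648 in each case.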

Timeline

Event Date Time Comment
Actually started 2008-11-01 01:32:42 First drive failed
Fault first detected 2008-11-02 04:02:57 Nagios
First Advisory Issued 2008-11-03 14:00 Gareth Smith reported the problem at the WLCG daily operations meeting.
First Intervention 2008-11-03 09:00:00 James Adams takes a look and confirms data is unrecoverable.
Fault Fixed When was the problem resolved
Announced as Fixed How, to who
Downtime(s) Logged in GOCDB n/a n/a none
Other Advisories Issued 2008-11-03 12:24 Gareth Smith emailed atlas-uk-comp-operations@cern.ch.
Other Advisories Issued 2008-11-03 n/a Brian Davies remained in contact with ATLAS.