Difference between revisions of "RAL Tier1 Incident 20081102 RAID5 double disk failure"

Latest revision as of 15:37, 7 November 2008

Site: RAL-LCG2

Incident Date: 2008-11-02

Severity: Field not defined yet

Service: CASTOR

Impacted: ATLAS simStrip

Incident Summary: A double disk failure in the RAID5 array on gdss156 during a rebuild rendered the array inoperable. The first disk failed early on Saturday morning (2008-11-01) and the second early on Sunday morning (2008-11-02). The second failure appears to have occurred before the rebuild finished, and the failed drive was a data disk rather than a hot spare, hence the inoperable array.

Type of Impact: Data Loss

Incident duration:

Report date: 2008-11-07

Reported by: James Thorne, Tier1 Fabric Team

Related URLs: RAL Tier1 Incident 20081027, GGUS ticket 43111

Incident details:


Date	Time	Who/What	Entry
2008-11-01	01:32:42	syslog	Drive in port 3 fails early on a Saturday morning: `kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0002): Degraded unit:unit=1, port=3. kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=3.`
2008-11-01	01:43:57	Nagios	Nagios issues first alarm: `SERVICE ALERT: gdss156;DMESG_ALL;CRITICAL;SOFT;1;Error - dmesg contains 3 lines: last 3 read: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=3.: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0002): Degraded unit:unit=1, port=3.: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=3.:`
2008-11-02	03:13:54	syslog	Drive in port 13 fails early the following day and the array is no longer operable: `kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=13. kernel: Aborting journal on device sdb1. kernel: Aborting journal on device sdb2. kernel: Buffer I/O error on device sdb1, logical block 116641 kernel: Device sdb not ready. kernel: end_request: I/O error, dev sdb, sector 34`
2008-11-02	04:02:57	Nagios	Nagios issues an alarm for fsprobe, which can no longer write to the filesystem as a result of the second drive failure: `SERVICE ALERT: gdss156;PROCS_EXIST_FSPROBE;CRITICAL;SOFT;1;PROCS CRITICAL: 0 processes with command name fsprobe`
2008-11-02	12:46:00	Alessandro Di Girolamo (ATLAS)	Raised GGUS ticket 43111
2008-11-02	19:21:55	Catalin Condurache (on call)	After seeing the GGUS ticket, created a Tier1 helpdesk ticket requesting that the machine be taken out of CASTOR.
2008-11-02	20:45:14	Chris Kruk	Removed gdss156 from production.
2008-11-03	14:18:00	James Adams	Reported problem to Viglen, along with the output of Viglen's diagnostic tool. Waiting for Viglen's feedback.

Future mitigation:

For the measures taken regarding double disk failures, see the future mitigation section of RAL Tier1 Incident 20081027.

Related issues:

It was noted that both recent double disk failures left an erroneous message in the logs. In this incident, the messages file contained an obviously incorrect port number immediately after the timeout on port 13, the last drive to fail:

Nov  2 03:13:54 gdss156 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=13.
Nov  2 03:13:54 gdss156 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x000A): Drive error detected:unit=1, port=-2147483635.

In RAL Tier1 Incident 20081027, there is a similar message after the failure of port 5, again the last drive to fail:

Oct 27 04:37:56 gdss154 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=5.
Oct 27 04:37:56 gdss154 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x000A): Drive error detected:unit=1, port=-2147483643.

This looks like an integer wraparound in the controller firmware: subtracting each incorrect port number from the corresponding correct one gives the same value, 2147483648 (2^31), in both cases:

         13 - -2147483635  =  2147483648
          5 - -2147483643  =  2147483648

This has been reported to Viglen and 3ware.
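
The arithmetic above can be checked with a short Python sketch. It assumes, without any knowledge of the firmware internals, that the bogus value is the real port number stored in a 32-bit field with the top (sign) bit set and then printed as a signed integer. The names `decode_port` and `ports_in` are hypothetical helpers written for this illustration only.

```python
import re

# Assumption: the reported value is the real port with bit 31 set,
# printed as a signed 32-bit integer. This is only one interpretation
# that is consistent with the two log lines quoted above.
SIGN_BIT = 1 << 31  # 2147483648

PORT_RE = re.compile(r"port=(-?\d+)")


def decode_port(reported):
    """Reinterpret the value as unsigned 32-bit and clear bit 31."""
    return (reported & 0xFFFFFFFF) & ~SIGN_BIT


def ports_in(line):
    """Return (reported, decoded) pairs for every port= field in a log line."""
    return [(int(p), decode_port(int(p))) for p in PORT_RE.findall(line)]


if __name__ == "__main__":
    lines = [
        "Nov  2 03:13:54 gdss156 kernel: 3w-9xxx: scsi0: AEN: ERROR "
        "(0x04:0x000A): Drive error detected:unit=1, port=-2147483635.",
        "Oct 27 04:37:56 gdss154 kernel: 3w-9xxx: scsi0: AEN: ERROR "
        "(0x04:0x000A): Drive error detected:unit=1, port=-2147483643.",
    ]
    for line in lines:
        for reported, decoded in ports_in(line):
            print(f"reported port={reported} -> decoded port={decoded}")
```

Under this assumption the two bogus values decode to ports 13 and 5, matching the ports named in the preceding "Drive timeout detected" messages. Correctly reported port numbers pass through unchanged.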

Timeline


	Date	Time	Comment
Actually started	2008-11-01	01:32:42	First drive failed
Fault first detected	2008-11-02	04:02:57	Nagios
First Advisory Issued	2008-11-03	14:00	Gareth Smith reported the problem at the WLCG daily operations meeting.
First Intervention	2008-11-03	09:00:00	James Adams takes a look and confirms data is unrecoverable.
Fault Fixed			When was the problem resolved
Announced as Fixed			How, to who
Downtime(s) Logged in GOCDB	n/a	n/a	none
Other Advisories Issued	2008-11-03	12:24	Gareth Smith emailed atlas-uk-comp-operations@cern.ch.
Other Advisories Issued	2008-11-03	n/a	Brian Davies remained in contact with ATLAS.
