RAL Tier1 Incident 20090805 Data loss following multiple disk failures

Site: Name of Site (eg RAL-LCG2)

Incident Date: 2009-08-05

Severity: Field not defined yet

Service: CASTOR

Impacted: ATLAS simStrip

Incident Summary: Disk server gdss169, configured as RAID5 plus hot spare, lost two drives (ports 9 and 15) on 5th August 2009. Prompt action saved the data on that occasion. The system then lost a further three drives (ports 9, 13 and 15) due to an air conditioning failure in the machine room (HPD area). The drives in ports 9 and 15 failed within a couple of minutes of each other on 12th August 2009, and the drive in port 13 failed on 17th August 2009 after the disk servers were powered back on.

Type of Impact: Data Loss

Incident duration:

Report date: 2009-08-18

Reported by: Kashif Hafeez, Tier1 Fabric Team

Related URLs:

Incident details:

Detailed timeline of events:


Date	Time	Who/What	Entry
04/08/2009	17:28:20	Nagios	Issued alarm: `Aug 04, 2009 05:28.20PM (0x04:0x0002): Degraded unit: unit=0, port=15 Aug 04, 2009 05:28.20PM (0x04:0x0009): Drive timeout detected: port=15:`
05/08/2009	08:44:39	Kashif Hafeez	Ticket created (RT # 48759) and failed drive in port 15 reported to Viglen (Wednesday 5th August 2009 at 08:35).
06/08/2009	09:56:46	Kashif Hafeez	Noticed that the system had another faulty drive in port 9 (no log messages for drive 9).
06/08/2009	09:56:46	Shaun	System taken out of production in coordination with the CASTOR team.
06/08/2009	10:37:35	Kashif Hafeez	Replaced drive in port 9 and rebuild started on port 9. (Borrowed from gdss87 Port 15) `Aug 06, 2009 10:05.56AM (0x04:0x001A): Drive inserted: port=9 Aug 06, 2009 10:05.37AM (0x04:0x0019): Drive removed: port=9`
06/08/2009	10:39:08	Kashif Hafeez	Replaced drive in port 15 and added it as hot spare. (New drive received from Viglen) `Aug 06, 2009 10:10.13AM (0x04:0x000B): Rebuild started: unit=0 Aug 06, 2009 10:09.57AM (0x04:0x001A): Drive inserted: port=15`
07/08/2009	11:06:17	James Thorne	Acknowledged that the rebuild had completed and asked the CASTOR team to put the system back into production. `Aug 07, 2009 12:05.37AM (0x04:0x0005): Rebuild completed: unit=0`
07/08/2009	15:34:13	Chris	Confirmed that the system was back in production.
11/08/2009	23:44:38	Syslogs	Issued soft alarm: `(0x04:0x004B):Battery temperature is high`
12/08/2009	12:15:23	Syslogs	Issued hard alarm: `(0x04:0x004D):Battery temperature is too high`
12/08/2009	12:20:47	Syslogs	Two drives failed in port 15 and 9. `Aug 12, 2009 12:23.05AM (0x04:0x0042): Primary DCB read error occurred: port=9, error=0x204 Aug 12, 2009 12:22.35AM (0x04:0x005E): Cache synchronization completed: unit=0 Aug 12, 2009 12:22.35AM (0x04:0x0042): Primary DCB read error occurred: port=9, error=0x204 Aug 12, 2009 12:20.47AM (0x04:0x0009): Drive timeout detected: port=15`
12/08/2009	13:00:16	Martin Bly/James Thorne	Powered off Tier1 disk servers and batch systems due to air conditioning failure in the machine room (HPD area).
17/08/2009	10:30:53	James Thorne	Powered on Tier1 disk servers.
17/08/2009	10:38:53	Syslogs	Another drive failure in port 13. `Aug 17, 2009 10:38.54AM (0x04:0x0009): Drive timeout detected: port=13 Aug 17, 2009 10:38.54AM (0x04:0x0009): Drive timeout detected: port=13 Aug 17, 2009 10:38.54AM (0x04:0x0009): Drive timeout detected: port=13 Aug 17, 2009 10:38.53AM (0x04:0x0009): Drive timeout detected: port=13 Aug 17, 2009 10:38.53AM (0x04:0x000A): Drive error detected: unit=0, Aug 17, 2009 10:38.53AM (0x04:0x0009): Drive timeout detected: port=13`
17/08/2009	11:30:00	James Thorne	Noticed that the system had failed drives in ports 9 and 13.
17/08/2009	14:01:00	John Kelly	Created RT # 49105.
17/08/2009	15:00:00	Kashif Hafeez	Asked the CASTOR team to take the system out of production and requested a spare disk server to copy the data to. The CASTOR team nominated gdss273.
17/08/2009	15:25:00	James Thorne	Attempted to copy data from gdss169 to gdss273.
17/08/2009	15:30:54	James Thorne	Copy failed (array was inoperable).
17/08/2009	15:35:21	Kashif Hafeez	Replaced drive in port 9 (borrowed from gdss87 port 14) and power-cycled the system, but the array did not recover.
17/08/2009	16:01:00	James Thorne/Kashif Hafeez	Informed the CASTOR and Production teams that the data was irrecoverable.
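
The controller messages quoted in the timeline share a regular shape (timestamp, event code in brackets, free text with `port=N`). As an illustration only, the following minimal Python sketch shows how such lines could be parsed to list the ports that reported a drive timeout. The function names `parse_event` and `timeout_ports` are hypothetical, the pattern is inferred solely from the messages quoted above, and the meaning of code `0x04:0x0009` is taken from the "Drive timeout detected" entries in this report rather than from vendor documentation. It was not part of the tooling used during the incident.

```python
import re
from datetime import datetime

# Shape inferred from the quoted messages, e.g.
# "Aug 04, 2009 05:28.20PM (0x04:0x0009): Drive timeout detected: port=15"
EVENT_RE = re.compile(
    r"(?P<ts>[A-Z][a-z]{2} \d{2}, \d{4} \d{2}:\d{2}\.\d{2}[AP]M) "
    r"\((?P<code>0x[0-9A-Fa-f]+:0x[0-9A-Fa-f]+)\):\s*"
    r"(?P<text>.+)"
)
PORT_RE = re.compile(r"port=(\d+)")

# Assumption: this code marks a drive timeout, as in the timeline entries.
DRIVE_TIMEOUT_CODE = "0x04:0x0009"


def parse_event(line):
    """Return a dict for one controller message, or None if it does not match."""
    match = EVENT_RE.match(line.strip())
    if match is None:
        return None
    # Timestamps use a 12-hour clock with a dot between minutes and seconds.
    when = datetime.strptime(match.group("ts"), "%b %d, %Y %I:%M.%S%p")
    port = PORT_RE.search(match.group("text"))
    return {
        "when": when,
        "code": match.group("code"),
        "text": match.group("text"),
        "port": int(port.group(1)) if port else None,
    }


def timeout_ports(lines):
    """Return the sorted list of ports that reported at least one drive timeout."""
    ports = set()
    for line in lines:
        event = parse_event(line)
        if event and event["code"] == DRIVE_TIMEOUT_CODE and event["port"] is not None:
            ports.add(event["port"])
    return sorted(ports)
```

On a RAID5 unit, a result listing more than one port at the same time would mean the array has lost more members than it can tolerate, which is the condition gdss169 reached on 17/08/2009.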

Future mitigation:

Free text description of how site plans to minimise future occurrences

Related issues:

Anything else relevant

Timeline


	Date	Time	Comment
Actually Started	2009-08-12	12:20:47	Two drives failed (Port 15 and 9)
Fault first detected	2009-08-12	12:20:47	Syslogs/Admin/User
First Advisory Issued			How/To who
First Intervention			When you first tried to intervene
Fault Fixed			When was the problem resolved
Announced as Fixed			How, to who
Downtime(s) Logged in GOCDB			at risk/unscheduled down (what components/VOs) repeat as necessary
Other Advisories Issued			Where etc repeat as necessary
