RAL Tier1 Incident 20090805 Data loss following multiple disk failures


Site: Name of Site (eg RAL-LCG2)

Incident Date: 2009-08-05

Severity: Field not defined yet

Service: CASTOR

Impacted: ATLAS simStrip

Incident Summary: Disk server gdss169, configured as RAID5 plus a hot spare, lost two drives (ports 9 and 15) on 5th August 2009. Prompt action saved the data. The system then lost a further three drives (ports 9, 13 and 15) following an air-conditioning failure in the machine room (HPD area). The drives in ports 9 and 15 failed on 12th August 2009 (within a couple of minutes of each other), and the drive in port 13 failed on 17th August 2009 after the disk servers were powered back on.
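
The summary turns on how much loss a RAID5 array with a hot spare can absorb. The following is a toy Python model of that general property only: single-parity RAID5 can run with one member missing, a hot spare lets it rebuild back to full redundancy, and a second missing member makes the array inoperable. The class and method names are hypothetical, and it does not model the controller in gdss169 or reproduce the exact sequence of this incident.

```python
class Raid5Array:
    """Toy model of a single-parity array with optional hot spares (illustrative only)."""

    def __init__(self, hot_spares=1):
        self.missing = 0              # members failed and not yet rebuilt
        self.hot_spares = hot_spares  # spare drives available for a rebuild
        self.operable = True

    def fail_member(self):
        """Record the loss of one array member."""
        self.missing += 1
        if self.missing > 1:
            # Single parity can reconstruct at most one missing member.
            self.operable = False

    def rebuild(self):
        """Rebuild onto a spare; returns True if redundancy was restored."""
        if self.operable and self.missing == 1 and self.hot_spares > 0:
            self.hot_spares -= 1
            self.missing = 0
            return True
        return False


array = Raid5Array(hot_spares=1)
array.fail_member()          # first failure: degraded but readable
restored = array.rebuild()   # the hot spare takes over
array.fail_member()          # second failure, no spare left: degraded again
array.fail_member()          # third failure while degraded: array inoperable
```

In this sketch the array survives failures that are separated by a completed rebuild, but not two members missing at the same time.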

Type of Impact: Data Loss

Incident duration:

Report date: 2009-08-18

Reported by: Kashif Hafeez, Tier1 Fabric Team

Related URLs:

Incident details:

Detailed timeline of events:

Date Time Who/What Entry
04/08/2009 17:28:20 Nagios Issued alarm:

 	Aug 04, 2009 05:28.20PM 	(0x04:0x0002): Degraded unit: unit=0, port=15
 	Aug 04, 2009 05:28.20PM 	(0x04:0x0009): Drive timeout detected: port=15:
05/08/2009 08:44:39 Kashif Hafeez Created ticket (RT # 48759) and reported the failed drive in port 15 to Viglen (Wednesday 5th August 2009 at 08:35).
06/08/2009 09:56:46 Kashif Hafeez Noticed that the system had another faulty drive in port 9 (no log messages for the drive in port 9).
06/08/2009 09:56:46 Shaun System taken out of production in coordination with the Castor team.
06/08/2009 10:37:35 Kashif Hafeez Replaced the drive in port 9 (borrowed from gdss87 port 15); rebuild started on port 9.

 	Aug 06, 2009 10:05.56AM 	(0x04:0x001A): Drive inserted: port=9
 	Aug 06, 2009 10:05.37AM 	(0x04:0x0019): Drive removed: port=9
06/08/2009 10:39:08 Kashif Hafeez Replaced the drive in port 15 and added it as the hot spare (new drive received from Viglen).

 	Aug 06, 2009 10:10.13AM 	(0x04:0x000B): Rebuild started: unit=0
 	Aug 06, 2009 10:09.57AM 	(0x04:0x001A): Drive inserted: port=15


07/08/2009 11:06:17 James Thorne Acknowledged that the rebuild had completed and asked the Castor team to put the system back into production.

 	Aug 07, 2009 12:05.37AM 	(0x04:0x0005): Rebuild completed: unit=0
07/08/2009 15:34:13 Chris Confirmed that the system was back in production.
11/08/2009 23:44:38 Syslogs Issued soft alarm:
(0x04:0x004B):Battery temperature is high 
12/08/2009 12:15:23 Syslogs Issued hard alarm:
(0x04:0x004D):Battery temperature is too high 
12/08/2009 12:20:47 Syslogs Two drives failed, in ports 15 and 9.

 	Aug 12, 2009 12:23.05AM 	(0x04:0x0042): Primary DCB read error occurred: port=9, error=0x204
 	Aug 12, 2009 12:22.35AM 	(0x04:0x005E): Cache synchronization completed: unit=0
 	Aug 12, 2009 12:22.35AM 	(0x04:0x0042): Primary DCB read error occurred: port=9, error=0x204
 	Aug 12, 2009 12:20.47AM 	(0x04:0x0009): Drive timeout detected: port=15

12/08/2009 13:00:16 Martin Bly/James Thorne Powered off the Tier1 disk servers and batch systems because of the air-conditioning failure in the machine room (HPD area).
17/08/2009 10:30:53 James Thorne Powered on the Tier1 disk servers.
17/08/2009 10:38:53 Syslogs Another drive failure in port 13.

 	Aug 17, 2009 10:38.54AM 	(0x04:0x0009): Drive timeout detected: port=13
 	Aug 17, 2009 10:38.54AM 	(0x04:0x0009): Drive timeout detected: port=13
 	Aug 17, 2009 10:38.54AM 	(0x04:0x0009): Drive timeout detected: port=13
 	Aug 17, 2009 10:38.53AM 	(0x04:0x0009): Drive timeout detected: port=13
 	Aug 17, 2009 10:38.53AM 	(0x04:0x000A): Drive error detected: unit=0, 
 	Aug 17, 2009 10:38.53AM 	(0x04:0x0009): Drive timeout detected: port=13
17/08/2009 11:30:00 James Thorne Noticed that the system had failed drives in ports 9 and 13.
17/08/2009 14:01:00 John Kelly Created RT # 49105.
17/08/2009 15:00:00 Kashif Hafeez Asked the Castor team to take the system out of production and requested a spare disk server to copy the data to. The Castor team nominated gdss273.
17/08/2009 15:25:00 James Thorne Tried to copy the data from gdss169 to gdss273.
17/08/2009 15:30:54 James Thorne Copy failed (the array was inoperable).
17/08/2009 15:35:21 Kashif Hafeez Replaced the drive in port 9 (borrowed from gdss87 port 14) and power-cycled the system, but the array did not recover.
17/08/2009 16:01:00 James Thorne/Kashif Hafeez Informed the Castor and Production teams that the data was irrecoverable.
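
The controller messages quoted in the timeline share a regular shape: a timestamp, a pair of hexadecimal codes in parentheses, a message, and optional `key=value` fields. As a minimal sketch, assuming that shape holds for all such lines, they could be parsed in Python to list the ports that reported a drive timeout. The pattern is inferred only from the excerpts above, not from vendor documentation, and the names `LOG_RE`, `parse_event` and `ports_with_timeouts` are hypothetical.

```python
import re
from datetime import datetime

# Inferred from the excerpts in this report, e.g.
#   "Aug 17, 2009 10:38.54AM  (0x04:0x0009): Drive timeout detected: port=13"
LOG_RE = re.compile(
    r"^\s*(?P<ts>\w{3} \d{2}, \d{4} \d{2}:\d{2}\.\d{2}[AP]M)\s+"
    r"\((?P<cls>0x[0-9A-Fa-f]+):(?P<code>0x[0-9A-Fa-f]+)\):\s*"
    r"(?P<msg>[^:]+?)(?::\s*(?P<attrs>.*))?$"
)


def parse_event(line):
    """Return a dict for one log line, or None if the line does not match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    when = datetime.strptime(m.group("ts"), "%b %d, %Y %I:%M.%S%p")
    # Collect key=value fields such as unit=0 or port=15.
    attrs = dict(re.findall(r"(\w+)=(\w+)", m.group("attrs") or ""))
    return {
        "when": when,
        "code": m.group("code"),
        "message": m.group("msg").strip(),
        "attrs": attrs,
    }


def ports_with_timeouts(lines):
    """Sorted port numbers that logged code 0x0009 (drive timeout in the excerpts)."""
    ports = set()
    for line in lines:
        event = parse_event(line)
        if event and event["code"] == "0x0009" and "port" in event["attrs"]:
            ports.add(int(event["attrs"]["port"]))
    return sorted(ports)
```

Lines that do not carry a leading timestamp, such as the battery temperature alarms, are skipped by this sketch rather than parsed.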


Future mitigation:

Free text description of how site plans to minimise future occurrences

Related issues:

Anything else relevant

Timeline

Date Time Comment
Actually Started 2009-08-12 12:20:47 Two drives failed (ports 15 and 9)
Fault first detected 2009-08-12 12:20:47 Syslogs/Admin/User
First Advisory Issued How/To who
First Intervention When you first tried to intervene
Fault Fixed When was the problem resolved
Announced as Fixed How, to who
Downtime(s) Logged in GOCDB at risk/unscheduled down (what components/VOs) repeat as necessary
Other Advisories Issued Where etc repeat as necessary