RAL Tier1 Incident 20100515 Disk Server Outage


RAID 6 Disk Server (LHCb) unavailable for significant time after disk failure

Status: Closed

Site: RAL-LCG2

Incident Date: 2010-05-15

Severity: Moderate

Service: CASTOR Disk

Impacted: LHCb

Incident Summary: The failure of a disk drive caused the server to fail, which in turn led to the service class running out of space. The lengthy RAID rebuild time, combined with a second disk showing problems, led to extended unavailability of the server.

Type of Impact: Lack of access to data. Lack of space to store data.

Incident duration: 4 days

Report date: 2010-05-19

Reported by: Gareth Smith, Tiju Idiculla, Shaun DeWitt

Related URLs: GGUS Ticket 58253: https://gus.fzk.de/ws/ticket_info.php?ticket=58253

Incident details:

During the night of 14/15 May disk server gdss380 crashed and was not reachable. The RAL Tier1 Primary On-call disabled the server in Castor as per documented procedures. (Procedures currently require a crashed disk server to be checked before being re-enabled for service, in order to protect data.) This server is part of a Castor D1T1 service class. The following day (Sunday 16th May) a GGUS Team ticket was received from LHCb indicating a problem on this service class, which had become full. That day an additional server was deployed into the service class to resolve that problem.

On Monday (17th May) LHCb were informed of the list of files rendered unavailable and given an estimated time for the server to be returned to service. On Tuesday (18th May) actions were taken to migrate the remaining files to tape and to give LHCb access to those files while the server was out of production.
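The bulk recall referred to above (and in the timeline below) was driven with the standard CASTOR stager client, issuing a stager_get per file against the service class. A minimal sketch of how such a recall might be scripted is given here; the file-list path is a hypothetical placeholder, and the lhcbmdst service class name is taken from the timeline entry for the replacement server.

 #!/usr/bin/env python
 # Sketch only: force tape recalls for files that were resident on the
 # disabled disk server by issuing stager_get requests into the service class.
 # FILE_LIST is a hypothetical path; lhcbmdst is the service class named below.
 import subprocess
 
 FILE_LIST = "/tmp/gdss380_files.txt"
 SVC_CLASS = "lhcbmdst"
 
 def recall(castor_path):
     # stager_get -M <file> -S <service class> queues a recall request
     return subprocess.call(["stager_get", "-M", castor_path, "-S", SVC_CLASS])
 
 with open(FILE_LIST) as f:
     for line in f:
         path = line.strip()
         if path:
             recall(path)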

The disk server problems were triggered by a problem with a disk drive. However, as this is a RAID 6 system the server should have carried on working transparently to Castor (and LHCb). Manual intervention on Monday enabled the server to reboot and the rebuild of the RAID array to start. However, a second disk drive was also reporting problems at this time. This combination (server failure, multiple disk problems) led to the decision to keep the server out of production while the rebuild took place. However, rebuild times on these servers are very long (greater than two days).
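As a rough illustration of why such a rebuild can exceed two days, the sketch below estimates the rebuild time from drive capacity and a sustained rebuild rate. Both figures are assumptions chosen for illustration; the actual drive size and rebuild rate for this batch of servers are not recorded in this report.

 # Rough RAID rebuild time estimate (illustrative assumptions only).
 # A rebuild rewrites one full replacement drive, so the wall-clock time is
 # roughly drive capacity divided by the sustained rebuild rate under load.
 capacity_gb = 1000         # assumed capacity of one member drive (GB)
 rebuild_rate_mb_s = 5      # assumed sustained rebuild rate under load (MB/s)
 
 seconds = (capacity_gb * 1024.0) / rebuild_rate_mb_s
 print("Estimated rebuild time: %.1f days" % (seconds / 86400.0))
 # With these assumptions the rebuild takes roughly 2.4 days, consistent
 # with the 'greater than two days' figure quoted above.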

As a replacement server had been provided, and access to all files that were on gdss380 had been restored, this incident was declared closed on Wednesday 19th May.

Future mitigation:

Issue: Extended outage when a disk within a RAID 6 system fails.
Response: The failure of a RAID 6 disk server when a disk within the array fails is not expected. This has been seen on three occasions within this particular batch of servers. It does not occur every time a disk fails. Action is ongoing to understand the causes of these server failures and address them with the vendor.

Issue: Notifications to LHCb.
Response: A review of the RAL Tier1 disk server intervention procedure was carried out on 19th May. This looked at when servers can be restored to service and when VOs should be contacted.

Related issues:

About two weeks before this incident, a similar failure on a similar server led to the extended unavailability of a disk server used by ATLAS.

Timeline

Date Time Who/What Entry
2010-05-15 00:42:11 Tiju Idiculla (Primary Oncall) Nagios callout: disk server not reachable. Disabled machine in Castor. (RT#60140)
2010-05-15 11:00 Tiju Idiculla (Primary Oncall) Informed Castor Oncall and Fabric Oncall about the disk server. (RT#60140)
2010-05-15 13:04 Kash (Hardware Technician) Commenced remote investigation. (RT#60142)
2010-05-16 10:21:32 Tiju Idiculla (Primary Oncall) Call from John about GGUS ticket 58253. Informed Castor Oncall. (RT#60146)
2010-05-16 12:15 Matthew Viljoen (Castor Oncall) Deployed gdss458 into lhcbmdst (RT#60146)
2010-05-17 Lunchtime Shaun DeWitt Passed LHCb the list of files on the disk server
2010-05-17 afternoon Shaun DeWitt Informed LHCb of likely time the disk server would be unavailable
2010-05-18 09:30 Shaun DeWitt Moved the files by setting the disk server into the DRAINING state and issuing a stager_get into the correct service class
2010-05-18 10:00 Shaun DeWitt All files migrated to tape
2010-05-18 11:00 Shaun DeWitt Received request to restage all files on disabled disk server
2010-05-18 11:30 Shaun DeWitt Added two more disk servers since space was marginal for the bulk recall
2010-05-18 12:00 Shaun DeWitt Issued a stager_get for all files online on gdss380 to force tape recalls
2010-05-18 16:30 Shaun DeWitt About 2000 of 11000 files recalled from tape
2010-05-19 09:00 Shaun DeWitt All files staged
2010-05-19 15:30 Gareth Smith Internal (RAL Tier1) review of disk intervention procedures.