RAL Tier1 Incident 20100630 Disk Server Data Loss CMS

Data Loss following intervention on disk server.

Status: Closed

Site: RAL-LCG2

Incident Date: 2010-06-30

Severity: Severe

Service: CASTOR Disk

Impacted: CMS

Incident Summary: Following the failure of a disk server, attempts to recover the file systems on the RAID array were unsuccessful and the contents of the disk server were lost. The server is in a tape-backed (D0T1) service class, but there were files awaiting writing to tape at the time of the failure; 1083 CMS files were lost. Before a further assessment could be made of the importance of the data, a mis-communication between staff meant that the disk array was rebuilt, removing any further chance of attempting data recovery, even though such a recovery was unlikely to succeed.

Type of Impact: Data Loss.

Incident duration: N/A

Report date: 2010-07-01

Reported by: Gareth Smith, Matthew Viljoen

Related URLs: None

Incident details:

Disk server GDSS67 is part of the CMSFarmRead service class within Castor. This is a "D0T1" service class, i.e. any files put into the service class are migrated to tape. How quickly files are migrated depends on the rate at which data is added to the service class: for CMS a threshold quantity of un-migrated data must accumulate before tape migration is triggered, and how quickly the migration then completes depends on the load on the tape systems. At any given time there are therefore likely to be some files on disk awaiting migration to tape.
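
To make the vulnerability window concrete, the following Python sketch illustrates a threshold-based migration trigger of the kind described above. It is a minimal illustration only: the threshold value and all class and function names are assumptions and do not come from the actual CASTOR migration policy code.

 # Minimal illustration of a threshold-based tape-migration trigger.
 # The threshold value and all names are assumptions for illustration;
 # this is not the actual CASTOR policy implementation.
 from dataclasses import dataclass
 from typing import List

 @dataclass
 class DiskFile:
     size_bytes: int
     migrated_to_tape: bool = False

 # Hypothetical threshold: migration is triggered once this much
 # un-migrated data has accumulated in the service class.
 MIGRATION_THRESHOLD_BYTES = 500 * 1024**3  # assumed value, e.g. 500 GB

 def pending_bytes(files: List[DiskFile]) -> int:
     """Total size of files still waiting to be written to tape."""
     return sum(f.size_bytes for f in files if not f.migrated_to_tape)

 def should_trigger_migration(files: List[DiskFile]) -> bool:
     """True once the un-migrated backlog exceeds the threshold.
     Until then the files exist only on disk, which is the window in
     which a disk-server failure can lose them (as happened here)."""
     return pending_bytes(files) >= MIGRATION_THRESHOLD_BYTES

The point of the sketch is simply that files below the threshold have no tape copy and sit on a single disk server until migration is triggered.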

Disk server GDSS67 failed twice in quick succession. It is likely, despite testing, that the cause was the same in both cases and had not been fully resolved before the server was put back into production after the first failure. The first time GDSS67 failed there were no files on the server waiting to be written to tape. The file system was recreated (which would erase any data on the disk) before the server returned to production.

When the server failed a second time several days later there was data present on the disk waiting for migration to tape. A list of these files (a total of 1083) that became unavailable after the server crashed was supplied to CMS as per our standard procedures. A regular (daily) meeting takes place between representatives of the Fabric and Castor teams to track any outstanding disk server issues.

At such a meeting the Castor and Fabric team representatives discussed the situation on this server (GDSS67). Having failed so far to recover the filesystem on the array, the Fabric team representative concluded that the data had been lost and proposed recreating the RAID array. Normal procedures require an assessment of the importance of the data before any action is taken that would finally erase the files, even if they are believed lost. However, a misunderstanding arose between those present about the criticality of the data on the server, whether it was probably already lost, and whether the proposed action by the Fabric Team would definitively erase it. Approval was given for the action to go ahead and this was recorded in the helpdesk ticket (#61336).

Future mitigation:

Issue: Mis-communication or mis-understanding led to an action on a disk server that erased data.
Response: Any proposed action as part of work on a disk server that will lead to a loss of data (or a risk of loss of data) must be recorded by the Fabric Team in the ticket with an explicit statement that files on disk will be lost or erased. This will then be signed off (by an update to the ticket) by the Castor Team before the action takes place; the decision will be based on consultation with the VO that owns the data about its importance (see the illustrative sketch after this list). The Disk Server intervention procedure documentation has been updated, and the relevant staff informed, to incorporate this change in procedure whenever there is a risk of file deletion.

Issue: Delay in announcing data loss.
Response: The likelihood of data loss should be indicated by the Fabric Team to the Castor Team as soon as it is realized that data has most likely been lost, rather than only after exhaustive tests. The Disk Server intervention procedure documentation has been updated, and the relevant staff informed, to incorporate this change in procedure.

Issue: The migration policy allows a large number of files to be vulnerable.
Response: A review of the CMS migration policy has been undertaken. It concluded that, while not ideal, the policy is about right given the constraints.
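
As a purely illustrative sketch of the sign-off requirement above, assuming hypothetical ticket fields and helper names (this is not the actual RAL helpdesk or CASTOR tooling), a destructive operation such as an array rebuild could be gated on both an explicit data-loss statement and a Castor Team sign-off recorded in the ticket:

 # Illustrative only: a sign-off gate for destructive disk-server actions.
 # Ticket fields, update text and function names are hypothetical and do
 # not reflect the actual RAL helpdesk or CASTOR tooling.
 from dataclasses import dataclass, field
 from typing import List

 @dataclass
 class Ticket:
     number: int
     updates: List[str] = field(default_factory=list)

     def has_data_loss_statement(self) -> bool:
         # Fabric Team must state explicitly that files will be lost or erased.
         return any("files on disk will be lost" in u.lower() for u in self.updates)

     def has_castor_signoff(self) -> bool:
         # Castor Team must sign off in the ticket before the action proceeds.
         return any(u.lower().startswith("castor team sign-off") for u in self.updates)

 def approve_destructive_action(ticket: Ticket) -> bool:
     # Only allow an array rebuild (or similar) when both conditions hold.
     return ticket.has_data_loss_statement() and ticket.has_castor_signoff()

 # Example usage (ticket number and update text are illustrative only).
 t = Ticket(number=61336)
 t.updates.append("Fabric Team: recreating the RAID array; files on disk will be lost.")
 t.updates.append("Castor Team sign-off: importance of data confirmed with CMS, proceed.")
 assert approve_destructive_action(t)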

Related issues:

None.

Timeline

Date | Time | Who/What | Entry
2010-05-20 | 09:11 | Fabric Team | FSPROBE errors and system crash on gdss67, following which the system was taken out of production.
2010-06-01 | - | Fabric Team | Memory replaced. FSPROBE restarted to see if it errors again.
2010-06-17 | 15:00 | Fabric Team | Having confirmed with the Castor Team that they can recreate the file system, this is done.
2010-06-18 | - | Fabric & Castor Teams | Server returned to production.
2010-06-23 | 09:33 | Fabric Team | More FSPROBE errors; request that the server be taken out of production again.
2010-06-23 | 15:40 | Castor Team | Server taken out of production. Fabric Team informed they can investigate.
2010-06-24 | 09:34 | Castor Team | Ticket reports that CMS have been given the list of un-migrated files and have reported back that there is nothing critical in the list and that they can tolerate a longer downtime of the server (up to a week).
2010-06-25 | 12:01 | Fabric Team | Following the daily discussion between the Fabric & Castor Teams, the Fabric Team confirms they will recreate the array from scratch.
2010-06-25 | 13:10 | Castor Team | Operation (recreation of the array) agreed to.
2010-06-30 | 16:00 | Castor Team | Realization of the mis-understanding.
2010-06-30 | 17:00 | Admin On Duty | E-mail to CMS informing them of the data loss.
2010-07-15 | 09:00 | Andrew Sansum, Matthew Viljoen and CMS contacts | Completion of the review of the CMS migration policy.
2010-08-03 | 12:00 | Gareth Smith, Matthew Viljoen and James Thorn | Updates to procedures for disk server interventions completed and appropriate staff informed.