RAL Tier1 Incident 20100916 Second Failure of Disk Server - CMS data loss


Report in Progress

Description

Disk server GDSS280, part of the CMSFarmRead Castor service class, reported FSProbe errors on 19th August 2010. It was returned to production on 15th September but very soon reported further FSProbe errors and was removed from service the next day. This time the file system was corrupted and 30 files that had not been migrated to tape were lost.

Impact

CMSFarmRead is a D0T1 service class. The first failure of the disk server did not lead to the loss of any data and, apart from a reduction in the capacity (and bandwidth) of that service class for an extended period (19th August to 15th September), did not cause CMS significant problems. The second failure resulted in the loss of thirty CMS files. There was a further period of server unavailability during the re-testing carried out to find the underlying hardware fault.

Timeline of the Incident

When What
Thu Aug 19 05:27 FSProbe reports problem on disk server GDSS280.
Thu Aug 19 08:06 Found lots of 'EXT3-fs error (device sda1): ext3_free_branches: Read failure' errors on screen. (Kashif)
Thu Aug 19 10:32 No error found in DMI and IPMI logs. Started memory test. (Kashif)
Mon Aug 23 11:37 Passed memory test. (4 days) System didn't come up after reboot. Got stuck during file system check. (Kashif)
Tue Aug 24 11:51 Updated ticket with fault image. (Kashif)
Wed Sep 08 10:43 Found errors on Castor1 partition (/dev/sda1). Started fsck on Castor1. File system came up in read only mode after finishing fsck. (James T)
Mon Sep 13 15:32 Informed Castor team that we need to re-create file system. Matt V confirmed that we can re-create file system. (James T)
Mon Sep 13 15:44 Re-created file system. (James T)
Wed Sep 15 08:59 Checked all logs and found no hardware fault. Request Castor team to put system back into production. (Kashif)
Wed Sep 15 10:20 Castor team return server to production.
Wed Sep 15 22:58 FSProbe errors reported again. Requested and received confirmation from the Production/Castor teams to start the acceptance test on gdss280.
Thu Sep 16 08:45 Server removed from production again.
Thu Sep 16 15:06 Requested confirmation from the Castor team that there is no data on the disk server, as the Fabric team wants to run the acceptance test on gdss280. (Kashif)
Mon Sep 20 16:58 Matt V confirms that they don't need the data on this disk server and it may be deleted. The Fabric team can start the intervention.
Wed Sep 22 09:42 Deleted array and created new array. Started initializing drives/array. (Kashif)
Tue Sep 28 11:06 Disk server got stuck while installing 'sl4-disk-viglen-2007-pretest-intel' for acceptance test. Informed James T. (Kashif)
Tue Sep 28 14:51 Started acceptance test with the help of James A. (Kashif)
Fri Oct 01 09:00 Server crashed during acceptance test. Found lots of 'Buffer I/O error' messages and controller resets.
Fri Oct 01 10:47 Raised vendor call with error logs. Suspected faulty RAID card. (Kashif)
Mon Oct 11 10:43 Received wrong type of RAID card (4 ports instead of 16 ports). Reported again on the same day. (Kashif)
Fri Oct 15 12:21 Received wrong type of RAID card again (24 ports instead of 16 ports). Reported again on the same day. (Kashif)
Tue Oct 19 17:06 Vendor does not have the card in stock and is trying to source one for us. (Kashif)

Incident details

On 19th August disk server GDSS280 failed with FSProbe errors. Problems were found with the filesystem holding user (Castor) data. The disk server is part of a Castor Disk0Tape1 service class and files are migrated from the disk server to tape. At the time of the failure on 19th August all files on GDSS280 had been copied to tape. It was therefore possible to re-create the filesystem without data loss, and this was done. Testing was then carried out on the server, although over a protracted length of time. As no further errors were found, the server was returned to production on 15th September. However, half a day later FSProbe errors were again reported. On investigation the filesystem was found to be corrupted. This time there were 30 files on the server that had not been migrated to tape; these were declared to the VO as lost.

Subsequently the server has been put through the aggressive acceptance tests we use for new disk servers and these have uncovered a hardware fault.

Analysis

FSProbe provides a very useful indication of a problem on a disk server. It regularly writes data to disk, reads it back and confirms it has not changed. It indicates that there is a problem on a disk server, but does not identify what that problem is. Tests were made between the time of the first failure of GDSS280 (19th August) and its first return to service (15th September), but no hardware fault was found. The outage was prolonged and there were periods when the server problems were not being followed up actively. Individual subsystems were tested, but these tests failed to identify that there was still a problem with the disk server. The system was returned to service based on these tests and failed again shortly afterwards. Since the second failure, more aggressive testing of the server (using the acceptance tests used for new server purchases) has been carried out and this has identified a problem on the server.
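
The check that FSProbe carries out can be pictured with a minimal sketch in Python. This is not the actual FSProbe code: the probe file path, block size and check interval are assumptions chosen only to illustrate the write/read-back/compare cycle described above.

 # Minimal sketch of an FSProbe-style check (illustration only; this is
 # not the actual FSProbe code). The probe file path, block size and
 # check interval below are assumptions chosen for the example.
 import os
 import time
 
 PROBE_FILE = "/srv/castor1/fsprobe-check.dat"  # hypothetical path on the data partition
 BLOCK_SIZE = 1024 * 1024                       # 1 MiB test block (illustrative)
 INTERVAL = 600                                 # seconds between checks (illustrative)
 
 def probe_once(path):
     """Write a random block, sync it to disk, read it back and compare."""
     data = os.urandom(BLOCK_SIZE)
     fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC)
     try:
         os.write(fd, data)
         os.fsync(fd)              # force the block out to the storage layer
     finally:
         os.close(fd)
     with open(path, "rb") as f:
         readback = f.read()
     return readback == data       # False indicates silent corruption
 
 if __name__ == "__main__":
     while True:
         if not probe_once(PROBE_FILE):
             # The real tool raises an alarm; here we simply report the failure.
             print("FSProbe-style check FAILED: data read back differs from data written")
         time.sleep(INTERVAL)

A failure of this read-back check, as happened on GDSS280, signals corruption somewhere in the I/O path (filesystem, RAID controller or disks) without saying which component is at fault.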

Follow Up

Issue: Disk server failed shortly after being returned to production.
Response: Where practical, such as in this case where the VO did not require the server back urgently, more extensive testing or re-certification should be carried out before the server is returned to production. Procedures should be modified so that this is included in standard operations.
Done: Yes

Issue: Requirement for a fast test regime for servers containing data that is required urgently.
Response: In cases where files on the server are required urgently but there is no useful indication of the fault from log files etc., a rapid test procedure needs to be defined. Whilst we do have a current set of tests, there is some optimisation that can be made. Note added following review on 28/06/11: the additional capabilities now available within Castor (see next item) have made this action much less important.
Done: N/A

Issue: Put the service into a read-only mode and drain files to enable tests to be run.
Response: At present (October 2010) Castor is being upgraded to version 2.1.9, which supports native disk draining. When 'draining', the server is effectively in a read-only mode. This could be used as a step when bringing a disk server back into production, before allowing full access. However, at present there is an issue whereby the FTS views files on Castor disk servers in draining mode as unavailable. (This is expected to be fixed in Castor version 2.1.10, but the roll-out of that is not expected for some months.) During November 2010 an alternative method of putting a disk server into a 'read only' mode that gets around the problem with the FTS was used. A review of the use of this read-only state, of draining and replacing disk servers, and of the appropriate set of tests to be used to quickly troubleshoot a disk server should be made. Note added following review on 28/06/11: procedures are now in place making use of the additional capabilities now available within Castor (i.e. draining and read-only servers). An illustrative sketch of the intended workflow follows this table.
Done: Yes

Issue: The disk server was unavailable for an extended time.
Response: Production disk servers that do not contain critical data (i.e. not in D1T0 service classes) do not need to be returned to service so urgently. Nevertheless, extended outages of these servers should be avoided where possible. Procedures should be reviewed to ensure such servers do not drop too far down the priority list. (Done by including a review of outstanding issues in the Castor-Fabric teams' weekly meetings.)
Done: Yes
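
As an illustration of the read-only (draining) step described in the third item above, the following Python sketch models a possible set of server states and allowed transitions. It is not Castor's actual interface; the state names and transition rules are assumptions used only to make the workflow explicit.

 # Illustrative model of the disk-server re-introduction workflow (not
 # Castor's actual interface). The state names and allowed transitions
 # are assumptions used only to make the intended procedure explicit.
 from enum import Enum, auto
 
 class ServerState(Enum):
     PRODUCTION = auto()   # full read/write access for the VO
     READ_ONLY = auto()    # read-only / draining: existing files still readable
     TESTING = auto()      # out of production, acceptance tests running
     REPAIR = auto()       # hardware intervention (e.g. with the vendor)
 
 # Assumed transitions: a repaired server must be tested and then spend a
 # period read-only before full production access is restored.
 ALLOWED = {
     ServerState.PRODUCTION: {ServerState.READ_ONLY, ServerState.TESTING},
     ServerState.READ_ONLY: {ServerState.TESTING, ServerState.PRODUCTION},
     ServerState.TESTING: {ServerState.READ_ONLY, ServerState.REPAIR},
     ServerState.REPAIR: {ServerState.TESTING},
 }
 
 def transition(current, target):
     """Move a server between states, refusing shortcuts such as going
     straight from REPAIR back to PRODUCTION without re-testing."""
     if target not in ALLOWED[current]:
         raise ValueError("%s -> %s is not allowed" % (current.name, target.name))
     return target
 
 if __name__ == "__main__":
     # gdss280-style scenario: after repair, pass the acceptance tests and a
     # read-only (draining) period before returning to full production.
     state = ServerState.REPAIR
     for nxt in (ServerState.TESTING, ServerState.READ_ONLY, ServerState.PRODUCTION):
         state = transition(state, nxt)
         print("server is now", state.name)

The design point mirrored here is that the read-only (draining) state sits between testing and full production, so a returning server can be exercised without accepting new writes.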

Related issues

The following Post Mortem also considers change to Disk Intervention Procedures:

https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20100801_Disk_Server_Data_Loss_Atlas

Reported by: Gareth Smith, 24th October 2010

Summary Table

Start Date: 19 August 2010
Impact: >80%
Duration of Outage: 60 days
Status: Closed
Root Cause: Hardware
Data Loss: Yes