
RAL-LCG2 Tier1 Incident 20180103 Atlas Data Server Failure

Description:

On Wed 3rd Jan 2018, gdss745 (atlasStripInput, d1t0) started reporting a large number of errors (3 drive errors and 1 virtual drive off-line). The server was taken out of production the same day. Work on the server was attempted, unsuccessfully, for the next 5 days. On finding the CASTOR partitioning corrupt, total data loss was declared on the 8th.

Impact

The complete loss of the server resulted in the loss of ~4.4 million files for ATLAS.

Timeline of the Incident

When - What
3rd Jan 2018 - 09:34: gdss745 reporting a high media error rate.
3rd Jan 2018 - 14:36: Missing drive replaced, but it is not rebuilding because there is a high number of media errors on the other drives. gdss745 put into read-only (RO) mode in an attempt to reduce loading.
3rd Jan 2018 - 14:48: CASTOR set to RO by JPK.
3rd Jan 2018 - 15:41: gdss745 taken out of production as per CW's request. ATLAS informed that the server is off-line - JPK.
4th Jan 2018 - 10:33: CW analysis: 3 physical drives down and 1 virtual drive off-line, "Drive is a complete mess!"
4th Jan 2018 - 13:29: File list of temporarily unavailable files sent to ATLAS - JPK.
5th Jan 2018 - 08:55: Another drive replaced and a rebuild started.
8th Jan 2018 - 10:27: CASTOR partitioning found to be corrupted; complete data loss confirmed. ATLAS informed of the data loss.

Incident details


A few notes from John Kelly: the main RT ticket for this is RT#200396. The data loss appears to have been dealt with in RT#200429. There does not appear to be an elog entry for this incident.

Analysis

RA has questioned the total number of files lost and believes the reported lost-file count to be inaccurate. This is based on the fact that a typically full server should only contain ~600k files +/- 40%.
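
As a rough illustration of the discrepancy (the ~600k figure and the +/- 40% spread are taken from the estimate above; no other numbers are assumed):

    # Rough sanity check of the reported loss against the per-server estimate.
    typical = 600_000                          # files on a typically full server
    low, high = typical * 0.6, typical * 1.4   # +/- 40% spread
    reported = 4_400_000                       # files reported lost
    print(f"expected range: {low:,.0f} - {high:,.0f} files")
    print(f"reported loss:  {reported:,} files (~{reported / typical:.0f}x the typical estimate)")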

There appears to have been a lack of, or failure in, the monitoring of failing drives. This became apparent when CW reported seeing drive failures from the warning lights on the drives themselves, rather than from Nagios alerts. One can then speculate that drives were failing over the holiday period and Nagios was not picking this up.

It was also suggested that a Nagios test be created to alert, with a call-out, when drive errors exceed a given limit, e.g. >1000 errors per drive.
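
A minimal sketch of what such a check might look like, assuming per-drive error counts can be read with smartctl (smartmontools); the device list, the SMART attributes parsed and the 1000-error threshold are illustrative assumptions, not the actual Tier1 monitoring configuration:

    # Sketch of a Nagios-style check: call out if any drive reports too many errors.
    import subprocess, sys

    THRESHOLD = 1000                       # call-out limit suggested above
    DRIVES = ["/dev/sda", "/dev/sdb"]      # hypothetical device list

    worst = 0
    for dev in DRIVES:
        out = subprocess.run(["smartctl", "-A", dev],
                             capture_output=True, text=True).stdout
        for line in out.splitlines():
            # The raw value is the last column of the SMART attribute table.
            if "Reported_Uncorrect" in line or "Current_Pending_Sector" in line:
                worst = max(worst, int(line.split()[-1]))

    if worst > THRESHOLD:
        print(f"CRITICAL - {worst} media errors (limit {THRESHOLD})")
        sys.exit(2)                        # Nagios CRITICAL -> call-out
    print(f"OK - worst drive reports {worst} media errors")
    sys.exit(0)                            # Nagios OK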

In essence, discovering the failed drives via the first physical check of the year, rather than through monitoring, is not an ideal scenario.

CW and KH (who, it should be noted, was on leave during this period) were in contact during this time about the best course of action. KH was of the opinion that once the 3rd drive had failed there was little realistic hope of recovering any data - to paraphrase, "three gone, all gone!"

KH suggested the following new policy for dealing with drive failures (a sketch of this policy as a simple helper function follows the list):

   1 drive - worried - set RO
   2 drives - very worried - disable and take out of production
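
A minimal sketch of how this policy might be encoded for on-call documentation or automation (the function name and return strings are hypothetical):

    # Hypothetical helper encoding the drive-failure policy suggested above.
    def action_for_failed_drives(failed: int) -> str:
        if failed >= 2:
            return "very worried - disable and take out of production"
        if failed == 1:
            return "worried - set read-only (RO)"
        return "no action"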
   

It was confirmed that KH was swapping out drives until 27th Dec, whereupon MB took over as on-call. These errors must therefore have occurred during the week-long period up to the 3rd. The failure rate was considered unusually high for that period, so we could consider ourselves unlucky. However, a week without drive observation/monitoring is too long - these drives can and will fail.

It was noted that the failing drives were all from the 2013 batch. The attached figure shows the number of drive failures against batch. Although the values have not been normalised (the number of drives per batch was not available at the time of writing), it is still indicative of a higher failure rate in the 2013 batch.

[Figure: drive failures by purchase batch - Drv fails.jpeg]
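
If the per-batch drive counts become available, normalising the failure figures is straightforward; a minimal sketch (the counts below are placeholders, not the real figures):

    # Sketch of normalising drive failures by purchase batch.
    # Both dictionaries hold placeholder numbers - the real per-batch drive
    # counts were not available at the time of writing.
    failures   = {"2013": 12, "2014": 3, "2015": 2}
    batch_size = {"2013": 500, "2014": 600, "2015": 700}

    for batch, failed in failures.items():
        rate = failed / batch_size[batch]
        print(f"{batch} batch: {failed} failures / {batch_size[batch]} drives = {rate:.1%} failure rate")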

The following observations were made:

   * Drives with a known limited lifetime - believed to be 3 years - were purchased. It needs to be clarified whether these are the 2013 batch.
   * Replacement drives for these servers are NOT new; they are part of the 10% spares from decommissioning.
   * A brief analysis of the disk server failures shows a larger failure rate for the 2013 servers, which are equipped with Western Digital drives.
   * It has been suggested that there may be some merit in purchasing a number of new drives as spares for any failures.

Follow Up



Issue: Lack of visibility of failing (rather than failed) disk drives; we rely on manual checking (either by looking at Nagios output or at the drive warning lights).
Response: Modify the Nagios test to catch failing drives (e.g. call-out when the error rate exceeds 1000).
Done: No

Issue: Disks may be failing owing to end of life.
Response: Verify whether these disks do have a limited lifetime and whether we are approaching that figure.
Done: No

Issue: The use of second-hand drives as spares adds a further risk of drive failure when replacing a drive; this may be particularly relevant in the case of multiple disk drive failures within a node.
Response: Consider purchasing some new spare drives.
Done: No


Reported by: Your Name at date/time

Summary Table

Start Date: 3rd January 2018
Impact: Select one of: >80%, >50%, >20%, <20%
Duration of Outage: Hours e.g. 3 hours
Status: Select one from: Draft, Open, Understood, Closed
Root Cause: Hardware
Data Loss: Yes