RAL Tier1 Incident 20180103 Atlas Data Server Failure

From GridPP Wiki
Revision as of 10:28, 15 February 2018 by John Kelly f76decd405


TITLE - include RAL-LCG2 in title

Description:

On Wednesday 3rd January 2018, gdss745 (atlasStripInput, d1t0) started reporting a large number of errors (3 physical drive errors and 1 virtual drive off-line). The server was taken out of production the same day. Attempts to recover the server over the next 5 days were unsuccessful. On finding the CASTOR partitioning corrupt, total data loss was declared on 8th January.

Impact

The complete loss of the server resulted in the loss of ~4.4 million ATLAS files.

Timeline of the Incident

When What
3rd Jan 2018 - 09:34 gdss745 reporting high media error rate
3rd Jan 2018 - 14:36 gdss745 set to RO in an attempt to reduce load.
3rd Jan 2018 - 14:48 CASTOR set to RO by JPK
3rd Jan 2018 - 15:41 gdss745 taken out of production as per CW request. ATLAS informed of server off-line - JPK
4th Jan 2018 - 10:33 CW analysis: 3 physical drives down and 1 virtual drive off-line. "Drive is a complete mess!"
4th Jan 2018 - 13:29 File list of temporarily unavailable files sent to ATLAS - JPK
5th Jan 2018 - 08:55 Another drive replaced and rebuild started
8th Jan 2018 - 10:27 CASTOR partitioning found to be corrupted; complete data loss confirmed. ATLAS informed of data loss.

Incident details

Put a reasonably detailed description of the incident here.

A few notes from John Kelly: the main RT ticket for this is RT200396. The data loss appears to have been dealt with in RT200429. There does not appear to be an elog for this.


Analysis

This section to include a breakdown of what happened. Include any related issues.


Follow Up

This is what we used to call future mitigation. Include specific points to be done. It is not necessary to use the table below, but may be easier to do so.


Issue Response Done
Issue 1 Mitigation for issue 1. Done yes/no
Issue 2 Mitigation for issue 2. Done yes/no

Related issues

List any related issue and provide links if possible. If there are none then remove this section.


Reported by: Your Name at date/time

Summary Table

Start Date Date e.g. 20 July 2010
Impact Select one of: >80%, >50%, >20%, <20%
Duration of Outage Hours e.g. 3hours
Status select one from Draft, Open, Understood, Closed
Root Cause Select one from Unknown, Software Bug, Hardware, Configuration Error, Human Error, Network, User Load
Data Loss Yes/No