RAL Tier1 Incident 20180103 Atlas Data Server Failure
TITLE - include RAL-LCG2 in title
Description:
On Wednesday 3rd January 2018, gdss745 (atlasStripInput, d1t0) started reporting a large number of errors (3 drive errors and 1 virtual drive offline). The server was taken out of production the same day. Attempts to recover the server over the next 5 days were unsuccessful. On finding the CASTOR partitioning corrupt, total data loss was declared on 8th January.
Impact
The complete loss of the server resulted in the loss of ~4.4 million files for ATLAS.
Timeline of the Incident
When | What |
---|---|
Date & maybe time e.g. 20th July 09:00 | Blah Team did something |
3rd Jan 2018 - 09:34 | Blah Team did something |
Incident details
Put a reasonably detailed description of the incident here.
Analysis
This section to include a breakdown of what happened. Include any related issues.
Follow Up
This is what we used to call future mitigation. Include specific points to be done. It is not necessary to use the table below, but may be easier to do so.
Issue | Response | Done |
---|---|---|
Issue 1 | Mitigation for issue 1. | Done yes/no |
Issue 2 | Mitigation for issue 2. | Done yes/no |
Related issues
List any related issue and provide links if possible. If there are none then remove this section.
Reported by: Your Name at date/time
Summary Table
Start Date | Date e.g. 20 July 2010 |
Impact | Select one of: >80%, >50%, >20%, <20% |
Duration of Outage | Hours e.g. 3 hours |
Status | Select one from Draft, Open, Understood, Closed |
Root Cause | Select one from Unknown, Software Bug, Hardware, Configuration Error, Human Error, Network, User Load |
Data Loss | Yes/No |