RAL Tier1 Incident 20150417 ElasticTape truncation of input tarballs
From GridPP Wiki
Revision as of 13:54, 24 April 2015 by Rob Appleyard
RAL-LCG2 Incident 20150417 ElasticTape truncation of input tarballs
ElasticTape was found to be truncating the tar files it uses to encapsulate CEDA user data. It was therefore necessary to recall and unpack everything stored in this system (~500TB) and examine each tar file to determine which files had been lost or damaged.
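A check of this kind can be sketched with Python's standard `tarfile` module. This is an illustrative reconstruction under assumptions, not the tool actually used during the incident; the `check_tarball` helper and its return shape are hypothetical.

```python
import tarfile

def check_tarball(path):
    """Classify the members of a tar archive as intact or damaged.

    Hypothetical sketch: reading past the end of a truncated archive
    raises tarfile.ReadError (or EOFError), which we treat as damage.
    Returns (good, damaged) lists of member names.
    """
    good, damaged = [], []
    try:
        with tarfile.open(path, mode="r:*") as tar:
            for member in tar:
                try:
                    f = tar.extractfile(member)
                    if f is not None:
                        # Read the whole member; a short read means the
                        # archive was cut off partway through this file.
                        data = f.read()
                        if len(data) != member.size:
                            damaged.append(member.name)
                            continue
                    good.append(member.name)
                except (tarfile.ReadError, EOFError, OSError):
                    # Once truncation is hit, later members are unreliable.
                    damaged.append(member.name)
                    break
    except (tarfile.ReadError, EOFError) as exc:
        damaged.append("%s: unreadable (%s)" % (path, exc))
    return good, damaged
```

At the scale involved (~500TB) a real scan would be parallelised across the recall disk pool, but the per-archive logic is essentially this.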
Impact
Extensive loss/corruption of CEDA data stored in the Facilities CASTOR instance.
Timeline of the Incident
When | What |
---|---|
16/04/2015 ~17:00 | Tier 1 first informed of the problem; Andrew S emails a notice to interested SCD parties. |
17/04/2015 09:30 | Ad-hoc meeting in the Fabric team area; a plan to recall all potentially affected data from tape to disk is formed. |
17/04/2015 10:00-19:00 | Preparation for the recovery operation proceeds:
* Spare Facilities hardware is made available to be added to an already-extant CEDA disk pool.
** This hardware is then deployed into CASTOR.
** The nodes are configured so that CASTOR uses only 2 of their 3 partitions, leaving a large on-node working area.
* A mapping of files to tapes is created to minimise time spent remounting tapes.
* A tool for translating internal CASTOR filenames to name server filenames is created, making it possible to identify the CASTOR files at risk. |
17/04/2015 16:45 (approx) | Recall of at-risk data from tape to the cedaRetrieve disk pool started. The disk pool was, at this point, too small to hold all the data that needed to be recovered, but the operation was expected to take all weekend and more hardware was being prepared. A data rate of a little below 2GB/s into CASTOR was achieved. |
17/04/2015 19:00 | Final hardware deployed into cedaRetrieve. |
21/04/2015 02:00 | All data recalled from tape. |
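The file-to-tape mapping step in the timeline above can be sketched as follows: group the at-risk files by the tape they live on, so each tape is mounted once and the fullest tapes are handled first. This is a hypothetical illustration (the `file_to_tape` input and `plan_recalls` helper are invented names); the real mapping was derived from CASTOR's own databases.

```python
from collections import defaultdict

def plan_recalls(file_to_tape):
    """Group files by the tape they live on, so each tape is mounted once.

    file_to_tape: dict mapping a file path to a tape label (hypothetical
    input format). Returns a list of (tape, [files]) pairs, with the
    tapes holding the most at-risk files first, to get the bulk of the
    data moving as early as possible.
    """
    by_tape = defaultdict(list)
    for path, tape in file_to_tape.items():
        by_tape[tape].append(path)
    # Mount the fullest tapes first; each mount is expensive, so the
    # ordering within a tape matters less than avoiding remounts.
    return sorted(by_tape.items(), key=lambda kv: len(kv[1]), reverse=True)
```

With a mapping like this, a recall queue can be drained tape by tape instead of file by file, which is what kept the remount overhead low enough to sustain the ~2GB/s recall rate mentioned above.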
Incident details
Analysis
This section to include a breakdown of what happened. Include any related issues.
Follow Up
This is what we used to call future mitigation. Include specific points to be done. It is not necessary to use the table below, but may be easier to do so.
Issue | Response | Done |
---|---|---|
Issue 1 | Mitigation for issue 1. | Done yes/no |
Issue 2 | Mitigation for issue 2. | Done yes/no |
Related issues
List any related issue and provide links if possible. If there are none then remove this section.
Reported by: Your Name at date/time
Summary Table
Start Date | 16th April 2015 |
Impact | Select one of: >80%, >50%, >20%, <20% |
Duration of Outage | 2 x 3-hour outages, 6 hours in total |
Status | Draft |
Root Cause | Select one from Unknown, Software Bug, Hardware, Configuration Error, Human Error, Network, User Load |
Data Loss | Yes |