RAL Tier1 Incident 20150417 ElasticTape truncation of input tarballs


RAL-LCG2 Incident 20150417 ElasticTape truncation of input tarballs

Change control for Castor upgrade: https://helpdesk.gridpp.rl.ac.uk/Ticket/Display.html?id=148453

Castor upgrade procedure: https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/CastorUpgradeTo211415

RT ticket tracking upgrade: https://helpdesk.gridpp.rl.ac.uk/Ticket/Display.html?id=149684

The Castor team were aware of the network intervention but did not include it in their plan as it was believed to be minor.


Impact

Extensive loss/corruption of CEDA data stored in Facilities CASTOR instance

Timeline of the Incident

When What
16/04/2015 ~17:00 Tier 1 first informed of trouble; Andrew S emails a notice round to interested SCD parties.
17/04/2015 09:30 Ad-hoc meeting in Fabric team area, a plan to recall all potentially affected data from tape to disk is formed.
17/04/2015 10:00-19:00 Preparation for recovery operation proceeds.
* Spare Facilities hardware is made available to be added to an already-extant CEDA disk pool.
** This hardware is then deployed into CASTOR.
** These are configured such that CASTOR only uses 2 of their 3 partitions, so as to allow a large on-node working area.
* A mapping of files to tapes is created to minimise time spent remounting tapes.
* A tool for mapping internal CASTOR filenames to name server filenames is created so Kevin can identify which file is which.
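
The idea behind the file-to-tape mapping in the timeline can be illustrated with a minimal sketch. This is not the tool that was actually written during the incident, and it does not call any CASTOR command. It assumes, purely for illustration, that the affected files are available as (path, tape ID, file sequence number) tuples; the function name `plan_recalls` and that tuple layout are hypothetical. Grouping by tape and sorting by position on the tape means each tape need only be mounted once and read in a single forward pass.

```python
from collections import defaultdict


def plan_recalls(entries):
    """Group files by tape and order them by their position on that tape.

    entries: iterable of (path, tape_id, fseq) tuples. This layout is an
    assumption made for the example, not a CASTOR data format.
    Returns a list of (tape_id, [paths in on-tape order]), with the tapes
    holding the most affected files first.
    """
    by_tape = defaultdict(list)
    for path, tape_id, fseq in entries:
        by_tape[tape_id].append((fseq, path))

    plan = []
    for tape_id, files in by_tape.items():
        files.sort()  # ascending sequence number = forward read order
        plan.append((tape_id, [path for _, path in files]))

    # Largest batches first; ties are broken by tape ID for a stable order.
    plan.sort(key=lambda item: (-len(item[1]), item[0]))
    return plan


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the incident.
    sample = [
        ("/a", "T2", 5),
        ("/b", "T1", 9),
        ("/c", "T2", 1),
        ("/d", "T2", 3),
    ]
    for tape, paths in plan_recalls(sample):
        print(tape, paths)
```

Ordering tapes by batch size is only one possible choice; ordering by drive availability or by data priority would fit the same structure.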


Incident details

Analysis

This section to include a breakdown of what happened. Include any related issues.


Follow Up

This is what we used to call future mitigation. Include specific points to be done. It is not necessary to use the table below, but may be easier to do so.


Issue Response Done
Issue 1 Mitigation for issue 1. Done yes/no
Issue 2 Mitigation for issue 2. Done yes/no

Related issues

List any related issue and provide links if possible. If there are none then remove this section.


Reported by: Your Name at date/time

Summary Table

Start Date 18th February 2014
Impact Select one of: >80%, >50%, >20%, <20%
Duration of Outage 2 x 3-hour outages, 6 hours in total
Status Draft
Root Cause Select one from Unknown, Software Bug, Hardware, Configuration Error, Human Error, Network, User Load
Data Loss Yes/No