RAL Tier1 Incident 20170818 first Echo data loss



Description:

Echo is the Ceph-based storage cluster at the RAL Tier-1 (RAL-LCG2). On Friday 18/08/2017, following the removal of osd.1290 for disk read errors, placement group 1.138 in the erasure-coded 'atlas' pool went down and could not be peered. Despite a week of recovery attempts, including an upgrade of the cluster to Ceph Kraken 11.2.1, the PG could not be recovered intact, and on Friday 25/08/2017 it was recreated by means destructive to its data. This is the first data loss on Echo.

Impact

The affected service is Echo, specifically the 'atlas' pool, so ATLAS is the impacted VO. Approximately 23,000 ATLAS files stored in PG 1.138 were lost: this incident involved data loss. The incident ran from Friday 18/08/2017 to Friday 25/08/2017. During that period the Echo endpoints were set to warning (22/08 10:12 to 23/08 10:03), gateway services (xrootd, gridftp, ceph-radosgw) were stopped on 24/08, cluster IO was paused from 24/08, and XRootD on the worker nodes was stopped on 25/08.


Timeline of the Incident

When What

Fri 18/08 unknown time 1 BC removed osd.1290 (PG 1.138 primary) as SOP for read errors on disks.

Fri 18/08 unknown time 2 BC lowered the min_size setting for the 'atlas' pool on Echo to 8.
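
For reference, removing a failing OSD and lowering a pool's min_size are standard Ceph CLI operations. The sketch below (wrapped in Python for consistency with the other examples on this page) is illustrative only and not a record of the exact commands run:

    # Illustrative sketch: standard removal of a failing OSD and a pool
    # min_size change via the stock Ceph CLI. The OSD id and pool name are
    # taken from the timeline above.
    import subprocess

    def ceph(*args):
        subprocess.run(["ceph", *args], check=True)

    # Take osd.1290 out of data placement and remove it from the cluster
    # (the systemctl call runs on the OSD's host).
    ceph("osd", "out", "1290")
    subprocess.run(["systemctl", "stop", "ceph-osd@1290"], check=True)
    ceph("osd", "crush", "remove", "osd.1290")
    ceph("auth", "del", "osd.1290")
    ceph("osd", "rm", "1290")

    # Allow the erasure-coded 'atlas' pool to serve IO with only 8 shards up.
    ceph("osd", "pool", "set", "atlas", "min_size", "8")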

Fri 18/08 unknown time 3 BC set noscrub, nodeep-scrub on Echo to get inconsistent PGs to start repairing sooner.
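
Both scrub flags are cluster-wide settings; a minimal sketch of setting them (standard Ceph CLI, wrapped in Python):

    # Illustrative sketch: disable scrubbing cluster-wide so that OSD cycles
    # go to repairing the inconsistent PGs rather than to new scrub work.
    import subprocess

    for flag in ("noscrub", "nodeep-scrub"):
        subprocess.run(["ceph", "osd", "set", flag], check=True)

    # The flags are cleared again with "ceph osd unset <flag>" once repairs
    # have caught up.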

Sat 19/08 11:44 BC called GV with PG 1.138 already down. We discussed and decided to upgrade Ceph to 11.2.1-0 (see RT #194003).

Mon 21/08 10:39 GV started testing the Kraken upgrade on dev.

Mon 21/08 11:06 GV confirmed successful upgrade of dev cluster to ceph version 11.2.1-0.

Mon 21/08 15:00 GV set nodown and norecover to prevent OSDs in problem PG from flapping.

Mon 21/08 16:20 GV removed the nodown and norecover flags.
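
Both are ordinary cluster flags; a minimal sketch of setting and later clearing them (standard Ceph CLI, wrapped in Python):

    # Illustrative sketch: nodown stops flapping OSDs from being repeatedly
    # marked down, norecover stops recovery IO while the problem PG is being
    # examined.
    import subprocess

    def ceph(*args):
        subprocess.run(["ceph", *args], check=True)

    for flag in ("nodown", "norecover"):
        ceph("osd", "set", flag)

    # ... diagnosis of PG 1.138 ...

    for flag in ("nodown", "norecover"):
        ceph("osd", "unset", flag)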

Tue 22/08 10:12 JK set Echo endpoints to warning until 23/08 10:03. Same time window for Nagios downtime on Echo health and functional checks.

Tue 22/08 11:22 GV removed osd.1632 (PG 1.138 new primary - unsure about this) and re-introduced OSDs previously removed by BC (1290, 1713). This was done to try and provide Ceph with enough information to peer the PG.
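
How the OSDs were re-introduced depends on how far their earlier removal had gone. If they had only been stopped and marked out, bringing them back looks like the sketch below; if they had been fully purged from the CRUSH map they would instead need redeploying:

    # Illustrative sketch: restart previously stopped OSD daemons on their
    # hosts and mark them back "in" so they can take part in peering PG 1.138.
    import subprocess

    for osd_id in ("1290", "1713"):
        # Run on the OSD's host.
        subprocess.run(["systemctl", "start", f"ceph-osd@{osd_id}"], check=True)
        subprocess.run(["ceph", "osd", "in", osd_id], check=True)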

Tue 22/08 14:54 Echo MONs and any OSDs belonging to PG 1.138 running 11.2.1.

Tue 22/08 17:20 GV set online options to prioritise the recovery/backfill operations that needed to complete in order to advance diagnosis and resolution of the PG 1.138 problem.
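
The report does not list the exact options changed; settings of this kind are normally injected into the OSDs at runtime, along the lines of the sketch below (the option values shown are illustrative, not the ones used on Echo):

    # Illustrative sketch: raise recovery/backfill limits at runtime so the
    # operations blocking diagnosis of PG 1.138 complete sooner.
    import subprocess

    subprocess.run([
        "ceph", "tell", "osd.*", "injectargs",
        "--osd-max-backfills 4 --osd-recovery-max-active 8 --osd-recovery-op-priority 63",
    ], check=True)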

Tue 22/08 17:46 Post-upgrade, the first three OSDs [1290, 927, 672] of PG 1.138 started flapping again. They were stopped and marked out to prevent peering storms in the cluster.
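
Stopping and marking out a flapping OSD is routine; a minimal sketch (the daemon stop runs on each OSD's host):

    # Illustrative sketch: stop the flapping OSD daemons and mark them out so
    # the cluster stops trying to peer through them (avoiding peering storms).
    import subprocess

    for osd_id in ("1290", "927", "672"):
        subprocess.run(["systemctl", "stop", f"ceph-osd@{osd_id}"], check=True)
        subprocess.run(["ceph", "osd", "out", osd_id], check=True)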

Wed 23/08 Day was mostly spent researching the problem, waiting on mailing list responses and weighing the options.

Thu 24/08 11:41 GV disabled the first three OSDs for PG 1.138 so they wouldn't come back upon cluster restart.
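
Disabling the systemd units keeps the daemons from being started automatically when their host, or the cluster's OSD target, is restarted; a sketch:

    # Illustrative sketch: disable (but do not remove) the three OSDs of
    # PG 1.138 so the forthcoming cluster-wide restart does not bring them
    # back. Run on each OSD's host.
    import subprocess

    for osd_id in ("1290", "927", "672"):
        subprocess.run(["systemctl", "disable", f"ceph-osd@{osd_id}"], check=True)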

Thu 24/08 11:43 GV paused Echo IO to allow for faster recovery.
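
Pausing IO is a single cluster-wide operation; a minimal sketch:

    # Illustrative sketch: pause all client reads and writes on the cluster so
    # that OSD cycles go entirely to recovery. Reverted with "ceph osd unpause".
    import subprocess

    subprocess.run(["ceph", "osd", "pause"], check=True)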

Thu 24/08 11:46 GV began implementing the cluster restart procedure for the 11.2.1 update before attempting PG 1.138 resolution.

Thu 24/08 12:01 GV restarted all MONs and OSDs on Echo to pick up Kraken 11.2.1.
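
A rolling restart to pick up the new packages is done per host. Assuming the standard systemd units, a sketch of what runs on each node (MON nodes first, then the storage nodes):

    # Illustrative sketch: restart the Ceph daemons on one node so they start
    # running the newly installed 11.2.1 packages. Repeated host by host.
    import subprocess

    def restart(unit):
        # Run on the node that hosts the daemons in question.
        subprocess.run(["systemctl", "restart", unit], check=True)

    restart("ceph-mon.target")   # on each MON node, one at a time
    restart("ceph-osd.target")   # then on each storage node, one at a time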

Thu 24/08 12:15 BC stopped and disabled all xrootd, gridftp and ceph-radosgw processes on Echo gateways to prevent IO requests from reaching Echo.
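
The exact unit names on the gateways are not given in this report; assuming typical names (an xrootd@<instance> unit, globus-gridftp-server and the ceph-radosgw target), a sketch of stopping and disabling them on one gateway:

    # Illustrative sketch: stop and disable the data-access services on one
    # gateway host so no client IO reaches the Ceph cluster. The unit names
    # below are assumptions, not taken from the incident report.
    import subprocess

    for unit in ("xrootd@echo", "globus-gridftp-server", "ceph-radosgw.target"):
        subprocess.run(["systemctl", "stop", unit], check=True)
        subprocess.run(["systemctl", "disable", unit], check=True)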

Thu 24/08 13:20 All of Echo (except gateways) running 11.2.1.

Thu 24/08 13:37 GV removed first 3 OSDs [1290, 927, 672] of PG 1.138.

Thu 24/08 17:27 GV re-introduced the removed OSDs [1290, 927, 672] after another fruitless attempt at resolution. The aim was to return the cluster to a state as close as possible to the one in which the problem occurred.

Fri 25/08 13:39 AL stopped XRootD on the WNs.

Fri 25/08 unknown times It had been decided to accept the loss of the ~23k ATLAS files and to attempt recovery of the PG by means destructive to its data. GV stopped PG 1.138's acting set [1290 672 927 456 177 1094 194 1513 236 302 1326]. OSDs 1290, 672, 927 and 1513 had corrupted data, so GV started by manually removing the PG from those OSDs' datastores, in the hope that the broken PG data could be recreated by EC computation. However, with 4 of the 11 shards corrupted there was more damage than the erasure code can reconstruct, so there was not enough data to proceed and the PG went remapped+incomplete. A failed attempt was made to force-create the PG anew. The manual removal operation was then performed on all OSDs in the set, which still did not suffice: the PG went into the creating state and blocked there. PG query output indicated history-related issues, so a runtime argument was injected into the relevant OSDs to ignore history. The primary was then restarted and the PG went active+clean.
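
The individual steps above map onto standard Ceph tooling. The sketch below is an after-the-fact illustration only: it assumes the usual FileStore datastore paths, uses osd_find_best_info_ignore_history_les as the "ignore history" runtime option (the report does not name the exact option), and stands in osd.456 for the acting primary at that point. None of these specifics are confirmed by the report.

    # Illustrative sketch of the destructive recovery of PG 1.138. All names,
    # paths and the choice of runtime option are assumptions; this is not a
    # transcript of the commands run on Echo.
    import subprocess

    def run(*cmd):
        subprocess.run(list(cmd), check=True)

    acting_set = ["1290", "672", "927", "456", "177", "1094",
                  "194", "1513", "236", "302", "1326"]
    corrupted = ["1290", "672", "927", "1513"]

    # 1. Stop every OSD in the PG's acting set (on each OSD's host).
    for osd in acting_set:
        run("systemctl", "stop", f"ceph-osd@{osd}")

    # 2. Remove the PG shard from the corrupted OSDs' datastores (later
    #    repeated on all OSDs in the set). For an EC pool the pgid carries a
    #    shard suffix, e.g. 1.138s3, which is omitted here for brevity.
    for osd in corrupted:
        run("ceph-objectstore-tool",
            "--data-path", f"/var/lib/ceph/osd/ceph-{osd}",
            "--pgid", "1.138", "--op", "remove")

    # 3. Restart the surviving OSDs so the cluster can try to rebuild the
    #    missing shards by EC computation (this failed: too many shards gone).
    for osd in acting_set:
        if osd not in corrupted:
            run("systemctl", "start", f"ceph-osd@{osd}")

    # 4. Failed attempt to recreate the PG from scratch (Kraken-era command).
    run("ceph", "pg", "force_create_pg", "1.138")

    # 5. Tell the relevant OSDs to ignore the PG's history when choosing the
    #    authoritative info (assumed option), then restart the primary.
    run("ceph", "tell", "osd.456", "injectargs",
        "--osd_find_best_info_ignore_history_les=true")
    run("systemctl", "restart", "ceph-osd@456")

    # 6. Check the PG state; the aim is active+clean.
    run("ceph", "pg", "1.138", "query")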



Incident details

Put a reasonably detailed description of the incident here.


Analysis

This section to include a breakdown of what happened. Include any related issues.


Follow Up

This is what we used to call future mitigation. Include specific points to be done. It is not necessary to use the table below, but may be easier to do so.


Issue Response Done
Issue 1 Mitigation for issue 1. Done yes/no
Issue 2 Mitigation for issue 2. Done yes/no

Related issues

List any related issue and provide links if possible. If there are none then remove this section.


Reported by: Your Name at date/time

Summary Table

Start Date 18 August 2017
Impact Select one of: >80%, >50%, >20%, <20%
Duration of Outage Hours e.g. 3hours
Status select one from Draft, Open, Understood, Closed
Root Cause Select one from Unknown, Software Bug, Hardware, Configuration Error, Human Error, Network, User Load
Data Loss Yes