RAL Tier1 Incident 20170818 first Echo data loss

From GridPP Wiki

Data Loss from RAL-LCG2-ECHO


On August 18th, PG 1.138 on the ATLAS pool ran into a Ceph bug related to backfilling erasure-coded (EC) pools. The PG was backfilling as part of the data re-balancing that followed the introduction of the last batch of '15 generation storage hardware. The bug was triggered by the primary OSD attempting to move a corrupted file. This caused the first three OSDs in the set to crash and start flapping until systemd gave up trying to restart them. The PG remained in a down state and could not serve IO, so any transfer of a file with a stripe placed in that PG failed. After a week of trying to return the PG to an operational state with no success, it was decided to write off the PG and its data (23k ATLAS files) and recreate it to make the service fully available again.
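A PG stuck down like this is normally confirmed from the monitors before any intervention. A minimal diagnosis sketch using standard Ceph CLI commands (the PG id 1.138 is taken from this incident; output formats vary by release):

```shell
# Confirm the PG is down and find its acting set of OSDs
ceph health detail | grep -A2 'pg 1.138'
ceph pg map 1.138

# Full peering state, including blocked-by OSDs and past-interval history
ceph pg 1.138 query
```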


Service: Echo
VOs affected: ATLAS
Data Loss: Yes - 23k files.

Timeline of the Incident

When What
Fri 18/08 BC removed osd.1290 (PG 1.138 primary) per the SOP for read errors on disks.
Fri 18/08 unknown time 2 BC lowered the min_size setting for the 'atlas' pool on Echo to 8.
Fri 18/08 unknown time 3 BC set noscrub, nodeep-scrub on Echo to get inconsistent PGs to start repairing sooner.
Sat 19/08 11:44 BC called GV with PG 1.138 already down. We discussed and decided to upgrade Ceph to 11.2.1-0 (see RT #194003).
Mon 21/08 10:39 GV started testing the Kraken upgrade on dev.
Mon 21/08 11:06 GV confirmed successful upgrade of dev cluster to ceph version 11.2.1-0.
Mon 21/08 15:00 GV set nodown and norecover to prevent OSDs in problem PG from flapping.
Mon 21/08 16:20 GV removed nodown and norecover flags.
Tue 22/08 10:12 JK set Echo endpoints to warning until 23/08 10:03. The same time window was used for Nagios downtime on Echo health and functional checks.
Tue 22/08 11:22 GV removed osd.1632 (PG 1.138 new primary - unsure about this) and re-introduced the OSDs previously removed by BC (1290, 1713). This was done to try to provide Ceph with enough information to peer the PG.
Tue 22/08 14:54 Echo MONs and any OSDs belonging to PG 1.138 running 11.2.1.
Tue 22/08 17:20 GV set online options to prioritise recovery/backfill operations that needed to complete in order to advance diagnosis and resolution of the PG 1.138 problem.
Tue 22/08 17:46 Post-upgrade, the first three OSDs [1290, 927, 672] of PG 1.138 started flapping again. They were stopped and marked out to prevent peering storms in the cluster.
Wed 23/08 Day was mostly spent researching the problem, waiting on mailing list responses and weighing the options.
Thu 24/08 11:41 GV disabled first three OSDs for PG 1.138 so they wouldn't come back upon cluster restart.
Thu 24/08 11:43 GV paused Echo IO to allow for faster recovery.
Thu 24/08 11:46 GV began implementing cluster restart procedure for 11.2.1 update before attempting PG 1.138 resolution.
Thu 24/08 12:01 GV restarted all MONs and OSDs on Echo to pick up Kraken 11.2.1.
Thu 24/08 12:15 BC stopped and disabled all xrootd, gridftp and ceph-radosgw processes on Echo gateways to prevent IO requests from reaching Echo.
Thu 24/08 13:20 All of Echo (except gateways) running 11.2.1.
Thu 24/08 13:37 GV removed first 3 OSDs [1290, 927, 672] of PG 1.138.
Thu 24/08 17:27 GV re-introduced removed OSDs [1290, 927, 672] after another fruitless attempt at resolution. The aim of this was to return the cluster to the closest state possible in relation to the problem's occurrence.
Fri 25/08 13:39 AL stopped XRootD on WNs
Fri 25/08 unknown times It had been decided to accept the loss of the 23k ATLAS files and attempt to recover the PG by means destructive to the data. GV stopped PG 1.138's set [1290 672 927 456 177 1094 194 1513 236 302 1326]. OSDs 1290, 672, 927 and 1513 had corrupted data. GV started by manually removing the PG from these OSDs' data stores. The hope was that the broken PG data could be recreated by EC computation; however, because 4 OSDs had corruption there was not enough data to proceed. The PG went remapped+incomplete. A failed attempt was made to force-create the PG anew. The manual removal operation was then performed on all OSDs in the set, which still did not suffice. The PG went into the creating state and blocked there. PG query output indicated history-related issues, so a runtime argument was injected into the relevant OSDs to ignore history. The primary was then restarted and the PG went active+clean.
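The destructive steps in the final timeline entry can be sketched with standard Ceph tooling. The exact commands used are not recorded in this report; the following is an illustrative reconstruction for a Kraken-era cluster (the OSD ids, data path and EC shard suffix are examples, and `osd_find_best_info_ignore_history_les` is an assumption about which history-ignoring option was injected):

```shell
# 1. With an affected OSD stopped, remove the PG shard from its data store
#    (destructive - the shard's data is written off):
systemctl stop ceph-osd@1290
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1290 \
    --op remove --pgid 1.138s0

# 2. Attempt to force-create the PG anew (pre-Luminous command form):
ceph pg force_create_pg 1.138

# 3. Tell the surviving OSDs in the set to ignore PG history when peering,
#    then restart the primary so the PG can go active+clean:
ceph tell osd.456 injectargs '--osd_find_best_info_ignore_history_les=true'
systemctl restart ceph-osd@456
```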

Since the Placement Group was recreated and the data declared lost, there have been a number of residual issues while attempting to return the cluster to normal operational status.

When What
Thu 31/08 16:19 TB set the CRUSH weight of OSDs 1713 and 726 to 0 for replacement due to disk errors.
Thu 31/08 16:48 OSDs started flapping due to 1713 and 726 participating in the backfill of PGs 1.156 and 1.652 and encountering read errors. Both PGs quickly became incomplete.
Fri 01/09 09:30 The bug which caused read errors to crash the primary was discovered while looking through the logs of osd.17 and osd.1713.
Fri 01/09 12:00 TB removed problem OSDs that had either caused crashes or were reporting many disk errors, and restarted all the crashed OSDs. All PGs became active; any newly crashing OSDs were diagnosed and the offending OSD removed.
Fri 01/09 17:13 Gateways were re-enabled (external and internal)
Fri 01/09 22:38 Another incomplete PG due to the bug occurred; it was dealt with by TB with minimal disruption to service.
Sat 02/09 10:00 Backfilling finished, the cluster was back to a healthy state
Tues 05/09 13:35 TB removed OSDs with new disk errors, which triggered backfilling and a new crashing OSD; again dealt with without disruption.
Tues 19/09 15:00 During disk removal by TB, the primary started flapping due to the EC backfilling bug; dealt with by removing the problem OSD, with no disruption.
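Several of the entries above follow the same pattern: a failing disk's OSD is drained and removed before it can poison a backfill. A sketch of that removal sequence using standard Ceph commands (osd.1713 from this incident is used as the example):

```shell
# Stop new data being placed on the failing OSD
ceph osd crush reweight osd.1713 0

# Mark it out so its PGs remap elsewhere, then stop the daemon
ceph osd out 1713
systemctl stop ceph-osd@1713

# Remove it from CRUSH, delete its auth key, and drop the OSD id
ceph osd crush remove osd.1713
ceph auth del osd.1713
ceph osd rm 1713
```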

Incident details



Replacing failing disk drives

Adding New Hardware

At the start of August the remaining 25 '15 generation disk servers that had not been put into production in March were added to the cluster. Unlike Castor, Ceph automatically balances data across the servers. If disk servers are added one at a time, the rebalancing load is focused on a single node, so the data rebalancing either proceeds very slowly or places significant load on that machine. It also results in significantly more rebalancing overall, as data is moved between the new disk servers as each is deployed.
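One common way to add a batch of servers while keeping the rebalancing load spread across all of them is to bring the new OSDs in at zero CRUSH weight and ramp them up together in steps. A hedged sketch (the OSD ids, weight step and final weight are illustrative; the final weight normally matches drive capacity in TiB):

```shell
# Ramp the new OSDs up together in 0.5 increments to an example final
# weight of 5.5, letting the cluster settle between steps.
for weight in 0.5 1.5 2.5 3.5 4.5 5.5; do
    for osd in 2000 2001 2002; do      # example OSD ids on the new servers
        ceph osd crush reweight osd.$osd $weight
    done
    # wait until backfill from this step completes before the next one
    while ceph -s | grep -q backfill; do sleep 60; done
done
```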

Follow Up



Related issues

Disk replacement

  • Introduction of new SNs, particularly ones that have been idling for a while.

Reported by: Your Name at date/time

Summary Table

Start Date 18 August 2017
Impact Select one of: >80%, >50%, >20%, <20%
Duration of Outage Hours e.g. 3hours
Status select one from Draft, Open, Understood, Closed
Root Cause Software Bug
Data Loss Yes