RAL Tier1 Incident 20170818 first Echo data loss

Revision as of 13:58, 21 September 2017

Data Loss from RAL-LCG2-ECHO

Description:

On August 18th, PG 1.138 on the ATLAS pool ran into a Ceph bug related to backfilling EC pools. The PG was being backfilled as part of the data re-balancing that followed the introduction of the last batch of '15 generation storage hardware. The bug was triggered by the primary OSD attempting to move a corrupted file. This caused the first three OSDs in the set to crash and start flapping until systemd gave up trying to restart them. The PG remained in a down state and could not serve IO directed at it, resulting in the failure of any transfer of a file with a stripe placed in that PG. After trying to restore the PG to an operational state for a week with no success, it was decided to write off the PG and its data (23k ATLAS files) and recreate it to make the service fully available again.
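
The report does not record the exact diagnostic commands that were run; the following is a minimal sketch of how a down PG of this kind is typically inspected, with the PG id and OSD ids taken from the description above.

  # Illustrative only: inspecting the problem PG and its OSDs.
  ceph health detail        # lists down/incomplete/inconsistent PGs
  ceph pg 1.138 query       # peering state, acting/up sets, recovery information
  ceph pg map 1.138         # which OSDs the PG currently maps to
  ceph osd tree | grep -E 'osd\.(1290|927|672)'   # state of the first three OSDs in the set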

Impact

Service: Echo
VOs affected: ATLAS
Data Loss: Yes - 23k files.


Timeline of the Incident

When What
Fri 18/08 overnight into Saturday Multiple call-outs due to inconsistent PGs. BC identified and removed osd.1290 (PG 1.138 primary) as per the SOP for read errors on disks. BC lowered the min_size setting for the 'atlas' pool on Echo to 8. BC set noscrub, nodeep-scrub on Echo to get inconsistent PGs to start repairing sooner. Stopped the gentle-reweight script from running.
Sat 19/08 11:44 BC called GV with PG 1.138 already down. We discussed and decided to upgrade Ceph to 11.2.1-0 (see RT #194003). Confirmed with AD via email.
Mon 21/08 10:39 GV started testing the Kraken upgrade on dev. GV confirmed successful upgrade of dev cluster to ceph version 11.2.1-0.
Mon 21/08 15:00 GV set nodown and norecover to prevent OSDs in the problem PG from flapping, to stabilise the cluster and allow ongoing backfilling to complete. GV removed the nodown and norecover flags once this had happened.
Tue 22/08 10:12 JK set Echo endpoints to warning until 23/08 10:03. Same time window for Nagios downtime on Echo health and functional checks.
Tue 22/08 11:22 GV removed osd.1632, which was flapping, and re-introduced OSDs previously removed by BC (1290, 1713). This was done to try to provide Ceph with enough information to peer the PG.
Tue 22/08 14:54 Echo MONs and any OSDs belonging to PG 1.138 upgraded to version 11.2.1-0.
Tue 22/08 17:20 GV set configuration options to prioritise the recovery/backfill operations that needed to complete in order to advance diagnosis and resolution of the PG 1.138 problem.
Tue 22/08 17:46 Post-upgrade, the first three OSDs [1290, 927, 672] of PG 1.138 started flapping again with a more relevant error message but same underlying problem. They were stopped and marked out to prevent peering storms in the cluster.
Wed 23/08 AP submitted a consultancy request. GV emailed the ceph-users mailing list with details of the problem and also requested help via the IRC channel.
Thu 24/08 10:23 A downtime of the entire Echo service for 24 hours was declared in GOCDB.
Thu 24/08 11:41 The rest of the cluster was upgraded to version 11.2.1-0. GV disabled the first three OSDs for PG 1.138 so they would not come back upon cluster restart. GV paused Echo IO to allow for faster recovery. GV restarted all MONs and OSDs on Echo to pick up Kraken 11.2.1-0.
Thu 24/08 12:15 BC stopped and disabled all xrootd, gridftp and ceph-radosgw processes on Echo gateways to prevent IO requests from reaching Echo.
Thu 24/08 13:20 All of Echo (except gateways) running 11.2.1.
Thu 24/08 13:37 GV removed first 3 OSDs [1290, 927, 672] of PG 1.138 as they were crashing.
Thu 24/08 17:27 GV re-introduced the removed OSDs [1290, 927, 672] as their removal had not resolved the problem. The aim was to return the cluster to a state as close as possible to the one in which the problem occurred.
Fri 25/08 11:00 Phone call with Ceph consultant Wido. He verified the process that GV had gone through and the patching that had been applied. A forced recreation of the PG was discussed.
Fri 25/08 13:39 AL stopped XRootD on the WNs.
Fri 25/08 It had been decided to accept the loss of 23k ATLAS files and attempt to recover the PG by means destructive to the data. GV stopped PG 1.138's set [1290 672 927 456 177 1094 194 1513 236 302 1326]. OSDs 1290, 672, 927 and 1513 had corrupted data. GV started by manually removing the PG from these OSDs' datastores, in the hope that the broken PG data could be recreated by EC computation; however, because 4 OSDs had corruption, there was not enough data to proceed. The PG went remapped+incomplete. A failed attempt was made to force-create the PG anew. The manual removal operation was then performed on all OSDs in the set, which still did not suffice. The PG went into the creating state and blocked there. PG query output indicated history-related issues, so a runtime argument was injected into the relevant OSDs to ignore history. The primary was then restarted and the PG went active+clean.
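
The exact commands, paths and option names used in the procedure above are not recorded in this report; the following is an illustrative sketch of the kind of operations involved, using one OSD from the set as an example and assuming default filestore paths.

  # Illustrative sketch only -- not the exact commands used during the incident.
  # Remove PG 1.138's data from one of the OSDs holding a corrupted shard.
  systemctl stop ceph-osd@1290
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1290 \
      --journal-path /var/lib/ceph/osd/ceph-1290/journal \
      --pgid 1.138 --op remove
  # Ask Ceph to recreate the PG (Kraken-era command).
  ceph pg force_create_pg 1.138
  # If the PG blocks on peering history, a runtime option can be injected into
  # the relevant OSDs; the report does not name the option actually used, but
  # osd_find_best_info_ignore_history_les is the usual candidate.
  ceph tell osd.1290 injectargs '--osd_find_best_info_ignore_history_les=true'
  systemctl restart ceph-osd@1290   # restart the primary so the PG can re-peer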

Since the Placement Group was recreated and the data declared lost, there have been a number of residual issues while attempting to return the cluster to normal operational status. Over the long August Bank Holiday weekend, Echo had scrubbing turned off and the no-out flag set; both scrubbing and automatic out-marking are essential to the long-term health of the cluster.
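
As a sketch of the flags involved (illustrative only, not a record of the commands actually run), scrubbing and automatic out-marking are controlled by cluster-wide OSD flags:

  # Flags referred to in this report and how they are set, checked and cleared.
  ceph osd set noout          # stop OSDs being marked out automatically
  ceph osd set noscrub        # disable shallow scrubbing
  ceph osd set nodeep-scrub   # disable deep scrubbing
  ceph -s                     # the 'flags' line shows which flags are currently set
  ceph osd unset noout        # re-enable once the cluster is stable
  ceph osd unset noscrub
  ceph osd unset nodeep-scrub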

When What
Wed 30/08 Removed no-out and upgraded the gateways to 11.2.1-0. Scrubbing was not turned back on (on the previous Friday it had been agreed that it should be switched back on on the Wednesday).
Thu 31/08 16:19 TB set the CRUSH weight of OSDs 1713 and 726 to 0 for replacement due to disk errors. This was not the normal way to remove OSDs from the cluster and in hindsight was an error (see the removal sketch after this table). BC turned scrubbing back on.
Thu 31/08 16:48 OSDs started flapping, due to 1713 and 726 participating in the backfill of PG 1.156 and 1.652 and encountering read errors. Both PGs quickly became incomplete.
Fri 01/09 09:30 The bug which caused read errors to crash the primary was discovered while looking through the logs of osd.17 and 1713.
Fri 01/09 12:00 TB removed the problem OSDs that had either caused crashes or were reporting many disk errors, and restarted all the crashed OSDs. All PGs became active, and any new crashing OSDs were diagnosed and the offending OSD removed.
Fri 01/09 17:13 Gateways were re-enabled (external and internal).
Fri 01/09 22:38 Another PG became incomplete due to the bug; this was dealt with by TB with minimal disruption to the service.
Sat 02/09 10:00 Backfilling finished and the cluster was back to a healthy state.
Tues 05/09 13:35 TB removed OSDs with new disk errors, which triggered backfilling and a new crashing OSD; again this was dealt with without disruption.
Tues 19/09 15:00 During disk removal by TB, the primary started flapping due to the EC backfilling bug; this was dealt with by removing the problem OSD, with no disruption.
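
The table above notes that setting CRUSH weights to 0 was not the normal way to remove failing OSDs, and that the reweighted OSDs then participated in backfill and hit read errors. As an illustration only (the site's documented SOP is not reproduced in this report), a more conventional removal sequence is:

  # Illustrative sketch of a conventional OSD removal, not the site SOP.
  ceph osd out 1713               # stop new data being mapped to the OSD
  # wait for the resulting rebalancing to complete, then:
  systemctl stop ceph-osd@1713
  ceph osd crush remove osd.1713  # remove the OSD from the CRUSH map
  ceph auth del osd.1713          # remove its authentication key
  ceph osd rm 1713                # remove the OSD from the cluster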

Incident details

Put a reasonably detailed description of the incident here.


Analysis

Follow Up

This is what we used to call future mitigation. Include specific points to be done. It is not necessary to use the table below, but may be easier to do so.


Issue Response Done
Issue 1 Mitigation for issue 1. Done yes/no
Issue 2 Mitigation for issue 2. Done yes/no

Related issues

Adding New Hardware

At the start of August, the remaining 25 disk servers of the '15 generation that had not been put into production in March were added to the cluster. Unlike Castor, Ceph automatically balances data across the servers. If disk servers are added one at a time, the rebalancing load is focused on a single node, so the rebalancing either proceeds very slowly or places significant load on that machine. It also results in significantly more rebalancing taking place, as data is moved between the new disk servers as each one is deployed.
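
The gentle-reweight script mentioned in the timeline addresses this by bringing new OSDs up to their full CRUSH weight in steps rather than all at once. A minimal sketch of the idea follows; the OSD ids, target weight and step sizes are assumptions, and the production script is not reproduced in this report.

  # Illustrative sketch of gradual ("gentle") CRUSH reweighting of new OSDs.
  TARGET="5.458"                               # assumed final weight for a ~6 TB drive
  for weight in 1.0 2.0 3.0 4.0 "$TARGET"; do  # assumed increments
      for osd in 1800 1801 1802; do            # hypothetical new OSD ids
          ceph osd crush reweight "osd.${osd}" "$weight"
      done
      # let the resulting backfill finish before taking the next step
      while ceph health | grep -q backfill; do
          sleep 600
      done
  done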


Disk replacement

Reported by: Your Name at date/time

Summary Table

Start Date Date e.g. 20 July 2010
Impact Select one of: >80%, >50%, >20%, <20%
Duration of Outage Hours e.g. 3hours
Status select one from Draft, Open, Understood, Closed
Root Cause Select one from Unknown, Software Bug, Hardware, Configuration Error, Human Error, Network, User Load
Data Loss Yes