RAL Tier1 Incident 20170818 first Echo data loss

Data Loss from RAL-LCG2-ECHO

Description:

On August 18th, PG 1.138 on the ATLAS pool ran into a Ceph bug related to backfilling EC pools. The PG was in backfill as it was part of the data re-balancing after the introduction of the last batch of '15 generation storage hardware. The bug was triggered by the primary OSD attempting to move a corrupted file. This caused the first three OSDs in the set to crash and start flapping until systemd gave up trying to restart them. The PG remained in a down state and could not serve IO directed at it, resulting in the failure of any transfer of a file with a stripe that would be placed in that PG. After trying for a week, without success, to recover the PG to an operational state, it was decided to write off the PG and its data (23k ATLAS files) and recreate it to make the service fully available again.
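The commands below are a minimal sketch of how such a down placement group can be inspected with the standard Ceph CLI; the PG ID and OSD IDs are taken from this incident, but the exact commands run on Echo are not recorded here.

 # Overall cluster state, including any PGs reported down or incomplete
 ceph -s
 ceph health detail
 
 # Detailed state of the affected PG, including its acting set and peering history
 ceph pg 1.138 query
 
 # Which OSDs the PG maps to under CRUSH
 ceph pg map 1.138
 
 # Up/down status of the OSDs at the start of the PG's set
 ceph osd tree | grep -E 'osd\.(1290|927|672)'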

Impact

Service: Echo
VOs affected: ATLAS
Data Loss: Yes - 23k files


Timeline of the Incident

When What
Fri 18/08 overnight into Saturday Multiple call-outs due to inconsistent PGs. BC identified and removed osd.1290 (PG 1.138 primary) as per the SOP for read errors on disks. BC lowered the min_size setting for the 'atlas' pool on Echo to 8. BC set noscrub and nodeep-scrub on Echo to get the inconsistent PGs to start repairing sooner. The gentle-reweight script was stopped from running.
Sat 19/08 11:44 BC called GV with PG 1.138 already down. We discussed and decided to upgrade Ceph to 11.2.1-0 (see RT #194003). Confirmed with AD via email.
Mon 21/08 10:39 GV started testing the Kraken upgrade on dev. GV confirmed successful upgrade of dev cluster to ceph version 11.2.1-0.
Mon 21/08 15:00 GV set nodown and norecover to prevent OSDs in the problem PG from flapping, in order to stabilise the cluster and allow the ongoing backfilling to complete. GV removed the nodown and norecover flags once this had happened.
Tue 22/08 10:12 JK set Echo endpoints to warning until 23/08 10:03. Same time window for Nagios downtime on Echo health and functional checks.
Tue 22/08 11:22 GV removed osd.1632, which was flapping, and re-introduced the OSDs previously removed by BC (1290, 1713). This was done to try to provide Ceph with enough information to peer the PG.
Tue 22/08 14:54 Echo MONs and any OSDs belonging to PG 1.138 upgraded to version 11.2.1-0.
Tue 22/08 17:20 GV set configuration options to prioritise the recovery/backfill operations that needed to complete in order to advance the diagnosis and resolution of the PG 1.138 problem.
Tue 22/08 17:46 Post-upgrade, the first three OSDs [1290, 927, 672] of PG 1.138 started flapping again with a more relevant error message but the same underlying problem. They were stopped and marked out to prevent peering storms in the cluster.
Wed 23/08 AP submitted a consultancy request. GV emailed the ceph-users mailing list with details of the problem and also requested help via the IRC channel.
Thu 24/08 10:23 Downtime of the entire Echo service for 24 hours was declared in GOCDB.
Thu 24/08 11:41 Rest of cluster was upgraded to version 11.2.1-0. GV disabled first three OSDs for PG 1.138 so they wouldn't come back upon cluster restart. GV paused Echo IO to allow for faster recovery. GV restarted all MONs and OSDs on Echo to pick up Kraken 11.2.1-0.
Thu 24/08 12:15 BC stopped and disabled all xrootd, gridftp and ceph-radosgw processes on Echo gateways to prevent IO requests from reaching Echo.
Thu 24/08 13:20 All of Echo (except gateways) running 11.2.1.
Thu 24/08 13:37 GV removed first 3 OSDs [1290, 927, 672] of PG 1.138 as they were crashing.
Thu 24/08 17:27 GV re-introduced the removed OSDs [1290, 927, 672] as removing them had not resolved the problem. The aim was to return the cluster to the state closest to that at the time the problem occurred.
Fri 25/08 11:00 Phone call with Ceph consultant Wido. He verified the process that GV had gone through. He verified patching. Discussed forced recreate of PG.
Fri 25/08 13:39 AL stopped XRootD on WNs
Fri 25/08 It had been decided to accept the loss of the 23k ATLAS files and attempt to recover the PG by means destructive to the data. GV stopped PG 1.138's set [1290 672 927 456 177 1094 194 1513 236 302 1326]. OSDs 1290, 672, 927 and 1513 had corrupted data. GV started by manually removing the PG from these OSDs' data stores. The hope was that the broken PG data could be recreated by EC computation; however, because 4 OSDs had corruption there was not enough data to proceed. The PG went remapped+incomplete. A failed attempt was made to force-create the PG anew. The manual removal operation was then performed on all OSDs in the set, which still did not suffice. The PG went into the creating state and blocked there. PG query output indicated history-related issues, so a runtime argument was injected into the relevant OSDs to ignore history. The primary was then restarted and the PG went active+clean. (A sketch of the commands involved in this sequence is given after this table.)
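The timeline above refers to cluster flags, pool settings and recovery operations only by name. The commands below are a hedged sketch of the corresponding Ceph CLI for a Kraken (11.2.x) cluster; the flag names, the 'atlas' pool, the PG ID and the osd_find_best_info_ignore_history_les option are taken from this incident, but the exact invocations used on Echo may have differed.

 # Cluster-wide flags used during the incident (cleared again with 'ceph osd unset ...')
 ceph osd set noscrub
 ceph osd set nodeep-scrub
 ceph osd set nodown
 ceph osd set norecover
 ceph osd set pause            # pause all client IO to speed up recovery
 
 # Lower the minimum number of shards required to serve IO on the EC pool
 ceph osd pool set atlas min_size 8
 
 # Destructive recovery of PG 1.138: with the OSD daemon stopped, remove the
 # PG's data from that OSD's object store (repeated per affected OSD;
 # filestore OSDs may also need --journal-path)
 ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1290 --pgid 1.138 --op remove
 
 # Ask the cluster to create the PG afresh (pre-Luminous syntax)
 ceph pg force_create_pg 1.138
 
 # Tell the PG's OSDs to ignore peering history when choosing an authoritative
 # log, so the newly created PG can go active
 ceph tell osd.1290 injectargs '--osd_find_best_info_ignore_history_les=true'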

Since the Placement Group was recreated and the data declared lost, there have been a number of residual issues while attempting to return the cluster to normal operational status. Over the long August Bank Holiday weekend, Echo ran with scrubbing turned off and the noout flag set; both scrubbing and the automatic marking-out of failed OSDs are essential to the long-term health of the cluster.
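For reference, the flags currently in force are shown in the cluster status output and are cleared as below once the cluster is stable; this is standard Ceph CLI rather than anything specific to Echo. The second timeline below then covers the return to normal operations.

 # Flags in force appear in the health/status output
 ceph -s
 
 # Re-enable scrubbing and automatic marking-out of failed OSDs
 ceph osd unset noscrub
 ceph osd unset nodeep-scrub
 ceph osd unset noout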

When What
Wed 30/08 Removed the noout flag and upgraded the gateways to 11.2.1-0. Scrubbing was not turned back on (on the previous Friday it had been agreed that it should be switched back on on the Wednesday).
Thu 31/08 16:19 TB set the CRUSH weight of OSDs 1713 and 726 to 0 ahead of their replacement due to disk errors. This was not the normal way to remove OSDs from the cluster and in hindsight was an error (the standard removal sequence is sketched after this table). BC turned scrubbing back on.
Thu 31/08 16:48 OSDs started flapping due to 1713 and 726 participating in the backfill of PGs 1.156 and 1.652 and encountering read errors. Both PGs quickly became incomplete. BC turned scrubbing back off.
Thu 31/08 18:30 Downtime for the entire Echo service was declared until midday on 1st September; this was subsequently extended to 17:00.
Fri 01/09 09:30 The bug which caused read errors to crash the primary was identified while looking through the logs of osd.17 and osd.1713.
Fri 01/09 12:00 TB removed the problem OSDs that had either caused crashes or were reporting many disk errors, and restarted all the crashed OSDs. All PGs became active again; any newly crashing OSDs were diagnosed and the offending OSD removed.
Fri 01/09 17:00 Gateways were re-enabled (external and internal).
Fri 01/09 22:38 Another PG went incomplete due to the bug; this was dealt with by TB with minimal disruption to the service.
Sat 02/09 10:00 Backfilling away from the recently removed OSDs had finished and the cluster was back to a healthy state.
Tues 05/09 13:35 During disk removal by TB, the primary started flapping due to the EC backfilling bug; this was dealt with by removing the problem OSD, with no disruption.
Thurs 07/09 TB re-enabled scrubbing (but not deep-scrubbing). During the rest of the working day, four inconsistent PGs were found; they were fixed using the SOP. A decision was made to disable scrubbing outside of office hours.
Tues 19/09 15:00 During disk removal by TB, the primary started flapping due to the EC backfilling bug; this was dealt with by removing the problem OSD, with no disruption.
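Where the timeline refers to the normal way of removing an OSD, the sequence below is a sketch of the standard pre-Luminous removal procedure; the OSD ID is an example and the Echo SOP itself is not reproduced here.

 # Standard removal of a failing OSD (example ID 1713) on a Kraken-era cluster
 ID=1713
 
 # Mark the OSD out so its data is backfilled elsewhere, then wait for the
 # cluster to return to HEALTH_OK before proceeding
 ceph osd out ${ID}
 
 # On the host holding the OSD, stop the daemon
 systemctl stop ceph-osd@${ID}
 
 # Remove the OSD from the CRUSH map, delete its auth key and remove it from the OSD map
 ceph osd crush remove osd.${ID}
 ceph auth del osd.${ID}
 ceph osd rm ${ID}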

Incident details

Put a reasonably detailed description of the incident here.


Analysis

Follow Up

This is what we used to call future mitigation. Include specific points to be done. It is not necessary to use the table below, but may be easier to do so.


Issue Response Done
Issue 1 Mitigation for issue 1. Done yes/no
Issue 2 Mitigation for issue 2. Done yes/no

Related issues

Adding New Hardware

At the start of August the remaining 25 '15 generation disk servers that had not been put into production in March were added to the cluster. Unlike Castor, Ceph will automatically balance data across the servers. If disk servers are added one at a time, the rebalancing load is all focused on one node, so the data rebalancing either occurs very slowly or places significant load on that machine. It also results in significantly more rebalancing overall, as data moved onto the earlier new disk servers is moved again as later ones are deployed.
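The script below is a hypothetical illustration of the gentle-reweight approach referred to in the timeline: new OSDs are brought in at CRUSH weight 0 and stepped up gradually so the backfill load is spread over time and across all the new servers at once. The OSD IDs, target weight and step size are examples, and the actual gentle-reweight script used on Echo is not reproduced here.

 #!/bin/bash
 # Gradually raise the CRUSH weight of newly deployed OSDs in small steps,
 # waiting for each round of backfilling to finish before the next step.
 OSDS="1700 1701 1702"   # example IDs of newly deployed OSDs
 TARGET=5.458            # example full CRUSH weight for a 6 TB drive
 STEP=0.5
 
 WEIGHT=0
 while [ "$(echo "${WEIGHT} < ${TARGET}" | bc -l)" -eq 1 ]; do
     WEIGHT=$(printf '%.3f' "$(echo "${WEIGHT} + ${STEP}" | bc -l)")
     # Do not overshoot the final weight
     if [ "$(echo "${WEIGHT} > ${TARGET}" | bc -l)" -eq 1 ]; then
         WEIGHT=${TARGET}
     fi
     for ID in ${OSDS}; do
         ceph osd crush reweight osd.${ID} ${WEIGHT}
     done
     # Wait for the backfill triggered by this step to complete
     while ceph health | grep -q backfill; do
         sleep 300
     done
 done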


Disk replacement

From the beginning of this year it was recognised that the Echo cluster could not operate reliably if there were any outstanding disk issues. A procedure for handling disk issues was created: https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/EchoDiskReplacement

This procedure has been followed by the Ceph team, with the Fabric team carrying out the actual disk replacement after the Ceph team have removed the disk from the cluster. With the large number of disks in the cluster the rate of replacement was fairly high, and in July, following a meeting between the Fabric Team and the Ceph Team, the Fabric team manager proposed an alternative to the Fabric Team requesting a replacement disk for a Ceph server whenever only a single error had been seen:

"To deal with media errors on a disk, the following is proposed as a way forward: When a media error occurs, the disk should be ejected from Echo and reformatted to force a block remap, then returned to Echo. This can be automated using a script and driven by the Ceph team. When a disk has accumulated a media error count above a suitable limit that would enable the disk to be swapped out by the supplier, the disk should be replaced with a new disk using the existing procedure."

It appears that this alternative procedure was not put in place and that disks with media errors had been accumulating in the ensuing weeks. It is not clear why, as the disks could still have been removed from the cluster even if the next part of the procedure was still to be resolved.

The need to remove disks from the cluster if they have any media errors has been reiterated, and the Ceph team will work with the Fabric team to update the procedure to reflect any changes needed in the process of managing disk errors.
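As an illustration of the kind of check this implies, the script below is a hypothetical scan of a storage node's drives for media errors using smartctl; the device discovery, attribute names and zero threshold are assumptions rather than the Fabric team's actual monitoring.

 #!/bin/bash
 # Report any drive with reallocated, pending or offline-uncorrectable sectors
 # so that it can be flagged for removal from Echo.
 for DEV in /dev/sd?; do
     ERRORS=$(smartctl -A ${DEV} | awk '/Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable/ {sum += $10} END {print sum+0}')
     if [ "${ERRORS}" -gt 0 ]; then
         echo "${DEV}: ${ERRORS} suspect sectors - candidate for removal from Echo"
     fi
 done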

Reported by: Your Name at date/time

Summary Table

Start Date Date e.g. 20 July 2010
Impact Select one of: >80%, >50%, >20%, <20%
Duration of Outage Hours e.g. 3hours
Status select one from Draft, Open, Understood, Closed
Root Cause Select one from Unknown, Software Bug, Hardware, Configuration Error, Human Error, Network, User Load
Data Loss Yes