RAL Tier1 Incident 20170818 first Echo data loss

Data Loss from RAL-LCG2-ECHO

Description:

On August 18th, a placement group (PG) on the ATLAS pool ran into a Ceph bug related to backfilling Erasure Coding (EC) pools. A placement group is a logical set of object storage daemons (OSDs) that store a subset of the data in a pool. Object storage daemons are the main building block of a Ceph cluster, responsible for storing objects on a local disk and for communicating and exchanging data with other OSDs. The PG was backfilling data onto a new OSD following the introduction of the last batch of '15 generation storage hardware. The bug was triggered by the primary OSD attempting to move a corrupted file. This caused the first three OSDs in the set to crash and start flapping until systemd gave up trying to restart them. The PG remained in a down state and could not serve IO directed at it, resulting in the failure of any transfer of a file with a stripe that was (or was going to be) stored in that PG. After trying for a week to recover the PG to an operational state with no success, it was decided to write off the PG and its data (23k ATLAS files) and recreate it to make the service fully available again.
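For reference, the state of a placement group and the set of OSDs backing it can be inspected directly with the Ceph command-line tool. The following is a minimal sketch, assuming admin access to the ceph command on a monitor or admin node; PG 1.138 is the one involved in this incident.

<pre>
"""Sketch: inspect the state and acting set of a placement group."""
import json
import subprocess

PG_ID = "1.138"  # the PG at the centre of this incident

def pg_query(pgid: str) -> dict:
    # `ceph pg <pgid> query` returns a JSON description of the PG.
    out = subprocess.check_output(["ceph", "pg", pgid, "query"])
    return json.loads(out)

if __name__ == "__main__":
    info = pg_query(PG_ID)
    print("state :", info["state"])    # e.g. active+clean, down, incomplete
    print("up    :", info["up"])       # OSDs CRUSH currently maps the PG to
    print("acting:", info["acting"])   # OSDs actually serving the PG
</pre>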

Impact

Service: Echo
VOs affected: ATLAS
Data Loss: Yes, 23k files


Timeline of the Incident

When | What
Fri 18/08 overnight into Saturday | Multiple call-outs due to inconsistent PGs. BC identified and removed osd.1290 (PG 1.138 primary) as per the SOP for read errors on disks. BC lowered the min_size setting for the 'atlas' pool on Echo to 8. BC set noscrub and nodeep-scrub on Echo to get the inconsistent PGs to start repairing sooner. Stopped the gentle-reweight script from running.
Sat 19/08 11:44 | BC called GV with PG 1.138 already down. We discussed and decided to upgrade Ceph to 11.2.1-0 (see RT #194003). Confirmed with AD via email.
Mon 21/08 10:39 | GV started testing the Kraken upgrade on dev. GV confirmed successful upgrade of the dev cluster to Ceph version 11.2.1-0.
Mon 21/08 15:00 | GV set nodown and norecover to prevent OSDs in the problem PG from flapping, to stabilise the cluster and allow ongoing backfilling to complete. GV removed the nodown and norecover flags once this had happened.
Tue 22/08 10:12 | JK set Echo endpoints to warning until 23/08 10:03, with the same time window for Nagios downtime on Echo health and functional checks.
Tue 22/08 11:22 | GV removed osd.1632, which was flapping, and re-introduced the OSDs previously removed by BC (1290, 1713). This was done to try to provide Ceph with enough information to peer the PG.
Tue 22/08 14:54 | Echo MONs and any OSDs belonging to PG 1.138 upgraded to version 11.2.1-0.
Tue 22/08 17:20 | GV set configuration options to prioritise the recovery/backfill operations that needed to complete in order to advance diagnosis and resolution of the PG 1.138 problem.
Tue 22/08 17:46 | Post-upgrade, the first three OSDs [1290, 927, 672] of PG 1.138 started flapping again with a more relevant error message but the same underlying problem. They were stopped and marked out to prevent peering storms in the cluster.
Wed 23/08 | AP submitted a consultancy request. GV emailed the ceph-users mailing list with details of the problem and also requested help via the IRC channel.
Thu 24/08 10:23 | Downtime of the entire Echo service for 24 hours declared in GOCDB.
Thu 24/08 11:41 | Rest of the cluster was upgraded to version 11.2.1-0. GV disabled the first three OSDs for PG 1.138 so they wouldn't come back upon cluster restart. GV paused Echo IO to allow for faster recovery. GV restarted all MONs and OSDs on Echo to pick up Kraken 11.2.1-0.
Thu 24/08 12:15 | BC stopped and disabled all xrootd, gridftp and ceph-radosgw processes on the Echo gateways to prevent IO requests from reaching Echo.
Thu 24/08 13:20 | All of Echo (except gateways) running 11.2.1.
Thu 24/08 13:37 | GV removed the first 3 OSDs [1290, 927, 672] of PG 1.138 as they were crashing.
Thu 24/08 17:27 | GV re-introduced the removed OSDs [1290, 927, 672] as removing them didn't resolve the problem. The aim was to return the cluster to the closest possible state to that at the problem's occurrence.
Fri 25/08 11:00 | Phone call with Ceph consultant Wido. He verified the process that GV had gone through and the patching. Discussed a forced recreate of the PG.
Fri 25/08 13:39 | AL stopped XRootD on the WNs.
Fri 25/08 | It had been decided to accept the loss of 23k ATLAS files and attempt to recover the PG by means destructive to the data. GV stopped PG 1.138's set [1290 672 927 456 177 1094 194 1513 236 302 1326]. OSDs 1290, 672, 927 and 1513 had corrupted data. GV started by manually removing the PG from these OSDs' datastores. The hope was that the broken PG data could be recreated by EC computation; however, because 4 OSDs had corruption there was not enough data to proceed. The PG went remapped+incomplete. A failed attempt was made to force create the PG anew. The manual removal operation was then performed on all OSDs, which still didn't suffice. The PG went into the creating state and blocked there. PG query output indicated history-related issues, so a runtime argument was injected into the relevant OSDs to ignore history. The primary was then restarted and the PG went active+clean.

Since the placement group was recreated and the data declared lost, there have been a number of residual issues while attempting to return the cluster to normal operational status. Over the long August Bank Holiday weekend, Echo had scrubbing turned off and the noout flag set, both of which suppress mechanisms essential to the long-term health of the cluster.

When | What
Wed 30/08 | Removed the noout flag and upgraded the gateways to 11.2.1-0. Scrubbing was not turned back on (on the Friday it had been agreed that it should be switched back on on the Wednesday).
Thu 31/08 16:19 | TB set the CRUSH weight of OSDs 1713 and 726 to 0 for replacement due to disk errors. This was not the normal way to remove OSDs from the cluster and in hindsight was an error. BC turned scrubbing back on.
Thu 31/08 16:48 | OSDs started flapping, due to 1713 and 726 participating in the backfill of PGs 1.156 and 1.652 and encountering read errors. Both PGs quickly became incomplete. BC turned scrubbing back off.
Thu 31/08 18:30 | Downtime for the entire Echo service was declared until midday on 1st September, subsequently extended to 17:00.
Fri 01/09 09:30 | The bug which caused read errors to crash the primary was identified while looking through the logs of osd.17 and osd.1713.
Fri 01/09 12:00 | TB removed the problem OSDs that had either caused crashes or were reporting many disk errors, and restarted all the crashed OSDs. All PGs became active, and any newly crashing OSDs were diagnosed and the offending OSD removed.
Fri 01/09 17:00 | Gateways were re-enabled (external and internal).
Fri 01/09 22:38 | Another incomplete PG due to the bug occurred; it was dealt with by TB with minimal disruption to the service.
Sat 02/09 10:00 | Backfilling of the recently removed OSDs had finished and the cluster was back to a healthy state.
Tue 05/09 13:35 | During disk removal by TB, a primary started flapping due to the EC backfilling bug; dealt with by removing the problem OSD, with no disruption.
Thu 07/09 | TB re-enabled scrubbing (but not deep-scrubbing). During the rest of the working day, four inconsistent PGs were found and fixed using the SOP. A decision was made to disable scrubbing outside of office hours.
Tue 19/09 15:00 | During disk removal by TB, a primary started flapping due to the EC backfilling bug; dealt with by removing the problem OSD, with no disruption.

Incident details

Summary

On August 18th, PG 1.138 on the ATLAS pool ran into a Ceph bug related to backfilling EC pools. The PG was backfilling because it was part of the data re-balancing following the introduction of the last batch of '15 generation storage hardware. The bug was triggered by the primary OSD attempting to move a corrupted file. This caused the first three OSDs in the set to crash and start flapping until systemd gave up trying to restart them. The PG remained in a down state and could not serve IO directed at it, resulting in the failure of any transfer of a file with a stripe that would be placed in that PG. After trying for a week to recover the PG to an operational state with no success, it was decided to write off the PG and its data (23k ATLAS files) and recreate it to make the service fully available again.

Detailed report

In the week commencing Mon 14th Aug, the Echo storage cluster was rebalancing its data onto a set of newly introduced storage nodes. A large number of PGs were in a backfill (or backfill_wait) state, moving their data across to newly chosen OSDs. This was being done by means of a weighting mechanism that Ceph provides: the disks had been introduced at low weight and were gradually being brought up to their final, full weight by a script that performs this in steps, waiting for a healthy state between iterations. As the storage nodes (SNs) had been sitting in racks for some time, running but not in use, a proportion of their disks had bad sectors that had not been identified and marked as such. As these sectors were reached, Ceph started reporting inconsistencies in the data, a condition it is normally able to deal with cleanly and without affecting the data.
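The reweighting script itself is not reproduced in this report; the sketch below only illustrates the general shape of such a stepped reweight loop, assuming admin access to the ceph command and treating HEALTH_OK as the "healthy state" waited for between iterations. The OSD ids, step size and target weight are illustrative.

<pre>
"""Sketch of a stepped CRUSH reweight loop (illustrative, not the production script)."""
import subprocess
import time

STEP = 0.5                     # CRUSH weight added per iteration (illustrative)
TARGET = 5.5                   # final weight for the new disks (illustrative)
NEW_OSDS = [1700, 1701, 1702]  # hypothetical ids of newly introduced OSDs

def cluster_healthy() -> bool:
    # `ceph health` prints HEALTH_OK / HEALTH_WARN / HEALTH_ERR.
    return subprocess.check_output(["ceph", "health"]).decode().startswith("HEALTH_OK")

def crush_reweight(osd_id: int, weight: float) -> None:
    subprocess.check_call(["ceph", "osd", "crush", "reweight", f"osd.{osd_id}", str(weight)])

weight = 0.0
while weight < TARGET:
    weight = min(weight + STEP, TARGET)
    for osd in NEW_OSDS:
        crush_reweight(osd, weight)
    # Wait for the backfill triggered by this step to finish before the next one.
    while not cluster_healthy():
        time.sleep(300)
</pre>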

On Fri 18th Aug, the Ceph-on-duty (CoD) was called to investigate a cluster-wide error state. Unlike the previous occurrences of inconsistencies, the primary OSD was crashing consistently during the backfilling of PG 1.138, part of the pool that serves the ATLAS VO. The CoD removed the offending OSD from the problem PG, on the assumption that the problem was caused by on-disk data or the OSD's recorded state. The CoD also halted further iterations of the rebalancing process (the data movement initiated by the last iteration was still in progress) and set the noscrub and nodeep-scrub flags to prevent scrubbing operations from contending with the recovery and backfill operations the troubled PG would have to perform.
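The steps described above correspond to standard Ceph operations along the following lines; this is a hedged sketch rather than the exact commands run at the time, with osd.1290 being the PG 1.138 primary named in the timeline.

<pre>
import subprocess

def ceph(*args: str) -> None:
    subprocess.check_call(["ceph", *args])

# Stop scrubs from competing with the recovery/backfill the troubled PG needs.
ceph("osd", "set", "noscrub")
ceph("osd", "set", "nodeep-scrub")

# Take the crashing primary out of service so the PG can choose another
# primary and attempt to re-peer (full removal from CRUSH is a further step).
ceph("osd", "out", "1290")
</pre>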

The PG's primary duties moved temporarily to the OSDs ranked 1 and 2 and caused them to crash as well, leaving the PG with 8 OSDs up. Since the min_size on this pool is set to 9, the PG went into the down state and waited for enough OSDs to become available. In that state the PG can no longer serve IO, and transfers of files whose stripes mapped to that PG hung until they timed out.
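The availability arithmetic behind this is simple. The figures in this report (an 11-OSD set with min_size 9) are consistent with an 8+3 erasure-coding profile, which is assumed in the worked check below: the data remains reconstructible with 8 shards, but the PG stops serving IO as soon as fewer than 9 OSDs are up.

<pre>
# Worked check of the PG availability arithmetic.
# Assumption: EC profile of k=8 data chunks + m=3 coding chunks (consistent
# with the 11-OSD set and min_size of 9 quoted in this report).
K, M = 8, 3
SET_SIZE = K + M     # 11 OSDs in the PG's set
MIN_SIZE = 9         # pool min_size

osds_down = 3        # the first three OSDs crashed
osds_up = SET_SIZE - osds_down

print(osds_up >= K)         # True  -> the data can still be reconstructed
print(osds_up >= MIN_SIZE)  # False -> the PG goes down and blocks IO
</pre>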

At this point the CoD called another, off-duty member of the Ceph team to consult. The error message was unclear, and after an examination of the Ceph bug tracking systems it was decided to upgrade Ceph to version 11.2.1-0 to reverse a regression introduced by a bug fix that made it into 11.2.0-0. On Mon 21st Aug, the Ceph team successfully tested the upgrade on the pre-production environment.

While testing the upgrade, the Ceph team attempted to correct the problem by other means.

Flags were set to prevent the cluster from initiating recovery processes (which trigger the bug) and OSDs from being marked down (to prevent Ceph from reacting to the flapping of the OSDs). They were unset once the ongoing recovery processes had completed.

On Tue 22nd Aug, Primary on Call set Echo endpoints to 'warning' in GOCDB for 24 hours.

The Ceph team removed a flapping OSD (1632) and re-introduced those that were removed on Friday (1290, 1713) in an effort to bring the cluster to a state with complete information that would allow the PG to peer.

Following this unsuccessful attempt, the cluster's monitors and the OSDs belonging to the PG were restarted to pick up the new version of Ceph (11.2.1-0) which contained the regression's reversal.

The same behaviour was observed, but this time with a more relevant error message. The OSDs were stopped and marked out to prevent the cluster from experiencing peering storms and to allow it to settle into a stable state.

Wed 23rd Aug was spent consulting the Ceph community for help in identifying and correcting the problem.

On Thu 24th Aug, the entire Echo service was put into a GOCDB downtime for 24 hours. The cluster was stopped and restarted to complete the update to 11.2.1-0. External gateways to the service were stopped and the cluster set to 'pause' IO. Another attempt to resolve the issue by removing the crashing first three OSDs was made. This proved ineffective and the OSDs were returned to the cluster.

On Fri 25th Aug, Ceph consultant Wido den Hollander verified the Ceph team's understanding of the problem and the potential solutions. Internal gateways (from worker nodes) were stopped.

Shortly after talking to the consultant, the Ceph team and project leadership decided to write off the files hosted on the problem PG in order to restore the service.

The Ceph team implemented the procedure to recreate the PG (destructively). The OSDs were stopped and their datastores were manually marked complete, an (unsuccessful) attempt to resolve the incomplete state of the PG. When restarted, osd.1513 entered a crash-restart loop on account of datastore corruption. At this point the PG had lost 4 OSDs (more than the tolerable level of failure) and was unrecoverable. The Ceph team moved forward with the recreation process: the OSDs were stopped again, the datastores were manually wiped clean of the PG, the PG was force_recreated, the OSDs were then started, and an option indicating that Ceph should ignore history was temporarily set. The primary OSD in the set was restarted and the PG became active+clean.
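For the record, the shape of that destructive procedure is sketched below. It is a reconstruction, not the exact command sequence used: the OSD ids and data paths are placeholders, and this report does not name the "ignore history" option, so the option shown is an assumption.

<pre>
"""Sketch of the destructive PG recreation. Assumes the affected OSD daemons
are stopped for step 1 and that ceph-objectstore-tool is available on the
storage nodes."""
import subprocess

PG_ID = "1.138"
# Hypothetical (osd id, data path) pairs for the OSDs in the PG's set.
OSD_STORES = [(1290, "/var/lib/ceph/osd/ceph-1290"),
              (672,  "/var/lib/ceph/osd/ceph-672")]

# 1. With the OSDs stopped, remove the PG's data from each OSD's object store.
for osd_id, path in OSD_STORES:
    subprocess.check_call(["ceph-objectstore-tool", "--data-path", path,
                           "--pgid", PG_ID, "--op", "remove"])

# 2. Ask the cluster to recreate the now-empty PG.
subprocess.check_call(["ceph", "pg", "force_create_pg", PG_ID])

# 3. With the OSDs started again, temporarily tell them to ignore PG history
#    so peering can complete (the option name below is an assumption; the
#    report only says an "ignore history" option was injected at runtime).
for osd_id, _ in OSD_STORES:
    subprocess.check_call(["ceph", "tell", f"osd.{osd_id}", "injectargs",
                           "--osd_find_best_info_ignore_history_les=1"])
</pre>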

This resulted in the loss of 23k ATLAS files.

After the long weekend, on Wed 30th Aug, the external gateways were updated to 11.2.1-0.

On Thu 31st Aug, two OSDs (1713, 726) were weighted to 0 for draining as they were showing disk errors. Scrubbing was re-enabled. Later on the same day the participation of the draining OSDs in the backfill operation for PGs 1.156 and 1.652 triggered the same bug, causing the same type of cascading failure across the acting sets of these PGs and ultimately taking the PGs down+incomplete. Scrubbing was disabled again.

Echo was put in downtime again until 1st Sep midday, subsequently extended to 5pm on 1st Sep.

On Fri 1st Sep, the Ceph team, down to one person at this point, gained better insight into the problem by analysing the logs of the crashing OSDs. It was discovered that bad disks would cause OSDs to send corrupted data to their primaries during backfilling, causing the primaries to crash. A crash-restart loop would ensue until systemd gave up; the next OSD in rank would then become the primary temporarily, only to face the same issue. This cascading failure would eventually take down enough OSDs for the PG to become down+incomplete. Ceph team member Tom Byrne resolved the situation by removing the OSDs that were sending corrupted data to their primaries from the cluster as soon as the crash-restart loops began.
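Once the pattern was understood, the workaround reduces to removing the OSD that is serving the corrupted shard as soon as a primary enters a crash-restart loop. The sketch below assumes the offending OSD id has already been identified from the crashing primary's log; the log parsing itself is specific to the crash signature and is not reproduced here.

<pre>
import subprocess

def ceph(*args: str) -> None:
    subprocess.check_call(["ceph", *args])

def drop_bad_osd(osd_id: int) -> None:
    """Remove an OSD whose disk is returning corrupted shards during backfill,
    so the PG can backfill from the remaining healthy copies."""
    ceph("osd", "out", str(osd_id))                    # stop mapping data to it
    subprocess.check_call(["systemctl", "stop", f"ceph-osd@{osd_id}"])
    ceph("osd", "crush", "remove", f"osd.{osd_id}")    # drop it from the CRUSH map
    ceph("auth", "del", f"osd.{osd_id}")               # remove its auth key
    ceph("osd", "rm", str(osd_id))                     # delete the OSD entry
</pre>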

The gateways were re-enabled at 5pm on Fri 1st Sep.

Over the following week, this occurred 3 more times due to the backfilling caused by removing the OSDs that were triggering the bug. At the same time scrubbing was enabled during office hours only and began identifying large numbers of inconsistencies due to disk errors, mostly on the newly introduced storage nodes. Eventually, the problem disks were removed from the cluster, breaking the feedback loop.

Over the following weeks the cluster has been returning to normal operation and performance. The backfilling bug remains unfixed, but the Ceph team now have a method of dealing with it. Scrubbing and deep scrubbing are still not enabled 24/7.

Analysis

Prior to the problem occurring, the remainder of the '15 generation storage nodes had been introduced at low CRUSH weights and were gradually being weighted up to their final, full weights. During the first week of this process there was a marked increase in the number of inconsistent placement groups observed across the cluster. This pattern was not identified and analysed by the Ceph team at the time; it remains unknown whether those PGs were among those in the process of moving (or that had recently moved) due to the introduction of the additional storage hardware. As became clear soon after the problem occurred, the newly-introduced storage nodes hosted a number of disks with problems. Those disks had not been identified because they had not been in use, and the test conducted in pre-prod only identified a few problem disks. This is thought to be due to the write-only nature of that test: bad sectors and other such HDD issues are normally identified on read operations.
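A whole-device read pass is the kind of check that would have surfaced those sectors before the nodes went into service. The following is a minimal sketch, assuming direct read access to the data disks on a storage node; the device path is illustrative.

<pre>
"""Sketch: sequential read scan of a disk to surface media errors that a
write-only burn-in would miss."""
import os

DEVICE = "/dev/sdb"        # illustrative; run over every data disk in practice
CHUNK = 4 * 1024 * 1024    # read in 4 MiB chunks

def read_scan(device: str) -> list:
    """Return the byte offsets at which reads failed with an I/O error."""
    bad_offsets = []
    fd = os.open(device, os.O_RDONLY)
    offset = 0
    try:
        while True:
            try:
                data = os.pread(fd, CHUNK, offset)
            except OSError:              # EIO: unreadable sector(s) in this chunk
                bad_offsets.append(offset)
                data = b"\0"             # note the error and keep going
            if not data:
                break                    # end of device reached
            offset += CHUNK
    finally:
        os.close(fd)
    return bad_offsets

if __name__ == "__main__":
    print(read_scan(DEVICE))
</pre>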

PG 1.138 (ATLAS pool) was one of the PGs to move; an OSD backed by a problem disk was introduced to its set. The backfill process issued read operations on corrupted data, which triggered a known bug in Ceph concerning EC pools and backfill operations. The primary OSD of the PG entered a crash-restart loop until systemd gave up. This gradually cascaded to the OSDs in ranks 1 and 2 (Ceph ranks start at 0), taking down 3 of the PG's 11 OSDs and leaving it below the configured minimum of 9. The PG went down and the Ceph-on-duty, in an attempt to restore it to operation, lowered the threshold, to no avail. With the PG down, a significant number of ATLAS transfers were failing, causing resource contention problems in the gateways and affecting the other VOs on the service. The Ceph-on-duty then attempted to remove the crashing OSDs in an effort to backfill new OSDs into the PG, also to no avail, as at this point not enough information was available for the PG to complete the peering process. The rest of the Ceph team then became involved and the bug was identified as a regression introduced in the current version, with a fixed version available. It was decided to upgrade to this version of Ceph in an effort to prevent the crash from happening. This helped in that it allowed the OSDs to crash with more meaningful errors in their logs.

As the way to resolve this problem was unknown and there was no response from the online Ceph communities, a number of strategies for mitigating general PG problems were tried. The interfaces to the storage system were shut down to prevent load, and the service became fully unavailable. Most of the aforementioned strategies (such as removing offending OSDs, changing options regarding history and/or stopping backfilling) were ultimately fruitless, mostly because they rely on the PG having completed peering. With the other methods failing, it was decided to proceed with methods that could lead to data loss. Most notably, the offending PG's OSDs were all stopped and their objectstores manually manipulated to mark them complete (after a previous attempt had put the PG in an incomplete state). Once the incomplete state was resolved, however, a fourth OSD started complaining about corruption in its metadata store. At this point, with 4 of the 11 OSDs inoperable, data loss was inevitable.

The process for manually destroying the PG's data and recreating it was implemented, ~23k ATLAS files were lost, and the cluster became operable again. In the following days, a large number of errors were observed on the disks that were being introduced, including recurrences of this problem. Unlike in the first instance, on the subsequent occasions the OSDs that were removed were the ones holding the offending shards (the corrupted data triggering their primaries' crashes), and the affected PGs were able to recover with no loss of data or availability.

Follow Up


Related issues

Adding New Hardware

At the start of August, the remaining '15 generation disk servers (25 of them) that had not been put into production in March were added to the cluster. Unlike Castor, Ceph will automatically balance data across the servers. If disk servers are added one at a time, the rebalancing load is focused on a single node, so the data rebalancing either occurs very slowly or places significant load on that machine. It also results in significantly more rebalancing taking place, as data is moved between the new disk servers as they are deployed.


Disk replacement

From the beginning of this year it was decided that the Echo cluster could not operate reliably if there were any disk issues. A procedure for handling disk errors was created: https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/EchoDiskReplacement

This procedure has been followed by the Ceph team, with the Fabric team carrying out the actual disk replacement after the Ceph team have removed the disk from the cluster. With the large number of disks in the cluster the rate of replacement was fairly high, and in July, after a meeting between the Fabric Team and the Ceph Team, the Fabric team manager proposed an alternative to the Fabric Team requesting a replacement disk whenever only a single error had been seen on a Ceph server disk:

"To deal with media errors on a disk, the following is proposed as a way forward: When a media error occurs, the disk should be ejected from Echo and reformatted to force a block remap, then returned to Echo. This can be automated using a script and driven by the Ceph team. When a disk has accumulated a media error count above a suitable limit that would enable the disk to be swapped out by the supplier, the disk should be replaced with a new disk using the existing procedure."

It appears that this alternative procedure was not put in place and that disks with media errors had been accumulating in the ensuing weeks. It is not clear why, as the disks could still have been removed from the cluster even while the rest of the procedure was being worked out.

The need to remove disks from the cluster if they have any media errors has been reiterated, and the Ceph team will work with the Fabric team to update the procedure as necessary to reflect any changes needed in how disk errors are managed.

Reported by: Your Name at date/time

Summary Table

Start Date | 18 August 2017
Impact | Select one of: >80%, >50%, >20%, <20%
Duration of Outage | Hours e.g. 3 hours
Status | Select one from: Draft, Open, Understood, Closed
Root Cause | Software Bug (EC backfill bug triggered by disk hardware errors)
Data Loss | Yes