RAL Tier1 Incident 20180212: genTape broken over the weekend
Description
On the 8th February 2018, an intervention was performed on the Viglen '11 generation disk servers (gdss618-624).
Impact
Service: CASTOR
VOs affected: ALICE and all non-LHC VOs
Data Loss: No
Timeline of the Incident
When | What
---|---
Fri 18/08 overnight into Saturday | Multiple call-outs due to inconsistent PGs. BC identified and removed osd.1290 (the primary for PG 1.138), per the SOP for read errors on disks. BC lowered the min_size setting for the 'atlas' pool on Echo to 8. BC set noscrub and nodeep-scrub on Echo so that the inconsistent PGs would start repairing sooner. The gentle-reweight script was stopped from running.
Sat 19/08 11:44 | BC called GV with PG 1.138 already down. We discussed and decided to upgrade Ceph to 11.2.1-0 (see RT #194003). Confirmed with AD via email.
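The actions in the timeline correspond to standard Ceph administrative commands. A minimal sketch of how they could be issued from an admin node is shown below; the Python wrapper itself is illustrative and assumes a working `ceph` client and admin keyring, with the OSD id, pool name and min_size value taken from the timeline rather than from any script actually used on Echo.

```python
#!/usr/bin/env python
"""Illustrative wrapper around the Ceph CLI actions noted in the timeline.

Assumes a working `ceph` client and admin keyring on the node running
the script. The helper and its lack of error handling are a sketch,
not the tooling actually used on Echo.
"""
import subprocess

def ceph(*args):
    """Run a ceph CLI command, echoing it first, and return its stdout."""
    cmd = ("ceph",) + args
    print("+ " + " ".join(cmd))
    return subprocess.check_output(cmd).decode()

# Take the failing OSD (primary for PG 1.138) out of data placement,
# per the SOP for disks showing read errors.
ceph("osd", "out", "osd.1290")

# Lower min_size on the 'atlas' pool to 8, as noted in the timeline,
# so that degraded PGs can continue to serve I/O.
ceph("osd", "pool", "set", "atlas", "min_size", "8")

# Disable scrubbing so that the inconsistent PGs start repairing sooner.
ceph("osd", "set", "noscrub")
ceph("osd", "set", "nodeep-scrub")
```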
Follow Up
This is what we used to call future mitigation. Include specific points to be done. It is not necessary to use the table below, but it may be easier to do so.
Issue | Response | Done |
---|---|---
Issue 1 | Mitigation for issue 1. | Done yes/no |
Issue 2 | Mitigation for issue 2. | Done yes/no |
Related issues
Adding New Hardware
At the start of August, the remaining 25 disk servers of the '15 generation that had not been put into production in March were added to the cluster. Unlike CASTOR, Ceph automatically balances data across servers. If disk servers are added one at a time, the rebalancing load is focused on a single node, so the rebalancing either proceeds very slowly or places significant load on that machine. Adding them one at a time would also result in significantly more rebalancing overall, as data would be moved between the new disk servers themselves as each one is deployed.
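One way of controlling this load is the "gentle reweight" approach referred to in the timeline above: new OSDs start with a low CRUSH weight, which is then raised in small steps so that backfill proceeds gradually. The sketch below illustrates the idea; the OSD ids, target weights, step size and wait-for-health logic are assumptions, not the script actually used on Echo.

```python
#!/usr/bin/env python
"""Sketch of a 'gentle reweight' loop for newly added OSDs.

Raises the CRUSH weight of each new OSD in small steps and waits for
the cluster to settle between steps, so that rebalancing load is
spread out rather than all hitting the cluster at once. The OSD list,
final weights, step size and polling interval are illustrative values.
"""
import json
import subprocess
import time

NEW_OSDS = {"osd.2000": 5.458, "osd.2001": 5.458}  # final CRUSH weights (example)
STEP = 0.5    # weight added per iteration
POLL = 300    # seconds between health checks

def ceph(*args):
    return subprocess.check_output(("ceph",) + args).decode()

def wait_until_settled():
    """Block until recovery/backfill has finished and health is OK."""
    while True:
        health = json.loads(ceph("health", "-f", "json"))
        # 'status' in newer Ceph releases, 'overall_status' in older ones
        if (health.get("status") or health.get("overall_status")) == "HEALTH_OK":
            return
        time.sleep(POLL)

current = {osd: 0.0 for osd in NEW_OSDS}
while any(current[osd] < target for osd, target in NEW_OSDS.items()):
    for osd, target in NEW_OSDS.items():
        if current[osd] < target:
            current[osd] = min(current[osd] + STEP, target)
            ceph("osd", "crush", "reweight", osd, str(current[osd]))
    wait_until_settled()
```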
Disk replacement
From the beginning of this year it was decided that the Echo cluster could not operate reliably if disks showing errors were left in the cluster. A procedure for handling disk errors was created (https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/EchoDiskReplacement) and has been followed by the Ceph team: the Ceph team removes the affected disk from the cluster, and the Fabric team then carries out the physical disk replacement. With the large number of disks in the cluster the rate of replacement was fairly high, and in July, following a meeting between the Fabric team and the Ceph team, the Fabric team manager proposed an alternative to requesting a replacement disk whenever only one error had been seen on a Ceph server disk:
"To deal with media errors on a disk, the following is proposed as a way forward: When a media error occurs, the disk should be ejected from Echo and reformatted to force a block remap, then returned to Echo. This can be automated using a script and driven by the Ceph team. When a disk has accumulated a media error count above a suitable limit that would enable the disk to be swapped out by the supplier, the disk should be replaced with a new disk using the existing procedure."
It appears that this alternative procedure was not put in place and that disks with media errors accumulated in the ensuing weeks. It is not clear why, as the disks could still have been removed from the cluster even while the rest of the procedure was being finalised.
The need to remove disks from the cluster if they show any media errors has been reiterated, and the Ceph team will work with the Fabric team to update the procedure to reflect any changes needed in how disk errors are managed.
Reported by: Your Name at date/time
Summary Table
Start Date | Date e.g. 20 July 2010
Impact | Select one of: >80%, >50%, >20%, <20%
Duration of Outage | Hours e.g. 3 hours
Status | Select one from: Draft, Open, Understood, Closed
Root Cause | Select one from: Unknown, Software Bug, Hardware, Configuration Error, Human Error, Network, User Load
Data Loss | Yes