RAL Tier1 Incident 20180810 Echo OSD memory problem

Description

Impact

Service: Echo

VOs affected: ATLAS and CMS

Data Loss: No

Timeline of the Incident

When | What
Thursday 2018-08-09 | 0110 and 0335: JPK/RA received the first callouts for this problem (high RAM usage on an SN resulting in poor performance and problems peering). RA tried marking the OSD out (sketched below). RT211477
Thursday 2018-08-09 and Friday 2018-08-10 | The situation gradually deteriorated. Callouts on Friday evening convinced RA and JK to declare down over the weekend pending a fix.
Saturday 2018-08-11 and Sunday 2018-08-12 | RA and JK made various attempts to fix the problems, with little success. RA texted TB (who was not on call) at one point and received useful advice.
Monday 2018-08-13 | First disaster management (DM) review meeting. Agreed to call in WdH (consultant) for help.
Tuesday 2018-08-14 | TB found information on a bug in Echo that caused a slow, permanent memory leak in Ceph OSDs: the internal log was not being correctly trimmed. Tested an upgrade of the Ceph software which included a bugfix. Tried a 'cold restart' of Echo (sketched below); it did not help.
Wednesday 2018-08-15 | Consultation with WdH. He suggested a target version to upgrade to and noted that our SNs are marginal on RAM. Upgraded one production SN to the newer Ceph version, which includes the preventative bugfix and a tool for dealing with large existing logs. Tried trimming the logs on one SN (sketched below), which seemed to help. Began work on an order of 64 GB/node of extra RAM.
Thursday 2018-08-16 | Second DM review meeting. Another full cold start, which went better. Decided that if the cluster was usable following the Ceph software upgrade we would bring it back up on Friday, provided a set of tests passed.
Friday 2018-08-17 | Third DM review meeting. All tests passed. Agreed to end the downtime for CMS and ATLAS (but to block LHCb's access to their non-production endpoint). Agreed to cap jobs on the farm at 50% of pledge, and to declare down again in the event of problems over the weekend.
Monday 2018-08-20 | Final DM review meeting. Quiet weekend, memory usage stable, no misbehaviour. Increased VO job limits to match the pledged amount.
Ongoing activity | Continued RAM upgrades as a background task, removing one SN at a time.
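The immediate response to the first callouts was to mark the misbehaving OSD out so that its data would be re-replicated onto other OSDs. A minimal sketch of that step, assuming the standard Ceph CLI is available on the node; the OSD id is a placeholder, not a value from the incident:

#!/usr/bin/env python3
"""Sketch: mark one misbehaving OSD out and show the resulting cluster state.
The OSD id below is a placeholder, not one taken from the incident."""
import subprocess

OSD_ID = 123  # hypothetical id of the high-RAM OSD on the affected SN

# Mark the OSD out: Ceph stops placing new data on it and re-replicates
# the placement groups it holds onto other OSDs.
subprocess.check_call(["ceph", "osd", "out", str(OSD_ID)])

# Confirm the change and watch recovery progress.
print(subprocess.check_output(["ceph", "osd", "tree"], text=True))
print(subprocess.check_output(["ceph", "-s"], text=True))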

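The 'cold restart' attempts on 2018-08-14 and 2018-08-16 amount to restarting the OSD daemons while stopping Ceph from reacting to the temporary outage. A rough per-node sketch, assuming systemd-managed OSDs (ceph-osd.target) and the standard noout/norebalance cluster flags; this is not the exact procedure used on Echo:

#!/usr/bin/env python3
"""Sketch of a per-node step in a planned ('cold') restart of the OSDs.
Assumes systemd-managed daemons; not the exact Echo procedure."""
import subprocess

def run(*cmd: str) -> None:
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)

# Stop Ceph from marking the stopped OSDs out or shuffling data
# while the daemons are down.
run("ceph", "osd", "set", "noout")
run("ceph", "osd", "set", "norebalance")

# Restart every OSD daemon on this storage node.
run("systemctl", "restart", "ceph-osd.target")

# Once the OSDs are back up and the PGs have peered, clear the flags.
run("ceph", "osd", "unset", "norebalance")
run("ceph", "osd", "unset", "noout")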
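The 'internal log' behind the leak is presumably the per-PG log kept in OSD memory, and the 'tool for dealing with large, existing logs' likely corresponds to the trim-pg-log operation of ceph-objectstore-tool in later Ceph releases; check that the op exists in the installed version before relying on this. A sketch of trimming one stopped OSD, with the OSD id and data path as placeholders:

#!/usr/bin/env python3
"""Sketch: offline PG-log trim on one stopped OSD.
Assumes a ceph-objectstore-tool build that provides the 'trim-pg-log' op;
the OSD id and data path are placeholders."""
import subprocess

OSD_ID = 123
DATA_PATH = f"/var/lib/ceph/osd/ceph-{OSD_ID}"

# The object store can only be opened offline, so stop the OSD first.
subprocess.check_call(["systemctl", "stop", f"ceph-osd@{OSD_ID}"])

# List the placement groups held by this OSD ...
pgs = subprocess.check_output(
    ["ceph-objectstore-tool", "--data-path", DATA_PATH, "--op", "list-pgs"],
    text=True,
).split()

# ... and trim each PG's log back down to the configured maximum length.
for pgid in pgs:
    subprocess.check_call(
        ["ceph-objectstore-tool", "--data-path", DATA_PATH,
         "--pgid", pgid, "--op", "trim-pg-log"])

subprocess.check_call(["systemctl", "start", f"ceph-osd@{OSD_ID}"])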

Incident details

Summary

Detailed report

Analysis

Follow Up

This is what we used to call future mitigation. Include specific points to be done. It is not necessary to use the table below, but it may be easier to do so.

Issue | Response | Done
Issue 1 | Mitigation for issue 1. | Done yes/no
Issue 2 | Mitigation for issue 2. | Done yes/no

Related issues


Adding New Hardware

Disk replacement

Reported by: Your Name at date/time

Summary Table

Start Date | Date e.g. 20 July 2010
Impact | Select one of: >80%, >50%, >20%, <20%
Duration of Outage | Hours e.g. 3 hours
Status | Select one from: Draft, Open, Understood, Closed
Root Cause | Software Bug
Data Loss | No