20180810 Echo OSD memory problem

Description

Impact

Service: Echo

VOs affected: ATLAS and CMS

Data Loss: No

Timeline of the Incident

When | What
Thursday 2018-08-09, 01:10 and 03:35 | JPK/RA: first callouts for this problem (high RAM usage on a storage node (SN) resulting in poor performance and problems peering). RA tried marking the OSD out (see the first sketch after this timeline). RT211477
Thursday 2018-08-09 and Friday 2018-08-10 | Situation gradually deteriorates. Callouts on Friday evening convince RA and JK to declare Echo down over the weekend pending a fix.
Saturday 2018-08-11 and Sunday 2018-08-12 | RA and JK make various attempts to fix the problems, with little success. RA texts TB (who was not on call) at one point and receives useful advice.
Monday 2018-08-13 | First disaster management review meeting. Agreed to call in WdH (consultant) for help.
Thursday 2018-08-16 | Second DM meeting. Progress made:
 - Tried a 'cold restart' of Echo. Didn't help.
 - Tested a doubling of the amount of RAM in a problematic SN. This allowed it to peer when, before the upgrade, it was not able to.
 - Found information on a bug in Ceph that caused a slow, permanent memory leak in the OSDs (the internal log was not being correctly trimmed). Tested an upgrade of the Ceph software which included the bugfix (see the second sketch after this timeline).
 - Tr
Friday 2018-02-09 to Monday 2018-02-12 | No access possible to genTape nodes, team unaware. We are unclear as to why this did not cause an alarm. Possibly due to caching?
Monday 2018-02-12 06:41 | Snoplus raise a GGUS ticket saying that "...jobs we are submitting are completing successfully but fail to transfer their outputs".
Monday 2018-02-12 09:30-10:30 | GP investigates the situation based on the GGUS ticket. He discovers that CASTOR commands on CASTOR Gen are unreliable: they often return an exception and, when they do work, they show large queues but no activity on the genTape disk servers. He tries SSH again (using both PuTTY and the shell) on 2-3 nodes without success (getting the same PuTTY message as before and "ssh: Could not resolve hostname gdssXYZ.gridpp.rl.ac.uk: Name or service not known" from a shell), then (at 10:15) emails KH and CW to inquire about the status of these disk servers, and walks over to talk to them 15 minutes later.
Monday 2018-02-12 10:30 | KH realises what has happened and sends a request to networking to fix the DNS entries.
Monday 2018-02-12 10:45 | Hardware starts coming back online. GP clears DMESG and restarts the diskmanagerd daemons. Large queues remain in the transfermanagers, which are busy failing all the queued transfers and not starting any new ones. The CASTOR team decide to wait and see whether the scheduler quickly clears the queues rather than intervening directly.
Monday 2018-02-12 11:45 | Patience waiting for the scheduling queues to clear is exhausted and RA restarts both the transfermanager and diskmanager daemons to forcibly clear the queues ('the big hammer'). Data transfers begin again, including an uptick in network activity as seen in Ganglia.
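
The first callout entry above mentions high RAM usage on a storage node and an attempt to mark the affected OSD out. The report does not record how the memory-hungry OSD was identified, so the following is only a minimal sketch, in Python, of how per-OSD resident memory could be surveyed on one storage node before deciding which OSD to pass to the standard 'ceph osd out <id>' command. It assumes a Linux node whose ceph-osd daemons were started with an '--id <N>' (or '-i <N>') argument, as the usual systemd units do; none of the names or details below come from the incident itself.

 #!/usr/bin/env python3
 """Sketch: report per-OSD resident memory on one Echo storage node.

 Assumptions (not from the incident report): the node is Linux and its
 ceph-osd daemons appear in the process table as 'ceph-osd ... --id <N>'.
 """
 import re
 from pathlib import Path

 def osd_memory_usage():
     """Return (osd_id, rss_kib) pairs for local ceph-osd processes, largest first."""
     usage = []
     for proc in Path("/proc").iterdir():
         if not proc.name.isdigit():
             continue
         try:
             cmdline = (proc / "cmdline").read_bytes().split(b"\0")
             status = (proc / "status").read_text()
         except OSError:
             continue  # process exited while being inspected
         if not cmdline or b"ceph-osd" not in cmdline[0]:
             continue
         args = [part.decode() for part in cmdline if part]
         osd_id = None
         for i, arg in enumerate(args):
             if arg in ("--id", "-i") and i + 1 < len(args):
                 osd_id = args[i + 1]  # assumed daemon invocation style
         rss = re.search(r"VmRSS:\s+(\d+)\s+kB", status)
         if osd_id is not None and rss:
             usage.append((osd_id, int(rss.group(1))))
     return sorted(usage, key=lambda pair: pair[1], reverse=True)

 if __name__ == "__main__":
     for osd_id, rss_kib in osd_memory_usage():
         print(f"osd.{osd_id}: {rss_kib / 1024 / 1024:.1f} GiB resident")
     # An operator could then remove the heaviest OSD from data placement
     # with the standard Ceph CLI:  ceph osd out <id>

The sketch only reports; actually marking an OSD out changes data placement, so that final step is deliberately left to the operator.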
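
The 16 August entry records testing a Ceph upgrade that carried the memory-leak fix. A rolling upgrade is only complete once every daemon reports the patched version, and the second sketch below shows one hypothetical way to confirm that. It assumes the 'ceph' CLI is available with admin credentials and that the cluster runs a release recent enough to provide the 'ceph versions' command (Luminous or later); it is illustrative rather than a record of what was actually run.

 #!/usr/bin/env python3
 """Sketch: confirm every Ceph daemon reports the same version after an upgrade.

 Assumes the 'ceph' CLI is on the PATH with admin credentials and a release
 recent enough to provide 'ceph versions' (Luminous or later). Illustrative only.
 """
 import json
 import subprocess

 def daemon_versions():
     """Parse 'ceph versions', which maps daemon types to {version string: count}."""
     out = subprocess.run(
         ["ceph", "versions", "--format", "json"],
         check=True, capture_output=True, text=True,
     ).stdout
     return json.loads(out)

 if __name__ == "__main__":
     # The 'overall' section counts daemons per distinct version string;
     # more than one key means the upgrade has not fully rolled out.
     overall = daemon_versions().get("overall", {})
     if len(overall) == 1:
         print("All daemons report:", next(iter(overall)))
     else:
         print("Mixed versions still running:")
         for version, count in sorted(overall.items()):
             print(f"  {count:4d} daemon(s) on {version}")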

Incident details

Summary

Detailed report

Analysis

Follow Up

This is what we used to call future mitigation. Include specific points to be done. It is not necessary to use the table below, but it may be easier to do so.


Issue | Response | Done
Issue 1 | Mitigation for issue 1. | Done yes/no
Issue 2 | Mitigation for issue 2. | Done yes/no

Related issues


Adding New Hardware

Disk replacement

Reported by: Your Name at date/time

Summary Table

Start Date | Date e.g. 20 July 2010
Impact | Select one of: >80%, >50%, >20%, <20%
Duration of Outage | Hours e.g. 3 hours
Status | Select one from: Draft, Open, Understood, Closed
Root Cause | Select one from: Unknown, Software Bug, Hardware, Configuration Error, Human Error, Network, User Load
Data Loss | Yes