RAL Tier1 Incident 20180212 genTape broken overweekend
 

20180212 network misconfiguration of CASTOR genTape

Description

On the 9th February 2018, the disk servers that make up genTape were removed from the DNS. The removal request, raised to decommission the remaining 2011-generation disk servers, mistakenly included the nodes still in production in genTape, and those nodes remained unreachable until the DNS entries were restored on the morning of Monday 12th February 2018.
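The root cause was a decommissioning request that swept in hostnames still serving genTape. Purely as an illustration, the sketch below shows the kind of pre-flight check that could catch this: it compares a proposed DNS-removal list against a list of hosts still in production and refuses to proceed if they overlap. The file names and host lists are hypothetical placeholders, not part of the actual RAL tooling.

```python
#!/usr/bin/env python3
"""Hedged sketch: refuse a DNS-removal request that includes in-production hosts.

The file names below (decommission_list.txt, production_hosts.txt) are
hypothetical placeholders; a real check would pull the production list from
the site's configuration database rather than a flat file.
"""

def read_hosts(path):
    """Return the set of non-empty, non-comment hostnames in a file."""
    with open(path) as fh:
        return {line.strip() for line in fh
                if line.strip() and not line.startswith("#")}

def main():
    to_remove = read_hosts("decommission_list.txt")      # hosts requested for DNS removal
    in_production = read_hosts("production_hosts.txt")   # hosts still serving, e.g. genTape

    overlap = sorted(to_remove & in_production)
    if overlap:
        print("REFUSING: these hosts are still in production:")
        for host in overlap:
            print("  ", host)
        raise SystemExit(1)
    print("OK: no production hosts in the removal list")

if __name__ == "__main__":
    main()
```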

Impact

Service: CASTOR.

VOs affected: Alice & all non-LHC VOs.

Data Loss: No

Timeline of the Incident

Wednesday 2018-02-07: GP/KH/RA intervention on the 2011-generation genTape disk servers (gdss618-624) to physically move them between racks. This was necessary to make space for other hardware. It went to plan and was tracked in RT203921.

Thursday 2018-02-08, midday: KH commits an SCDB change to decommission the remaining 2011-generation disk servers. The SCDB change (correctly) omits the nodes still in genTape.

Friday 2018-02-09, morning: KH puts in a networking ticket asking to remove the hostnames of the 2011-generation nodes from the DNS. This mistakenly includes the nodes still in genTape. The request is actioned by the networking team.

Friday 2018-02-09, 09:50: Last evidence of a working transfer from a genTape disk server.

Friday 2018-02-09, ~14:30: Prompted by a failed DMESG Nagios test, GP attempts to ssh (using PuTTY) into two genTape disk servers without success (error message: 'gdssXYZ.gridpp.rl.ac.uk does not exist') but takes no further action.

Friday 2018-02-09 to Monday 2018-02-12: No access possible to the genTape nodes. It is unclear why this did not cause an alarm; possibly due to DNS caching (see the resolution-check sketch after this timeline).

Monday 2018-02-12, 06:41: Snoplus raise a GGUS ticket saying that "...jobs we are submitting are completing successfully but fail to transfer their outputs".

Monday 2018-02-12, 09:30-10:30: GP investigates the situation based on the GGUS ticket. He discovers that CASTOR commands on CASTOR Gen are unreliable: they often return an exception and, when they do work, they show large queues but no activity on the genTape disk servers. He tries SSH again (using both PuTTY and a shell) on 2-3 nodes without success (getting the same PuTTY message as before and "ssh: Could not resolve hostname gdssXYZ.gridpp.rl.ac.uk: Name or service not known" from the shell), and then (at 10:15) emails KH and CW to ask about the status of these disk servers, following up in person 15 minutes later.

Monday 2018-02-12, 10:30: KH realises what has happened and sends a request to networking to fix the DNS entries.

Monday 2018-02-12, 10:45: Hardware starts coming back online. GP clears the DMESG alerts and restarts the diskmanagerd daemons. Large queues remain in the transfermanagers, which are busy failing all the queued transfers and not starting any new ones. The CASTOR team decide to wait and see if the queues are quickly cleared by the scheduler rather than intervene directly.

Monday 2018-02-12, 11:45: Patience waiting for the scheduling queues to clear is exhausted and RA restarts both the transfermanager and diskmanager daemons ('the big hammer') to forcibly clear the queues. Data transfers begin again, including an uptick in network activity as seen in Ganglia.

Incident details

Summary

Detailed report

Analysis

Follow Up

This is what we used to call future mitigation. Include specific points to be done. It is not necessary to use the table below, but may be easier to do so.


Issue Response Done
Issue 1 Mitigation for issue 1. Done yes/no
Issue 2 Mitigation for issue 2. Done yes/no

Related issues


Adding New Hardware

At the start of August the remaining 2015-generation disk servers (25 of them) that had not been put into production in March were added to the cluster. Unlike Castor, Ceph automatically balances data across the servers. If disk servers are added one at a time, the rebalancing load is all focused on one node, so the rebalancing either proceeds very slowly or places significant load on that machine. It also results in significantly more rebalancing overall, as data is moved between the new disk servers themselves as each one is deployed.
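The claim that adding servers one at a time causes more total data movement than adding them together can be illustrated with a toy calculation. This is only a back-of-the-envelope sketch assuming perfectly uniform balancing across equal-sized servers (real Ceph/CRUSH placement is more complicated); the count of 25 new servers comes from the paragraph above, while the number of existing servers is a hypothetical example.

```python
#!/usr/bin/env python3
"""Toy model: data moved when expanding a uniformly balanced cluster.

Assumes every server holds an equal share of the data and that rebalancing
always restores a perfectly uniform distribution. This is a simplification of
Ceph/CRUSH behaviour, used only to illustrate the scaling argument.
"""

def moved_all_at_once(existing, new):
    """Fraction of total data moved if `new` servers are added in one step."""
    return new / (existing + new)

def moved_one_at_a_time(existing, new):
    """Fraction of total data moved (summed over steps) adding servers singly."""
    total = 0.0
    for i in range(new):
        servers_before = existing + i
        # Each step moves the share that the single new server must take on.
        total += 1.0 / (servers_before + 1)
    return total

if __name__ == "__main__":
    existing, new = 60, 25   # 25 new servers joining a hypothetical 60 existing ones
    print("all at once  : %.1f%% of the data moved" % (100 * moved_all_at_once(existing, new)))
    print("one at a time: %.1f%% of the data moved (cumulative)" % (100 * moved_one_at_a_time(existing, new)))
```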

Disk replacement

From the beginning of this year it was decided that the Echo cluster could not operate reliably if there were any disk issues. A procedure for handling disk errors was created (https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/EchoDiskReplacement) and has been followed by the Ceph team, involving the Fabric team in the actual disk replacement after the Ceph team have removed the disk from the cluster. With the large number of disks in the cluster the rate of replacement was fairly high, and in July, after a meeting between the Fabric team and the Ceph team, the Fabric team manager proposed an alternative to requesting a replacement disk for the Ceph servers when only one error had been seen:

"To deal with media errors on a disk, the following is proposed as a way forward: when a media error occurs, the disk should be ejected from Echo and reformatted to force a block remap, then returned to Echo. This can be automated using a script and driven by the Ceph team. When a disk has accumulated a media error count above a suitable limit that would enable the disk to be swapped out by the supplier, the disk should be replaced with a new disk using the existing procedure."

It appears that this alternative procedure was not put in place and that disks with media errors had been accumulating in the ensuing weeks. It is not clear why, as the disks could still have been removed from the cluster even if the next part of the procedure was still to be resolved.

The need to remove disks from the cluster if they have any media errors has been reiterated, and the Ceph team will work with the Fabric team to update the procedure as necessary to reflect any changes needed in the process of managing disk errors.
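The quoted proposal notes that ejecting a disk with a media error, reformatting it and returning it "can be automated using a script". Below is only a rough, hedged sketch of what such a driver might look like, calling standard Ceph commands via subprocess; the OSD id is an example, the reformat step is deliberately left as a stub, and the commands and safety checks would need to be verified against the Ceph version actually deployed on Echo and the procedure linked above.

```python
#!/usr/bin/env python3
"""Hedged sketch of a driver for the 'eject, reformat, return' proposal.

This is NOT the RAL procedure; it is an illustration only. The OSD id is an
example and the reformat step is intentionally a placeholder.
"""
import subprocess
import time

def run(cmd):
    """Run a command, echoing it first; raise if it fails."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def eject_osd(osd_id):
    """Mark the OSD out so Ceph rebalances its data elsewhere, then stop it."""
    run(["ceph", "osd", "out", str(osd_id)])
    # Wait until the data has been re-replicated and the OSD can be taken down.
    while subprocess.run(["ceph", "osd", "safe-to-destroy", "osd.%d" % osd_id]).returncode != 0:
        time.sleep(60)
    run(["systemctl", "stop", "ceph-osd@%d" % osd_id])

def reformat_disk(osd_id):
    """Placeholder: reformat the underlying disk to force a block remap.

    The real step depends on the site's disk layout and provisioning tooling,
    so it is intentionally left unimplemented in this sketch.
    """
    raise NotImplementedError("site-specific reformat step")

def return_osd(osd_id):
    """Bring the OSD back into the cluster once the disk is healthy again."""
    run(["systemctl", "start", "ceph-osd@%d" % osd_id])
    run(["ceph", "osd", "in", str(osd_id)])

if __name__ == "__main__":
    OSD_ID = 123  # example only
    eject_osd(OSD_ID)
    reformat_disk(OSD_ID)
    return_osd(OSD_ID)
```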

Reported by: Your Name at date/time

Summary Table

Start Date: 9 February 2018
Impact: Select one of: >80%, >50%, >20%, <20%
Duration of Outage: approximately 74 hours (Friday 2018-02-09 09:50 to Monday 2018-02-12 11:45)
Status: select one from Draft, Open, Understood, Closed
Root Cause: Human Error
Data Loss: No
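For reference, the outage duration quoted above is derived from the timeline (last working transfer Friday 09:50, transfers resuming Monday 11:45). A quick check of that arithmetic:

```python
#!/usr/bin/env python3
"""Derive the outage duration quoted in the Summary Table from the timeline."""
from datetime import datetime

# Timeline endpoints: last evidence of a working transfer, and transfers resuming.
start = datetime(2018, 2, 9, 9, 50)    # Friday 2018-02-09 09:50
end = datetime(2018, 2, 12, 11, 45)    # Monday 2018-02-12 11:45

outage = end - start
hours = outage.total_seconds() / 3600
print("Outage duration: %s (~%.0f hours)" % (outage, hours))
# Prints: Outage duration: 3 days, 1:55:00 (~74 hours)
```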