Difference between revisions of "RAL Tier1 Incident 20150408 network intervention preceding Castor upgrade"

From GridPP Wiki
(Created page with "==RAL-LCG2 Incident 20140218 FTS3 server migration and Outage =====Description:=== The RAL tier 1 runs a test FTS3 service. The machines hosting this service are all virtual m...")
 
Line 1: Line 1:
==RAL-LCG2 Incident 20140218 FTS3 server migration and Outage==
===Description:===
+
==RAL-LCG2 Incident 20150408 network intervention preceding Castor upgrade==
===Description:===
The RAL Tier 1 runs a test FTS3 service. The machines hosting this service are all virtual machines running on the Microsoft Hyper-V platform. There are two machine rooms at RAL. For various reasons the VMs hosting the FTS3 service were all on a Hyper-V cluster in the Atlas machine room. Work was required on the machine room infrastructure which would interrupt connectivity to the machines therein, so it was decided to migrate the VMs hosting the FTS3 service to an alternative Hyper-V cluster in the main R89 building.
+
  
The most critical part of this operation was moving the FTS3 MySQL server, as none of the agents could run while the MySQL server was unavailable. It was decided to shut down the FTS3 service and move the MySQL server first. The service could then be brought up and the other machines drained and moved individually.
+
Change control for Castor upgrade:
The move of the FTS3 MYSQL server did not complete successfully. The VM remained on the original cluster with the Hyper-V management software reporting that it was in an unmanageable state. In spite of this the machine was available and functional. It was decided to turn on the FTS3 service and continue migrating the other machines.  
+
https://helpdesk.gridpp.rl.ac.uk/Ticket/Display.html?id=148453
 +
 
 +
Castor upgrade procedure:
 +
https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/CastorUpgradeTo211415
 +
 
 +
RT ticket tracking upgrade:
 +
https://helpdesk.gridpp.rl.ac.uk/Ticket/Display.html?id=149684
 +
 
 +
The Castor team were aware of the network intervention but did not include it in their plan as it was believed to be minor.
  
At approximately 16:00 the FTS3 MYSQL server stopped responding causing the entire FTS3 service to stop working. It was decided to build a new FTS3 MYSQL server on a VM in an R89 Hyper-V cluster.
 
The service was finally restored at approx 19:00
 
  
 
===Impact===
 
===Impact===
The FTS3 service was unavailable for two periods of time on the 18th Feb 2014. The first being from 09:00 to 11:50. The second was from 16:00 to 19:00. As a result of the second outage, any transfers that were 'in flight' at the time of the outage were also lost.
+
The entire Tier 1 and all services were declared in downtime until the following day.
  
 
===Timeline of the Incident===
 
===Timeline of the Incident===
Line 19: Line 24:
 
!What
 
!What
 
|-
 
|-
| 18/02/2014 09:00 – 10:00
+
| 8/04/2015 10:00
| [AL] Drain FTS3 and stop the MySQL service on mysql-fts3.gridpp.rl.ac.uk
+
| Downtime starts in the GOCDB. Very shortly afterwards the network dropped (the offices could not contact the machine room). The Castor team (Rob A) spoke to Fabric. Martin Bly was surprised that the network had gone down, but he also believed Castor was already down.
 
|-
 
|-
| 18/02/2014 10:00
+
| 8/04/2015 10:15 - 10:40
| [TI] Begin migration of mysql-fts3 VM from clusR26prod01 to clusR89prod01 (lcg-hv24)
+
| Network is restored and Castor is brought down cleanly. The DB team is asked to start work. Rob A commits the Quattor change for the software upgrade (as planned).
 
|-
 
|-
| 18/02/2014 ~11:35
+
| 8/04/2015 11:45
| Migration fails. VMM turns off the VM in clusR26prod01 and marks it as Failed.
+
| Castor change is abandoned and rollback starts. Note: the DB team never started the upgrade.
|-
+
| 18/02/2014 11:40
+
| [TI] The VM is unmanageable in VMM. FCM shows the machine offline. Attempted to bring the machine online in FCM.
+
|-
+
| 18/02/2014 11:50
+
| [TI] FCM brings the machine online successfully. Informs Andrew that the VM is up and running.
+
|-
+
| 18/02/2014 ~12:00
+
| [AL] Turns on FTS3 service
+
|-
+
| 18/02/2014 12:30-14:00
+
| [TI] Following discussions between PT and AL, attempt to move mysql-fts01.gridpp.rl.ac.uk to lcghv-24. Migration of mysql-fts01 fails and is in a similar state as mysql-fts3. Attempt migration of mysql-fts02 to another hypervisor in clusR89prod01.
+
|-
+
| 18/02/2014 14:30
+
| Migration of mysql-fts02 completes successfully
+
|-
+
| 18/02/2014 14:45-15:30
+
| Further discussions between PT and AL. Decided to
+
* Leave the VM in clusR26prod01.
+
* Drain FTS3 and stop the MySQL service for the intervention on Wednesday morning
+
* Discuss with VOs and schedule a move of the database after the intervention.
+
|-
+
| 18/02/2014 ~16:10
+
| AL reports that mysql-fts3 is unresponsive
+
|-
+
| 18/02/2014 16:20
+
| Nagios callout “Can’t connect to MySQL server mysql-fts3”
+
|-
+
| 18/02/2014 16:45
+
| Investigations reveal that the mysql-fts3 virtual machine and its disk image are missing from clusR26prod01 cluster.
+
|-
+
| 18/02/2014 17:15
+
| Following discussion between PT, AL and EG decided to
+
*[AL] Create new VM for mysql-fts3 on clusR89prod01
+
*[TI] Finish migration of remaining FTS3 frontend to clusR89prod01
+
*[AL] Start FTS3 service on new DB
+
|-
+
| 18/02/2014 19:00
+
| [AL] Restarted FTS3 service with new DB
+
 
|}
 
|}
  
 
===Incident details===
 
===Incident details===
  
 
Logs from hv27, the Hyper-V server the machines were being moved from:
 
<pre>
 
10:09:42 INFO : 'mysql-fts3.gridpp.rl.ac.uk' snapshot successfully.
 
(Virtual machine ID 2F40AEA1-4F54-460D-8961-938334B54A08)
 
10:14:52 INFO : 'mysql-fts3.gridpp.rl.ac.uk' snapshot successfully.
 
(Virtual machine ID 2F40AEA1-4F54-460D-8961-938334B54A08)
 
11:35:09 INFO : 'mysql-fts3.gridpp.rl.ac.uk' saved successfully.
 
(Virtual machine ID 2F40AEA1-4F54-460D-8961-938334B54A08)
 
11:35:34 INFO : The Cluster service successfully brought the clustered
 
service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' offline.
 
11:45:26 INFO : The Cluster service is attempting to bring the clustered
 
service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' online.
 
11:51:00 INFO : 'SCVMM mysql-fts3.gridpp.rl.ac.uk' successfully started
 
the virtual machine.
 
11:51:00 INFO : The Cluster service successfully brought the clustered
 
service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' online.
 
11:51:55 INFO : The Cluster service successfully brought the clustered
 
service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' offline.
 
11:55:32 INFO : The Cluster service is attempting to bring the clustered
 
service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' online.
 
11:55:38 INFO : 'mysql-fts3.gridpp.rl.ac.uk' started successfully.
 
(Virtual machine ID 2F40AEA1-4F54-460D-8961-938334B54A08)
 
11:55:38 INFO : 'SCVMM mysql-fts3.gridpp.rl.ac.uk' successfully started
 
the virtual machine.
 
11:55:38 INFO : The Cluster service successfully brought the clustered
 
service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' online.
 
15:59:12 INFO : 'mysql-fts3.gridpp.rl.ac.uk' saved successfully.
 
(Virtual machine ID 2F40AEA1-4F54-460D-8961-938334B54A08)
 
15:59:25 INFO : The Cluster service successfully brought the clustered
 
service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' offline.
 
15:59:26 INFO : The Cluster service successfully brought the clustered
 
service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' online.
 
15:59:26 INFO : The Cluster service successfully brought the clustered
 
service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' offline.
 
15:59:28 INFO : The Cluster service successfully brought the clustered
 
service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' online.
 
15:59:28 INFO : 'SCVMM mysql-fts3.gridpp.rl.ac.uk Configuration'
 
successfully registered the configuration for the virtual machine.
 
16:22:18 ERROR: 'mysql-fts3.gridpp.rl.ac.uk' failed to perform the
 
operation. The virtual machine is not in a valid state to perform the
 
operation. (Virtual machine ID 2F40AEA1-4F54-460D-8961-938334B54A08)
 
</pre>
 
 
 
Logs from hv24, the Hyper-V server the machines were being moved to:
 
<pre>
 
10:20:01 INFO : The Cluster service successfully brought the clustered
 
service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' online.
 
10:20:01 INFO : The Cluster service is attempting to bring the clustered
 
service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' online.
 
10:20:02 INFO : The Cluster service successfully brought the clustered
 
service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' offline.
 
11:40:47 ERROR: Import failed. Unable to save the virtual machine under
 
location 'C:\ClusterStorage\Volume1\mysql-fts3.gridpp.rl.ac.uk'. Error:
 
One or more arguments are invalid (0x80070057)
 
11:41:45 ERROR: Import failed. Unable to save the virtual machine under
 
location 'C:\ClusterStorage\Volume1\mysql-fts3.gridpp.rl.ac.uk'. Error:
 
One or more arguments are invalid (0x80070057)
 
11:42:34 ERROR: Import failed. Unable to save the virtual machine under
 
location 'C:\ClusterStorage\Volume1\mysql-fts3.gridpp.rl.ac.uk'. Error:
 
One or more arguments are invalid (0x80070057)
 
11:43:58 ERROR: Import failed. Unable to save the virtual machine under
 
location 'C:\ClusterStorage\Volume1\mysql-fts3.gridpp.rl.ac.uk'. Error:
 
One or more arguments are invalid (0x80070057)
 
11:51:59 INFO : The Cluster service is attempting to bring the clustered
 
service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' online.
 
11:51:59 ERROR: Cluster resource 'SCVMM mysql-fts3.gridpp.rl.ac.uk
 
Configuration' in clustered service or application 'SCVMM
 
mysql-fts3.gridpp.rl.ac.uk Resources' failed.
 
11:51:59 ERROR: 'SCVMM mysql-fts3.gridpp.rl.ac.uk Configuration' failed
 
to register the virtual machine with the virtual machine management service.
 
11:52:00 ERROR: The Cluster service failed to bring clustered service or
 
application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' completely
 
online or offline. One or more resources may be in a failed state. This
 
may impact the availability of the clustered service or application.
 
</pre>
 
  
 
===Analysis===
 
===Analysis===

Revision as of 17:21, 8 April 2015

RAL-LCG2 Incident 20150408 network intervention preceding Castor upgrade

Description:

Change control for Castor upgrade: https://helpdesk.gridpp.rl.ac.uk/Ticket/Display.html?id=148453

Castor upgrade procedure: https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/CastorUpgradeTo211415

RT ticket tracking upgrade: https://helpdesk.gridpp.rl.ac.uk/Ticket/Display.html?id=149684

The Castor team were aware of the network intervention but did not include it in their plan as it was believed to be minor.


Impact

The entire Tier 1 and all services were declared in downtime until the following day.

Timeline of the Incident

When What
8/04/2015 10:00 Downtime starts in the GOCDB. Very shortly afterwards the network dropped (the offices could not contact the machine room). The Castor team (Rob A) spoke to Fabric. Martin Bly was surprised that the network had gone down, but he also believed Castor was already down.
8/04/2015 10:15 - 10:40 Network is restored and Castor is brought down cleanly. The DB team is asked to start work. Rob A commits the Quattor change for the software upgrade (as planned).
8/04/2015 11:45 Castor change is abandoned and rollback starts. Note: the DB team never started the upgrade.

Incident details

Analysis

This section to include a breakdown of what happened. Include any related issues.


Follow Up

This is what we used to call future mitigation. Include specific points to be done. It is not necessary to use the table below, but may be easier to do so.


Issue Response Done
Issue 1 Mitigation for issue 1. Done yes/no
Issue 2 Mitigation for issue 2. Done yes/no

Related issues

List any related issue and provide links if possible. If there are none then remove this section.


Reported by: Your Name at date/time

Summary Table

Start Date 18th February 2014
Impact Select one of: >80%, >50%, >20%, <20%
Duration of Outage 2 x 3 hour outages, 6 hours in total
Status Draft
Root Cause Select one from Unknown, Software Bug, Hardware, Configuration Error, Human Error, Network, User Load
Data Loss Yes/No