RAL Tier1 Incident 20150408 network intervention preceding Castor upgrade
RAL-LCG2 Incident 20140218 FTS3 server migration and Outage
Description
The RAL Tier1 runs a test FTS3 service. The machines hosting this service are all virtual machines running on the Microsoft Hyper-V platform. There are two machine rooms at RAL. For various reasons the VMs hosting the FTS3 service were all on a Hyper-V cluster in the Atlas machine room. Work was required on that machine room's infrastructure which would interrupt connectivity to the machines in it, so it was decided to migrate the VMs hosting the FTS3 service to an alternative Hyper-V cluster in the main R89 building.
The most critical part of this operation was moving the FTS3 MySQL server, as none of the agents could run while the MySQL server was unavailable. It was decided to shut down the FTS3 service and move the MySQL server first; the service could then be brought back up and the other machines drained and moved individually. The move of the FTS3 MySQL server did not complete successfully. The VM remained on the original cluster, with the Hyper-V management software reporting that it was in an unmanageable state. In spite of this, the machine was available and functional. It was decided to turn on the FTS3 service and continue migrating the other machines.
At approximately 16:00 the FTS3 MySQL server stopped responding, causing the entire FTS3 service to stop working. It was decided to build a new FTS3 MySQL server on a VM in an R89 Hyper-V cluster. The service was finally restored at approximately 19:00.
Impact
The FTS3 service was unavailable for two periods on 18th Feb 2014: the first from 09:00 to 11:50 and the second from 16:00 to 19:00. As a result of the second outage, any transfers that were 'in flight' at the time were also lost.
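The total downtime implied by these two windows can be checked with a few lines of Python. This is only an illustrative sketch: the window times are taken from the paragraph above, while the names `OUTAGES` and `outage_durations` are hypothetical and not part of any Tier1 tooling.

```python
from datetime import datetime, timedelta

# Outage windows on 18/02/2014, as stated in the Impact section.
OUTAGES = [
    ("18/02/2014 09:00", "18/02/2014 11:50"),
    ("18/02/2014 16:00", "18/02/2014 19:00"),
]


def outage_durations(windows, fmt="%d/%m/%Y %H:%M"):
    """Return the length of each (start, end) window as a timedelta."""
    return [
        datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
        for start, end in windows
    ]


durations = outage_durations(OUTAGES)
total = sum(durations, timedelta())
print(total)  # 5:50:00
```

The result, 5 hours 50 minutes, is consistent with the rounded figure of roughly six hours quoted in the summary table.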
Timeline of the Incident
When | What |
---|---|
18/02/2014 09:00 – 10:00 | [AL] Drain FTS3 and stop the MySQL service on mysql-fts3.gridpp.rl.ac.uk |
18/02/2014 10:00 | [TI] Begin migration of mysql-fts3 VM from clusR26prod01 to clusR89prod01 (lcg-hv24) |
18/02/2014 ~11:35 | Migration fails. VMM turns off the VM in clusR26prod01 and marks it as Failed. |
18/02/2014 11:40 | [TI] The VM is unmanageable in VMM. FCM shows the machine offline. Attempted to bring the machine online in FCM. |
18/02/2014 11:50 | [TI] FCM brings the machine online successfully. Informs Andrew that the VM is up and running. |
18/02/2014 ~12:00 | [AL] Turns on FTS3 service |
18/02/2014 12:30-14:00 | [TI] Following discussions between PT and AL, attempt to move mysql-fts01.gridpp.rl.ac.uk to lcghv-24. Migration of mysql-fts01 fails and leaves it in a similar state to mysql-fts3. Attempt migration of mysql-fts02 to another hypervisor in clusR89prod01. |
18/02/2014 14:30 | Migration of mysql-fts02 completes successfully |
18/02/2014 14:45-15:30 | Further discussions between PT and AL. Decided to |
18/02/2014 ~16:10 | AL reports that mysql-fts3 is unresponsive |
18/02/2014 16:20 | Nagios callout “Can’t connect to MySQL server mysql-fts3” |
18/02/2014 16:45 | Investigations reveal that the mysql-fts3 virtual machine and its disk image are missing from clusR26prod01 cluster. |
18/02/2014 17:15 | Following discussion between PT, AL and EG decided to |
18/02/2014 19:00 | [AL] Restarted FTS3 service with new DB |
Incident details
Logs from hv27, the Hyper-V server the machines were being moved from:
10:09:42 INFO : 'mysql-fts3.gridpp.rl.ac.uk' snapshot successfully. (Virtual machine ID 2F40AEA1-4F54-460D-8961-938334B54A08)
10:14:52 INFO : 'mysql-fts3.gridpp.rl.ac.uk' snapshot successfully. (Virtual machine ID 2F40AEA1-4F54-460D-8961-938334B54A08)
11:35:09 INFO : 'mysql-fts3.gridpp.rl.ac.uk' saved successfully. (Virtual machine ID 2F40AEA1-4F54-460D-8961-938334B54A08)
11:35:34 INFO : The Cluster service successfully brought the clustered service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' offline.
11:45:26 INFO : The Cluster service is attempting to bring the clustered service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' online.
11:51:00 INFO : 'SCVMM mysql-fts3.gridpp.rl.ac.uk' successfully started the virtual machine.
11:51:00 INFO : The Cluster service successfully brought the clustered service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' online.
11:51:55 INFO : The Cluster service successfully brought the clustered service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' offline.
11:55:32 INFO : The Cluster service is attempting to bring the clustered service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' online.
11:55:38 INFO : 'mysql-fts3.gridpp.rl.ac.uk' started successfully. (Virtual machine ID 2F40AEA1-4F54-460D-8961-938334B54A08)
11:55:38 INFO : 'SCVMM mysql-fts3.gridpp.rl.ac.uk' successfully started the virtual machine.
11:55:38 INFO : The Cluster service successfully brought the clustered service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' online.
15:59:12 INFO : 'mysql-fts3.gridpp.rl.ac.uk' saved successfully. (Virtual machine ID 2F40AEA1-4F54-460D-8961-938334B54A08)
15:59:25 INFO : The Cluster service successfully brought the clustered service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' offline.
15:59:26 INFO : The Cluster service successfully brought the clustered service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' online.
15:59:26 INFO : The Cluster service successfully brought the clustered service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' offline.
15:59:28 INFO : The Cluster service successfully brought the clustered service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' online.
15:59:28 INFO : 'SCVMM mysql-fts3.gridpp.rl.ac.uk Configuration' successfully registered the configuration for the virtual machine.
16:22:18 ERROR: 'mysql-fts3.gridpp.rl.ac.uk' failed to perform the operation. The virtual machine is not in a valid state to perform the operation. (Virtual machine ID 2F40AEA1-4F54-460D-8961-938334B54A08)
Logs from hv24, the Hyper-V server the machines were being moved to:
10:20:01 INFO : The Cluster service successfully brought the clustered service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' online.
10:20:01 INFO : The Cluster service is attempting to bring the clustered service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' online.
10:20:02 INFO : The Cluster service successfully brought the clustered service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' offline.
11:40:47 ERROR: Import failed. Unable to save the virtual machine under location 'C:\ClusterStorage\Volume1\mysql-fts3.gridpp.rl.ac.uk'. Error: One or more arguments are invalid (0x80070057)
11:41:45 ERROR: Import failed. Unable to save the virtual machine under location 'C:\ClusterStorage\Volume1\mysql-fts3.gridpp.rl.ac.uk'. Error: One or more arguments are invalid (0x80070057)
11:42:34 ERROR: Import failed. Unable to save the virtual machine under location 'C:\ClusterStorage\Volume1\mysql-fts3.gridpp.rl.ac.uk'. Error: One or more arguments are invalid (0x80070057)
11:43:58 ERROR: Import failed. Unable to save the virtual machine under location 'C:\ClusterStorage\Volume1\mysql-fts3.gridpp.rl.ac.uk'. Error: One or more arguments are invalid (0x80070057)
11:51:59 INFO : The Cluster service is attempting to bring the clustered service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' online.
11:51:59 ERROR: Cluster resource 'SCVMM mysql-fts3.gridpp.rl.ac.uk Configuration' in clustered service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' failed.
11:51:59 ERROR: 'SCVMM mysql-fts3.gridpp.rl.ac.uk Configuration' failed to register the virtual machine with the virtual machine management service.
11:52:00 ERROR: The Cluster service failed to bring clustered service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered service or application.
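When reviewing excerpts like the two above, it can help to pull out only the ERROR entries. The following is a minimal sketch, assuming each entry has the `HH:MM:SS LEVEL: message` shape seen in these excerpts; the names `LOG_LINE`, `parse_line` and `error_entries` are hypothetical helpers, not part of Hyper-V or SCVMM, and the sample lines are copied from the logs above.

```python
import re

# Matches entries shaped like "11:40:47 ERROR: message" or "10:09:42 INFO : message".
LOG_LINE = re.compile(r"^(\d{2}:\d{2}:\d{2}) (INFO|ERROR)\s*: ?(.*)$")


def parse_line(line):
    """Split one log entry into (time, level, message), or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groups() if match else None


def error_entries(lines):
    """Return the parsed entries whose level is ERROR, in their original order."""
    parsed = (parse_line(line) for line in lines)
    return [entry for entry in parsed if entry and entry[1] == "ERROR"]


# Sample entries copied from the hv27 and hv24 excerpts.
SAMPLE = [
    r"11:35:09 INFO : 'mysql-fts3.gridpp.rl.ac.uk' saved successfully. (Virtual machine ID 2F40AEA1-4F54-460D-8961-938334B54A08)",
    r"11:40:47 ERROR: Import failed. Unable to save the virtual machine under location 'C:\ClusterStorage\Volume1\mysql-fts3.gridpp.rl.ac.uk'. Error: One or more arguments are invalid (0x80070057)",
    r"16:22:18 ERROR: 'mysql-fts3.gridpp.rl.ac.uk' failed to perform the operation. The virtual machine is not in a valid state to perform the operation. (Virtual machine ID 2F40AEA1-4F54-460D-8961-938334B54A08)",
]

errors = error_entries(SAMPLE)
for time, level, message in errors:
    print(time, message)
```

Run against the full excerpts, a filter of this kind would surface the four failed imports on hv24 between 11:40 and 11:44 and the invalid-state error on hv27 at 16:22.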
Analysis
This section to include a breakdown of what happened. Include any related issues.
Follow Up
This is what we used to call future mitigation. Include specific points to be done. It is not necessary to use the table below, but may be easier to do so.
Issue | Response | Done |
---|---|---|
Issue 1 | Mitigation for issue 1. | Done yes/no |
Issue 2 | Mitigation for issue 2. | Done yes/no |
Related issues
List any related issue and provide links if possible. If there are none then remove this section.
Reported by: Your Name at date/time
Summary Table
Start Date | 18th February 2014 |
Impact | Select one of: >80%, >50%, >20%, <20% |
Duration of Outage | 2 x 3 hour outages, 6 hours in total |
Status | Draft |
Root Cause | Select one from Unknown, Software Bug, Hardware, Configuration Error, Human Error, Network, User Load |
Data Loss | Yes/No |