RAL Tier1 Incident 20140218 FTS3 server migration and Outage

From GridPP Wiki

RAL-LCG2 Incident 20140218 FTS3 server migration and Outage

Description

The RAL Tier 1 runs a test FTS3 service. The machines hosting this service are all virtual machines (VMs) running on the Microsoft Hyper-V platform. There are two machine rooms at RAL. For various reasons the VMs hosting the FTS3 service were all on a Hyper-V cluster in the Atlas machine room. Work was required on that machine room's infrastructure which would interrupt connectivity to the machines in it, so it was decided to migrate the VMs hosting the FTS3 service to an alternative Hyper-V cluster in the main R89 building.

The most critical part of this operation was moving the FTS3 MySQL server, as none of the agents could run while the MySQL server was unavailable. The plan was to shut down the FTS3 service and move the MySQL server first; the service could then be brought back up and the other machines drained and moved individually. The move of the FTS3 MySQL server did not complete successfully: the VM remained on the original cluster, with the Hyper-V management software reporting that it was in an unmanageable state. In spite of this, the machine was available and functional, so it was decided to turn the FTS3 service back on and continue migrating the other machines.

At approximately 16:00 the FTS3 MySQL server stopped responding, causing the entire FTS3 service to stop working. It was decided to build a new FTS3 MySQL server on a VM in an R89 Hyper-V cluster. The service was finally restored at approximately 19:00.

Impact

The FTS3 service was unavailable for two periods on 18 February 2014: the first from 09:00 to 11:50 and the second from 16:00 to 19:00. As a result of the second outage, any transfers that were 'in flight' at the time were lost.
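As a quick check of the arithmetic, the two windows quoted above can be summed directly. The sketch below is illustrative only: the names OUTAGES and window_length are assumptions made for this example, and the times are simply those given in this section.

```python
from datetime import datetime, timedelta

# Outage windows as quoted in the Impact section (18 February 2014).
OUTAGES = [("09:00", "11:50"), ("16:00", "19:00")]


def window_length(start: str, end: str) -> timedelta:
    """Return the length of an HH:MM to HH:MM window within a single day."""
    fmt = "%H:%M"
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


lengths = [window_length(start, end) for start, end in OUTAGES]
total = sum(lengths, timedelta())

for (start, end), length in zip(OUTAGES, lengths):
    print(f"{start} to {end}: {length}")
print(f"total: {total}")
```

With the times above this gives 2 hours 50 minutes for the first window and 3 hours for the second, a little under six hours in total.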

Timeline of the Incident

When What
18/02/2014 09:00 – 10:00 [AL] Drain FTS3 and stop the MySQL service on mysql-fts3.gridpp.rl.ac.uk
18/02/2014 10:00 [TI] Begin migration of mysql-fts3 VM from clusR26prod01 to clusR89prod01 (lcg-hv24)
18/02/2014 ~11:35 Migration fails. VMM turns off the VM in clusR26prod01 and marks it as Failed.
18/02/2014 11:40 [TI] The VM is unmanageable in VMM. FCM shows the machine offline. Attempted to bring the machine online in FCM.
18/02/2014 11:50 [TI] FCM brings the machine online successfully. Informs Andrew that the VM is up and running.
18/02/2014 ~12:00 [AL] Turns on FTS3 service
18/02/2014 12:30-14:00 [TI] Following discussions between PT and AL, attempt to move mysql-fts01.gridpp.rl.ac.uk to lcg-hv24. Migration of mysql-fts01 fails and leaves it in a similar state to mysql-fts3. Attempt migration of mysql-fts02 to another hypervisor in clusR89prod01.
18/02/2014 14:30 Migration of mysql-fts02 completes successfully
18/02/2014 14:45-15:30 Further discussions between PT and AL. Decided to:
  • Leave the VM in clusR26prod01.
  • Drain FTS3 and stop the MySQL service for the intervention on Wednesday morning
  • Discuss with VOs and schedule a move of the database after the intervention.
18/02/2014 ~16:10 AL reports that mysql-fts3 is unresponsive
18/02/2014 16:20 Nagios callout "Can't connect to MySQL server mysql-fts3"
18/02/2014 16:45 Investigations reveal that the mysql-fts3 virtual machine and its disk image are missing from the clusR26prod01 cluster.
18/02/2014 17:15 Following discussion between PT, AL and EG, decided to:
  • [AL] Create new VM for mysql-fts3 on clusR89prod01
  • [TI] Finish migration of remaining FTS3 frontend to clusR89prod01
  • [AL] Start FTS3 service on new DB
18/02/2014 19:00 [AL] Restarted FTS3 service with new DB
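
The Nagios callout at 16:20 reports a failed connection to the MySQL server. A minimal sketch of a comparable reachability probe follows. It is not the check that Nagios actually ran: it only tests whether a TCP connection can be opened, and the function name can_connect, the use of the default MySQL port 3306 and the timeout value are all assumptions made for illustration.

```python
import socket


def can_connect(host: str, port: int = 3306, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


# Hypothetical invocation against the database host named in the timeline:
#   can_connect("mysql-fts3.gridpp.rl.ac.uk")
```

A probe of this kind only shows that something is listening on the port. It would not have distinguished a healthy database from one that accepts connections but cannot serve queries.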

Incident details

Logs from hv27, the Hyper-V server the machines were being moved from.

10:09:42 INFO : 'mysql-fts3.gridpp.rl.ac.uk' snapshot successfully.
(Virtual machine ID 2F40AEA1-4F54-460D-8961-938334B54A08)
10:14:52 INFO : 'mysql-fts3.gridpp.rl.ac.uk' snapshot successfully.
(Virtual machine ID 2F40AEA1-4F54-460D-8961-938334B54A08)
11:35:09 INFO : 'mysql-fts3.gridpp.rl.ac.uk' saved successfully.
(Virtual machine ID 2F40AEA1-4F54-460D-8961-938334B54A08)
11:35:34 INFO : The Cluster service successfully brought the clustered
service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' offline.
11:45:26 INFO : The Cluster service is attempting to bring the clustered
service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' online.
11:51:00 INFO : 'SCVMM mysql-fts3.gridpp.rl.ac.uk' successfully started
the virtual machine.
11:51:00 INFO : The Cluster service successfully brought the clustered
service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' online.
11:51:55 INFO : The Cluster service successfully brought the clustered
service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' offline.
11:55:32 INFO : The Cluster service is attempting to bring the clustered
service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' online.
11:55:38 INFO : 'mysql-fts3.gridpp.rl.ac.uk' started successfully.
(Virtual machine ID 2F40AEA1-4F54-460D-8961-938334B54A08)
11:55:38 INFO : 'SCVMM mysql-fts3.gridpp.rl.ac.uk' successfully started
the virtual machine.
11:55:38 INFO : The Cluster service successfully brought the clustered
service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' online.
15:59:12 INFO : 'mysql-fts3.gridpp.rl.ac.uk' saved successfully.
(Virtual machine ID 2F40AEA1-4F54-460D-8961-938334B54A08)
15:59:25 INFO : The Cluster service successfully brought the clustered
service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' offline.
15:59:26 INFO : The Cluster service successfully brought the clustered
service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' online.
15:59:26 INFO : The Cluster service successfully brought the clustered
service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' offline.
15:59:28 INFO : The Cluster service successfully brought the clustered
service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' online.
15:59:28 INFO : 'SCVMM mysql-fts3.gridpp.rl.ac.uk Configuration'
successfully registered the configuration for the virtual machine.
16:22:18 ERROR: 'mysql-fts3.gridpp.rl.ac.uk' failed to perform the
operation. The virtual machine is not in a valid state to perform the
operation. (Virtual machine ID 2F40AEA1-4F54-460D-8961-938334B54A08)


Logs from hv24, the Hyper-V server the machines were being moved to.

10:20:01 INFO : The Cluster service successfully brought the clustered
service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' online.
10:20:01 INFO : The Cluster service is attempting to bring the clustered
service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' online.
10:20:02 INFO : The Cluster service successfully brought the clustered
service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' offline.
11:40:47 ERROR: Import failed. Unable to save the virtual machine under
location 'C:\ClusterStorage\Volume1\mysql-fts3.gridpp.rl.ac.uk'. Error:
One or more arguments are invalid (0x80070057)
11:41:45 ERROR: Import failed. Unable to save the virtual machine under
location 'C:\ClusterStorage\Volume1\mysql-fts3.gridpp.rl.ac.uk'. Error:
One or more arguments are invalid (0x80070057)
11:42:34 ERROR: Import failed. Unable to save the virtual machine under
location 'C:\ClusterStorage\Volume1\mysql-fts3.gridpp.rl.ac.uk'. Error:
One or more arguments are invalid (0x80070057)
11:43:58 ERROR: Import failed. Unable to save the virtual machine under
location 'C:\ClusterStorage\Volume1\mysql-fts3.gridpp.rl.ac.uk'. Error:
One or more arguments are invalid (0x80070057)
11:51:59 INFO : The Cluster service is attempting to bring the clustered
service or application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' online.
11:51:59 ERROR: Cluster resource 'SCVMM mysql-fts3.gridpp.rl.ac.uk
Configuration' in clustered service or application 'SCVMM
mysql-fts3.gridpp.rl.ac.uk Resources' failed.
11:51:59 ERROR: 'SCVMM mysql-fts3.gridpp.rl.ac.uk Configuration' failed
to register the virtual machine with the virtual machine management service.
11:52:00 ERROR: The Cluster service failed to bring clustered service or
application 'SCVMM mysql-fts3.gridpp.rl.ac.uk Resources' completely
online or offline. One or more resources may be in a failed state. This
may impact the availability of the clustered service or application.
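
Both log excerpts above share a simple layout, so the errors can be pulled out mechanically when comparing the two hosts. The sketch below assumes a line format inferred from the excerpts ("HH:MM:SS LEVEL: message", with long messages wrapped onto following lines that carry no timestamp). The names ENTRY_START, parse_entries and errors_only are hypothetical, and SAMPLE is a short extract of the lines quoted above.

```python
import re
from typing import Dict, List

# Assumed layout, inferred from the excerpts: "HH:MM:SS LEVEL: message".
# INFO is padded with a space before the colon; ERROR is not.
ENTRY_START = re.compile(r"^(\d{2}:\d{2}:\d{2}) (INFO|ERROR)\s*: ?(.*)$")


def parse_entries(text: str) -> List[Dict[str, str]]:
    """Group wrapped lines into one entry per timestamped log line."""
    entries: List[Dict[str, str]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = ENTRY_START.match(line)
        if match:
            time, level, message = match.groups()
            entries.append({"time": time, "level": level, "message": message})
        elif entries:
            # No timestamp: continuation of the previous entry's message.
            entries[-1]["message"] += " " + line
    return entries


def errors_only(entries: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the entries logged at ERROR level."""
    return [entry for entry in entries if entry["level"] == "ERROR"]


SAMPLE = """\
11:35:09 INFO : 'mysql-fts3.gridpp.rl.ac.uk' saved successfully.
(Virtual machine ID 2F40AEA1-4F54-460D-8961-938334B54A08)
11:40:47 ERROR: Import failed. Unable to save the virtual machine under
location 'C:\\ClusterStorage\\Volume1\\mysql-fts3.gridpp.rl.ac.uk'. Error:
One or more arguments are invalid (0x80070057)
"""

for entry in errors_only(parse_entries(SAMPLE)):
    print(entry["time"], entry["message"])
```

Applied to the full excerpts, a filter like this would show the destination host logging import errors from 11:40:47 onwards, while the source host logs only one error, at 16:22:18.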

Analysis

This section to include a breakdown of what happened. Include any related issues.


Follow Up

This is what we used to call future mitigation. Include specific points to be done. It is not necessary to use the table below, but may be easier to do so.


Issue Response Done
Issue 1 Mitigation for issue 1. Done yes/no
Issue 2 Mitigation for issue 2. Done yes/no

Related issues

List any related issue and provide links if possible. If there are none then remove this section.


Reported by: Your Name at date/time

Summary Table

Start Date 18th February 2014
Impact Select one of: >80%, >50%, >20%, <20%
Duration of Outage 2 x 3-hour outages, 6 hours in total
Status Draft
Root Cause Select one from Unknown, Software Bug, Hardware, Configuration Error, Human Error, Network, User Load
Data Loss Yes/No