RAL Tier1 Incident 20100720 Database intervention

Status: Closed

Intervention on SAN serving Ogma database caused outage of services

Site: RAL-LCG2

Incident Date: 2010-07-20

Severity: Field not defined yet

Service: Atlas 3D, LHCb 3D, FTS and LFC

Impacted: ATLAS, LHCb, GRIDPP sites (UK)

Incident Summary: A planned intervention on the SAN infrastructure underlying the Ogma, Lugh and Somnus databases caused two of the databases to fail. Rolling back the intervention required a downtime of about half an hour for all three databases.

Type of Impact: The Atlas 3D and LHCb 3D services, and the FTS and LFC services for all VOs, were unavailable.

Incident duration: 1/2 hour.

Report date: 26th July 2010

Reported by: John Kelly, Carmine Cioffi

Related URLs: Original change control ticket at https://helpdesk.gridpp.rl.ac.uk/Ticket/Display.html?id=62088

Incident details:

Downtime scheduled to start in GOCDB at: 2010-07-20, 11:00:00

Work started on one node in Ogma. The multipath config file was changed and the node rebooted. After the reboot, both the Ogma and Lugh databases had problems: streaming stopped on Ogma (Atlas3D), and Lugh (LHCb3D) stopped altogether.

It was decided not to continue with the intervention on the other nodes and to revert the changed node to its previous configuration. It was also decided that the safest way to do this was to stop all three databases: Ogma, Lugh and Somnus.

This resulted in a half-hour downtime for Ogma, Lugh and Somnus and for the services they support, i.e. FTS, LFC and the 3D databases.

Afterwards, Atlas 3D streaming did not restart. This is believed to be a separate issue, unrelated to this intervention. An Oracle Service Request was open on this. Streaming was restarted by CERN.

Future mitigation:

The instabilities in the SAN have been noted and taken into account during interventions. A planned migration to a completely new database infrastructure is underway. This will remove these instabilities.

Related issues:

None

Timeline


Event	Date	Time	Comment
Downtime in GOCDB	20 July 2010	11:00:00	Start time of GOCDB downtime
Ogma1: Streaming process stopped	20 July 2010	11:16:52
Ogma1: Backup archived log	20 July 2010	11:25:36
Ogma1: Shutdown Oracle instance	20 July 2010	11:26:10
Ogma1: Multipath restart	20 July 2010		Done by Martin
Ogma2/3: offline disk ARRAY4_ATLAS3D_ASM_3	20 July 2010	11:44:12	Because of the multipath reboot, ASM on Ogma2/3 can't see the ARRAY4_ATLAS3D_ASM_3 partition
Ogma1: Startup Oracle instance	20 July 2010	11:44:17	Oracle instance came up with no problems because ASM still held a full copy of the database
Streaming process started	20 July 2010	11:46:26
Ogma1: Stopped ASM rebalancing	20 July 2010	12:14:42	Multipath was still generating errors, and Martin wanted to reboot the node to see whether this would fix the problem
Shutdown Oracle instance	20 July 2010	12:21:03
Ogma1: Shutdown ASM instance	20 July 2010	12:21:57
Lcgdb10 reboot	20 July 2010		Done by Martin Bly
Ogma2/3: offline disk ARRAY4_ATLAS3D_ASM_2	20 July 2010	12:27:28	The multipath reboot caused more problems on Ogma; this time ASM can't see the ARRAY4_ATLAS3D_ASM_2 partition
Lugh2: IO errors on ASM	20 July 2010	12:27:43	The multipath reboot caused further disruption on Lugh: ASM can't see most of the disk array partitions. Lugh crashed.
Lugh2: offline ASM-disk ARRAY3_SPARE_ASM and ARRAY4_SPARE_ASM	20 July 2010	12:27:43
Lugh1: offline ARRAY4_SPARE_ASM	20 July 2010	12:31:15
Lugh1: offline ARRAY3_SPARE_ASM	20 July 2010	12:57:16
Lugh1: ASM restarted	20 July 2010	13:06:29
Lugh1: ASM: ERROR: too many offline disks in PST (grp 1)	20 July 2010	13:06:35	When ASM offlines a partition, as a further safeguard we rename the OS partition mount point so that ASM can't mount it after an OS reboot. Renaming the mount point back fixed the error.
Lugh1: ASM issued: alter diskgroup lugh_data mount	20 July 2010	13:14:04	Fortunately Lugh crashed cleanly and could be brought up without any problem. ASM just couldn't see one partition, which had to be synchronized later on.
Lugh1: ASM mounted three ASM-disks (all except ARRAY3_SPARE_ASM)	20 July 2010	13:14:08
Lugh2: ASM restarted	20 July 2010	13:16:34
Lugh: ASM start rebalancing	20 July 2010	13:58:38
First mail sent to experiments	20 July 2010	13:58:38	Mail sent to experiments and dashboard updated.
Lugh: ASM rebalance complete	20 July 2010	15:19:58
Ogma, Lugh, Somnus shutdown	20 July 2010	15:47	To roll back the multipath changes on Ogma1 and avoid further disruption, Ogma, Lugh and Somnus were shut down. This way a multipath reboot could not have any side effects.
Ogma, Lugh, Somnus startup	20 July 2010	16:03	Multipath configuration changes have been rolled back.
Mail sent to experiments	20 July 2010	17:15:00	Mail sent and dashboard updated stating problems with restarting 3D streaming.
Meeting to discuss this.	21 July 2010	11:00:00	We formally decided to abandon the planned interventions on the Lugh and Somnus databases. Downtimes cancelled in GOCDB.
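
The intervals between timeline entries can be worked out directly from the timestamps recorded above. The following is a minimal sketch in Python, not part of the original report: the event labels and helper names are hypothetical, and the two rows that give only hours and minutes (15:47 and 16:03) are assumed to have zero seconds.

```python
from datetime import datetime

# Timestamps copied from the timeline table (all on 20 July 2010).
# The dictionary keys are hypothetical labels chosen for this sketch.
EVENTS = {
    "gocdb_downtime_start": "11:00:00",
    "ogma1_streaming_stopped": "11:16:52",
    "lugh2_io_errors": "12:27:43",
    "lugh1_diskgroup_mount": "13:14:04",
    "all_db_shutdown": "15:47:00",  # table gives 15:47; seconds assumed zero
    "all_db_startup": "16:03:00",   # table gives 16:03; seconds assumed zero
}


def parse(label):
    """Return the datetime of a labelled timeline event."""
    return datetime.strptime("2010-07-20 " + EVENTS[label], "%Y-%m-%d %H:%M:%S")


def elapsed_minutes(start, end):
    """Minutes elapsed between two labelled timeline events."""
    return (parse(end) - parse(start)).total_seconds() / 60


if __name__ == "__main__":
    pairs = [
        ("all_db_shutdown", "all_db_startup"),
        ("lugh2_io_errors", "lugh1_diskgroup_mount"),
        ("gocdb_downtime_start", "all_db_startup"),
    ]
    for start, end in pairs:
        print(f"{start} -> {end}: {elapsed_minutes(start, end):.2f} min")
```

The two rollback rows are recorded only to the minute, so the interval computed between them is approximate.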
