RAL Tier1 Incident 20100720 Database intervention


Status: Closed

Intervention on SAN serving Ogma database caused outage of services

Site: RAL-LCG2

Incident Date: 2010-07-20

Severity: Field not defined yet

Service: Atlas 3D, LHCb 3D, FTS and LFC

Impacted: ATLAS, LHCb, GridPP sites (UK)

Incident Summary: A planned intervention on the SAN infrastructure underlying the Ogma, Lugh and Somnus databases caused two of the databases to fail. Rolling back the intervention required a downtime of about half an hour for all three databases.

Type of Impact: The Atlas 3D and LHCb 3D services, and the FTS and LFC services for all VOs, were unavailable.

Incident duration: 1/2 hour.

Report date: 26th July 2010

Reported by: John Kelly, Carmine Cioffi

Related URLs: Original change control ticket at https://helpdesk.gridpp.rl.ac.uk/Ticket/Display.html?id=62088

Incident details:

Downtime scheduled to start in GOCDB at: 2010-07-20, 11:00:00

Work started on one node in Ogma. The multipath config file was changed and the node rebooted. After the reboot, the Ogma and Lugh databases had problems: streaming stopped on Ogma (Atlas 3D), and Lugh (LHCb 3D) stopped entirely.

The decision was made to not continue with the intervention on the other nodes and to revert to the pre-existing config on the node that had been changed. It was also decided that the safest method of doing this was to stop all 3 databases Ogma, Lugh and Somnus.

This resulted in a half-hour downtime for Ogma, Lugh and Somnus and the services they support, i.e. FTS, LFC and the 3D databases.

Afterwards, Atlas 3D streaming did not restart. This is believed to be a separate issue not related to this intervention. There was an Oracle Service Request open on this. Streaming was restarted by CERN.


Future mitigation:

The instabilities in the SAN have been noted and taken into account during interventions. A planned migration to a completely new database infrastructure is underway. This will remove these instabilities.

Related issues:

None

Timeline

| Event | Date | Time | Comment |
|---|---|---|---|
| Downtime in GOCDB | 20 July 2010 | 11:00:00 | Start time of GOCDB downtime |
| Ogma1: Streaming process stopped | 20 July 2010 | 11:16:52 | |
| Ogma1: Backup archived log | 20 July 2010 | 11:25:36 | |
| Ogma1: Shutdown Oracle instance | 20 July 2010 | 11:26:10 | |
| Ogma1: Multipath restart | 20 July 2010 | | Done by Martin |
| Ogma2/3: offline disk ARRAY4_ATLAS3D_ASM_3 | 20 July 2010 | 11:44:12 | Because of the multipath reboot, ASM on Ogma2/3 can't see the ARRAY4_ATLAS3D_ASM_3 partition |
| Ogma1: Startup Oracle instance | 20 July 2010 | 11:44:17 | Oracle instance came up with no problems because ASM still held a full copy of the database |
| Streaming process started | 20 July 2010 | 11:46:26 | |
| Ogma1: Stopped ASM rebalancing | 20 July 2010 | 12:14:42 | Multipath is still generating errors and Martin wants to reboot the node to see if this would fix the problem |
| Shutdown Oracle instance | 20 July 2010 | 12:21:03 | |
| Ogma1: Shutdown ASM instance | 20 July 2010 | 12:21:57 | |
| Lcgdb10 reboot | 20 July 2010 | | Done by Martin Bly |
| Ogma2/3: offline disk ARRAY4_ATLAS3D_ASM_2 | 20 July 2010 | 12:27:28 | The multipath reboot caused more problems on Ogma; this time ASM can't see the ARRAY4_ATLAS3D_ASM_2 partition |
| Lugh2: IO errors on ASM | 20 July 2010 | 12:27:43 | The multipath reboot caused further disruption on Lugh, and ASM can't see most of the disk array partitions. Lugh crashed. |
| Lugh2: offline ASM-disk ARRAY3_SPARE_ASM and ARRAY4_SPARE_ASM | 20 July 2010 | 12:27:43 | |
| Lugh1: offline ARRAY4_SPARE_ASM | 20 July 2010 | 12:31:15 | |
| Lugh1: offline ARRAY3_SPARE_ASM | 20 July 2010 | 12:57:16 | |
| Lugh1: ASM restarted | 20 July 2010 | 13:06:29 | |
| Lugh1: ASM ; ERROR: too many offline disks in PST (grp 1) | 20 July 2010 | 13:06:35 | When ASM offlines a partition, as an extra precaution we rename the OS partition mount point so that ASM can't mount it after an OS reboot. Renaming the mount point back fixed the error. |
| Lugh1: ASM issued: alter diskgroup lugh_data mount | 20 July 2010 | 13:14:04 | Luckily, Lugh crashed cleanly and could be brought up without any problem. ASM just couldn't see one partition, which had to be synchronized later. |
| Lugh1: ASM mount three ASM-disks except ARRAY3_SPARE_ASM | 20 July 2010 | 13:14:08 | |
| Lugh2: ASM restarted | 20 July 2010 | 13:16:34 | |
| Lugh: ASM start rebalancing | 20 July 2010 | 13:58:38 | |
| First mail sent to experiments | 20 July 2010 | 13:58:38 | Mail sent to experiments and dashboard updated. |
| Lugh: ASM rebalance complete | 20 July 2010 | 15:19:58 | |
| Ogma, Lugh, Somnus shutdown | 20 July 2010 | 15:47 | To roll back the multipath changes on Ogma1 and avoid further disruption, Ogma, Lugh and Somnus were shut down, so that a multipath reboot could not have any side effects. |
| Ogma, Lugh, Somnus startup | 20 July 2010 | 16:03 | Multipath configuration changes have been rolled back. |
| Mail sent to experiments | 20 July 2010 | 17:15:00 | Mail sent and dashboard updated stating problems with restarting 3D streaming. |
| Meeting to discuss this. | 21 July 2010 | 11:00:00 | We formally decided to abandon the planned interventions on the Lugh and Somnus databases. Downtimes cancelled in GOCDB. |
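
The intervals between timeline entries can be checked with a short script. This is only an illustrative sketch: the timestamps are copied from the timeline above, the interval labels and the `elapsed` helper are our own naming, and the 15:47 and 16:03 entries are assumed to fall on the whole minute because the timeline gives no seconds for them.

```python
from datetime import datetime, timedelta

# Timestamp layout used for the entries below (ISO-style, locale independent).
FMT = "%Y-%m-%d %H:%M:%S"


def elapsed(start: str, end: str) -> timedelta:
    """Return the time between two timeline entries."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


# Labels are illustrative; times come from the timeline.
# Seconds for 15:47 and 16:03 are assumed to be :00.
intervals = {
    "Rollback shutdown of Ogma, Lugh and Somnus": (
        "2010-07-20 15:47:00", "2010-07-20 16:03:00"),
    "Lugh2 IO errors to Lugh1 diskgroup mount": (
        "2010-07-20 12:27:43", "2010-07-20 13:14:04"),
    "Lugh ASM rebalance": (
        "2010-07-20 13:58:38", "2010-07-20 15:19:58"),
}

for label, (start, end) in intervals.items():
    print(f"{label}: {elapsed(start, end)}")
```

Under these assumptions the rollback shutdown lasted 16 minutes, which is consistent with the "about half an hour" quoted in the summary once the time needed to stop and start the services is allowed for.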