RAL Tier1 Incident 20100720 Database intervention
Status: Closed
Intervention on SAN serving Ogma database caused outage of services
Site: RAL-LCG2
Incident Date: 2010-07-20
Severity: Field not defined yet
Service: Atlas 3D, LHCb 3D, FTS and LFC
Impacted: ATLAS, LHCb, GRIDPP sites (UK)
Incident Summary: A planned intervention on the SAN infrastructure underlying the Ogma, Lugh and Somnus databases caused two of the databases to fail. Rolling back the intervention required a downtime of about half an hour for all 3 databases.
Type of Impact: Atlas 3D, LHCb 3D and the FTS and LFC services for all VOs were unavailable.
Incident duration: 1/2 hour.
Report date: 26th July 2010
Reported by: John Kelly, Carmine Cioffi
Related URLs: Original change control ticket at https://helpdesk.gridpp.rl.ac.uk/Ticket/Display.html?id=62088
Incident details:
Downtime scheduled to start in GOCDB at: 2010-07-20, 11:00:00
Work started on one node in Ogma. The multipath config file was changed and the node rebooted. After the reboot, the Ogma and Lugh databases had problems: streaming stopped on Ogma (Atlas3D), and Lugh (LHCb3D) stopped altogether.
The decision was made not to continue with the intervention on the other nodes and to revert to the pre-existing config on the node that had been changed. It was also decided that the safest way of doing this was to stop all 3 databases: Ogma, Lugh and Somnus.
This resulted in a 1/2 hour downtime for Ogma, Lugh and Somnus and the services supported by them, i.e. FTS, LFC and the 3D databases.
Afterwards there was a further problem: Atlas 3D streaming did not restart. This is believed to be a separate issue, not related to this intervention. An Oracle Service Request was open on this. Streaming was restarted by CERN.
Future mitigation:
The instabilities in the SAN have been noted and will be taken into account during future interventions. A planned migration to a completely new database infrastructure is underway, which will remove these instabilities.
Related issues:
None
Timeline
Event | Date | Time | Comment |
---|---|---|---|
Downtime in GOCDB | 20 July 2010 | 11:00:00 | Start time of GOCDB downtime |
Ogma1: Streaming process stopped | 20 July 2010 | 11:16:52 | |
Ogma1: Backup archived log | 20 July 2010 | 11:25:36 | |
Ogma1: Shutdown Oracle instance | 20 July 2010 | 11:26:10 | |
Ogma1: Multipath restart | 20 July 2010 | | Done by Martin |
Ogma2/3: offline disk ARRAY4_ATLAS3D_ASM_3 | 20 July 2010 | 11:44:12 | Because of the multipath restart, ASM on Ogma2/3 can no longer see the ARRAY4_ATLAS3D_ASM_3 partition |
Ogma1: Startup Oracle instance | 20 July 2010 | 11:44:17 | Oracle instance came up with no problems because ASM still held a full copy of the database |
Streaming process started | 20 July 2010 | 11:46:26 | |
Ogma1: Stopped ASM rebalancing | 20 July 2010 | 12:14:42 | Multipath is still generating errors; Martin wants to reboot the node to see whether this fixes the problem |
Shutdown Oracle instance | 20 July 2010 | 12:21:03 | |
Ogma1: Shutdown ASM instance | 20 July 2010 | 12:21:57 | |
Lcgdb10 reboot | 20 July 2010 | | Done by Martin Bly |
Ogma2/3: offline disk ARRAY4_ATLAS3D_ASM_2 | 20 July 2010 | 12:27:28 | The multipath reboot caused more problems on Ogma; this time ASM can't see the partition ARRAY4_ATLAS3D_ASM_2 |
Lugh2: IO errors on ASM | 20 July 2010 | 12:27:43 | The multipath reboot caused further disruption on Lugh: ASM can't see most of the disk array partitions. Lugh crashed. |
Lugh2: offline ASM-disk ARRAY3_SPARE_ASM and ARRAY4_SPARE_ASM | 20 July 2010 | 12:27:43 | |
Lugh1: offline ARRAY4_SPARE_ASM | 20 July 2010 | 12:31:15 | |
Lugh1: offline ARRAY3_SPARE_ASM | 20 July 2010 | 12:57:16 | |
Lugh1: ASM restarted | 20 July 2010 | 13:06:29 | |
Lugh1: ASM: ERROR: too many offline disks in PST (grp 1) | 20 July 2010 | 13:06:35 | When ASM takes a partition offline, as an extra safeguard we rename the OS partition mount point so that ASM can't mount it after an OS reboot. Renaming the mount point back fixed the error. |
Lugh1: ASM issued: alter diskgroup lugh_data mount | 20 July 2010 | 13:14:04 | Luckily, Lugh crashed cleanly and could be brought up without any problem. ASM just couldn’t see one partition, which had to be synchronized later on. |
Lugh1: ASM mounted three ASM-disks (all except ARRAY3_SPARE_ASM) | 20 July 2010 | 13:14:08 | |
Lugh2: ASM restarted | 20 July 2010 | 13:16:34 | |
Lugh: ASM start rebalancing | 20 July 2010 | 13:58:38 | |
First mail sent to experiments | 20 July 2010 | 13:58:38 | Mail sent to experiments and dashboard updated. |
Lugh: ASM rebalance complete | 20 July 2010 | 15:19:58 | |
Ogma, Lugh, Somnus shutdown | 20 July 2010 | 15:47 | To roll back the multipath changes on Ogma1 and avoid further disruption, Ogma, Lugh and Somnus were shut down, so that a multipath reboot could not have any side effects. |
Ogma, Lugh, Somnus startup | 20 July 2010 | 16:03 | Multipath configuration changes have been rolled back. |
Mail sent to experiments | 20 July 2010 | 17:15:00 | Mail sent and dashboard updated stating problems with restarting 3D streaming. |
Meeting to discuss this. | 21 July 2010 | 11:00:00 | We formally decided to abandon the planned interventions on the Lugh and Somnus databases. Downtimes cancelled in GOCDB. |
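
The intervals between key timeline entries can be worked out directly from the logged times. The following is a minimal sketch in Python, not part of the original report: the dictionary labels are illustrative names chosen here, and the times are copied from the 20 July 2010 rows above (the 15:47 and 16:03 entries are logged without seconds and are assumed to be on the minute). It measures only the gaps between logged events; the table does not record when dependent services such as FTS and LFC were fully usable again.

```python
from datetime import datetime, timedelta

# Times copied from the timeline rows for 20 July 2010.
# The keys are illustrative labels, not names used in the report.
EVENTS = {
    "lugh_io_errors": "12:27:43",
    "lugh_diskgroup_mount": "13:14:04",
    "lugh_rebalance_start": "13:58:38",
    "lugh_rebalance_complete": "15:19:58",
    "all_db_shutdown": "15:47:00",  # logged as 15:47, seconds assumed
    "all_db_startup": "16:03:00",   # logged as 16:03, seconds assumed
}


def elapsed(start_label: str, end_label: str) -> timedelta:
    """Return the time between two same-day timeline entries."""
    fmt = "%H:%M:%S"
    start = datetime.strptime(EVENTS[start_label], fmt)
    end = datetime.strptime(EVENTS[end_label], fmt)
    return end - start


if __name__ == "__main__":
    pairs = [
        ("lugh_io_errors", "lugh_diskgroup_mount"),
        ("lugh_rebalance_start", "lugh_rebalance_complete"),
        ("all_db_shutdown", "all_db_startup"),
    ]
    for start_label, end_label in pairs:
        print(f"{start_label} -> {end_label}: {elapsed(start_label, end_label)}")
```

Run as a script, this prints one interval per pair, for example the gap between the shutdown and startup entries for the three databases.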