RAL Tier1 Incident 20120613 Oracle11 Update Failure

RAL Tier1 Incident 13th June 2012: Oracle11 Update Failure

Description

The scheduled update of the SOMNUS Oracle RAC to Oracle 11 failed. This Oracle RAC holds the databases for the RAL FTS service and the LFC service used by a number of non-LHC VOs (ILC, T2K.org, MICE, MINOS, SNO). The reversion plan of removing the software update also failed. In order to recover services, a clean FTS database was set up on another database system. The non-LHC VOs' LFC suffered an extended outage until the Oracle software was re-installed and the data exported and re-imported.

Impact

The FTS service was restarted within the scheduled downtime. However, this was achieved by creating a new FTS database. This caused the list of queued transfers to be lost, which led to significant additional work for at least one of the VOs (Atlas) using the service.

The LFC service for the non-LHC VOs was fully restored. However, the outage was of considerable length, with this service unavailable for around 48 hours in total.

Timeline of the Incident

When What
13th June 08:00 Start of scheduled outage in GOC DB for LFC & FTS services. (Until 17:00)
13th June 08:23 Started database backup
13th June 09:01 Backup completed
13th June 09:28 Services (FTS, LFC) confirmed to be down; started the backup of the last changes
13th June 09:30 Backup completed
13th June 09:32 Started the upgrade process
13th June 10:47 Pre-upgrade checks done successfully, started to run the upgrade application
13th June 11:20 Grid Infrastructure installed and running successfully; ASM upgrade failed (ASM failed to start up so it could not be upgraded)
13th June 11:37 Fixed the problem with ASM and retried the ASM upgrade
13th June 11:46 ASM started but the disk group was not mounted
13th June ~12:00 ASM problem was fixed, but the upgrade could not be re-run because the Oracle installer GUI had been killed by a network glitch. All the PuTTY sessions to the host also went down. At this point there was no way to restart the upgrade from where it had stopped. Rich decided we should roll back, given the amount of time elapsed and the lack of an obvious fix.
13th June 12:10 Network glitch evident in the sshd log on lcgdb05:
Jun 13 12:11:24 lcgdb05 sshd[27168]: Read error from remote host ::ffff:130.246.76.250: Connection reset by peer

Followed immediately by a reconnect:

Jun 13 12:11:38 lcgdb05 sshd[29707]: Connection from ::ffff:130.246.76.250 port 4745
Jun 13 12:12:22 lcgdb05 sshd[29708]: Postponed publickey for oracle from ::ffff:130.246.76.250 port 4745 ssh2
Jun 13 12:12:22 lcgdb05 sshd[29707]: Accepted publickey for oracle from ::ffff:130.246.76.250 port 4745 ssh2 

There is no record of any port drop on the Somnus stack since March. This includes the uplinks to the S4810. The S4810 logs do not go back as far as the period of interest.

None of the system logs (/var/log/messages.1) on lcgdb0[567] indicate any PHY layer connection drop other than those consistent and contiguous with system reboots.

13th June 12:20 Could not run the de-installer because this would wipe out the database data, so the manual roll-back process was started
13th June 13:07 Gareth announced we are rolling back. Gave estimate of a couple of hours to do this.
13th June 14:45 After replacing the OCR and VD and making all the configuration changes, the 10g clusterware would not start. No logging information was generated, so there was no way to debug the problem. A decision was taken to restore Somnus from backup elsewhere in order to validate the backup of the LFC.
13th June 15:00 Started copying backup files over to lcgdb02 (one of the new CASTOR preprod machines)
13th June 15:02 Extended the outage for the LFC in the GOC DB until 17:00 the next day.
13th June 15:05 Started to install Oracle 10.2.0.5 on lcgdb02
13th June 15:17 Note from ticket: FTS: Creating a fresh database (on Maia I think). Andrew L. running scripts to create that and then point FTS at that. This is a production quality service.
13th June 16:20 copy of Somnus backup files to new host completed
13th June 16:28 FTS service up and running on Maia.
13th June 16:30 Applied kernel updates to systems in SOMNUS RAC
13th June 17:00 End of scheduled Outage on FTS & LFC
13th June 18:00 Oracle 10.2.0.5 installation completed on lcgdb02
13th June 18:30 started to restore backup on lcgdb02
14th June 08:20 Backup restored on lcgdb02
14th June 08:30 Started to export LFC from restored backup on lcgdb02
14th June 09:32 export completed and copied over to Ogma1
14th June 10:47 LFC was imported successfully into the database. The Somnus DB could now be deleted
14th June 11:05 Started installation of Oracle 11.2.0.3 on Somnus nodes
14th June ~14:00 Oracle 11.2.0.3 installed, starting to apply latest patches
14th June 14:35 Extend outage in GOC DB for LFC until next morning (15th).
14th June 17:30 Oracle installation ready; started to import the LFC data into it
14th June 18:12 Import completed. Started to run checks on the RAC to make sure the installation was OK.
15th June ~08:30 Catalin started the LFC front end and made a few checks successfully.
15th June ~09:12 LFC Outage in GOC DB ended.

Incident details

After the pre-upgrade checks, which are done to make sure the current system is OK for the upgrade, the Oracle installer GUI was used to install the software and perform the upgrade. The upgrade process failed around half way through, at the stage of upgrading the ASM. Fortunately the Oracle GUI allows the failing step to be retried, so it was possible to fix the cause of the failure and re-run this step. However, while this was being done, the Oracle GUI and the local PuTTY connection to the host were lost (most likely due to a short network problem). At this point it was not possible to complete the upgrade.

Attempts were made to roll back the changes manually by re-instating all the configuration files and binaries as they were before the upgrade. However, this was not enough to completely recover the software, and the old Oracle 10 software version would not start. No logging information was generated, so there was no way to identify the cause of the problem.

At this stage it was decided that the only possible action was to clean up everything including the old Oracle version and completely re-install using Oracle 11. It was decided not to call Oracle support to assist as it was expected that investigating and fixing the software would lead to a longer outage than simply re-installing.

In order to be able to restore the FTS service quickly a decision was taken to create a new, empty, FTS database on a separate Oracle instance which was available with sufficient capacity. This was successfully carried out. A new, clean FTS database was created, the FTS front ends and agent systems updated to use the FTS in this new location and the services restarted. The FTS service was restored within the original scheduled window for the Outage. However, the list of queued file transfers was lost.

Owing to the critical nature of the LFC data, a different approach was used for this service. Before installing Oracle 11 on the SOMNUS RAC, and thereby destroying the known good database, it was necessary to validate the backups on a working Oracle 10 system. However, none was available and one had to be set up elsewhere, which took considerable time. The backups were then copied off the failed system (SOMNUS). Kernel updates were then applied to the original (Somnus) system in order to prepare it for when the Oracle software would be re-installed.

The LFC service is used by five VOs (ILC, T2K.org, MICE, MINOS, SNO) and relevant representatives were notified of the extended outage to this service.

Once the backups had been verified the Oracle software on the original SOMNUS RAC was re-installed with Oracle version 11. The opportunity was taken while the service was down to apply the latest Oracle patches, so as to remove the requirement for a further service break later. The database was re-imported and the LFC service was re-established.
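
The report does not record the exact export tool or options used to move the LFC data. Purely as an illustration of the export and re-import step, the sketch below assumes the legacy Oracle exp/imp utilities; the connect strings, schema name (LFC) and file paths are hypothetical placeholders, not the values actually used.

"""Illustrative sketch only: export the LFC schema from the scratch Oracle 10
restore and re-import it into the rebuilt Oracle 11 RAC. All credentials,
connect strings, schema and file names are hypothetical placeholders."""
import subprocess

DUMP = "/tmp/lfc_schema.dmp"   # dump file, copied between hosts (e.g. via scp)

# Export the schema from the database restored on the scratch host
# (legacy 'exp' utility shipped with Oracle 10 installations).
subprocess.run(
    ["exp", "userid=system/password@RESTOREDB",
     "owner=LFC", "file=" + DUMP, "log=/tmp/lfc_exp.log"],
    check=True)

# ... copy the dump file to the rebuilt RAC node (omitted) ...

# Import the schema into the freshly installed Oracle 11 database.
subprocess.run(
    ["imp", "userid=system/password@SOMNUS",
     "fromuser=LFC", "touser=LFC", "file=" + DUMP, "log=/tmp/lfc_imp.log"],
    check=True)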

Analysis

This type of update has been done before by the RAL Tier1 database Team and the same procedure was being followed.

In this case problems were encountered updating the Oracle ASM software. To run the GUI, an X-server (eXceed) was used on the local desktop so that the GUI window could be opened on the local desktop screen. There was an interruption to the connection between the system used to display the Oracle GUI for the upgrade and the system being upgraded, resulting in the termination of the GUI and the associated PuTTY session. This interruption is coincident with a short problem on the network, as illustrated in the information below, and is believed to be the cause of the failure of the GUI and PuTTY sessions. Indications are that there is a problem with one of the site routers that should be followed up.

At this point the Oracle software was in an unknown state. The update had already been problematic when updating the ASM component and it was not known how far the update process (via the GUI) had gone. There was no logging information available for the upgrade process itself. The database was untouched by the upgrade process thus far.

A decision was taken to revert the change. Oracle provide a software un-installer, but this also removes the database itself. As the backup had not yet been verified to be good, this was not used, just in case problems were found with the backups. A manual un-install of the Oracle 11 software (to leave the previous Oracle 10 installation) was undertaken. However, this process itself ran into problems: the Oracle 10 system was unable to start. As noted above, a decision was taken not to involve Oracle support, as it was felt a re-install would lead to a quicker resolution of the current problem.

Care was appropriately taken not to use the Oracle tools that would jeopardize the data. However, the lack of a system already available running a version of Oracle 10 to validate the backup was a significant cause of the delay in restoring the LFC service.
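
In this case a full restore onto a separate Oracle 10 system was needed to be sure the backup was usable. For routine, lighter-weight checks, RMAN can read the backup pieces without restoring anything; the sketch below is illustrative only and assumes OS authentication on the database host.

"""Illustrative sketch only: ask RMAN to read the backup pieces needed for a
restore and report any problems, without writing any datafiles."""
import subprocess

# 'target /' assumes OS authentication on the database host; replace with a
# real connect string where appropriate.
RMAN_COMMANDS = "restore database validate;\nexit;\n"

result = subprocess.run(
    ["rman", "target", "/"],
    input=RMAN_COMMANDS, capture_output=True, text=True)

print(result.stdout)
# RMAN prefixes its error messages with RMAN- or ORA-.
if result.returncode != 0 or "RMAN-" in result.stdout or "ORA-" in result.stdout:
    raise SystemExit("Backup validation reported problems; do not proceed")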

The application of kernel updates to the database systems was carried out before the backups had been validated. It is theoretically possible for these to interfere with the restoration of the service. Such updates, whilst useful before a service is restarted, should not be done until the data is known to be safe - in this case that the backups had been verified.

The decisions to re-install the Oracle software and to restart the FTS service with a new database were taken following consideration by the database team, the FTS service manager and the Tier1 Production manager. The careful approach to the restoration of the LFC service was also the result of discussions involving the Tier1 Production manager and the database team. It could have been possible to return the LFC to service late on Thursday (14th). However, the Tier1 Production Manager requested that a final set of checks be made in the morning before opening up the service.

The following plot shows a peak in the rtt for packets through the main router situated between the Office system being used to display the GUI for the update procedure and the target system:

File:Router-a-latency.png

The following plot shows a spike in packet loss across a network path that, in part, replicates that being used between the Office system being used to display the GUI for the update procedure and the target system:

File:PingtestFailureWed13June.JPG
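
A test of this kind can be reproduced with a very simple probe. The sketch below is illustrative only: the target host name, ping count and sampling interval are placeholders, and the actual monitoring used to produce the plot above is not described in this report.

"""Illustrative sketch only: periodically sample packet loss and average RTT
to a target host. The host name below is a placeholder."""
import re
import subprocess
import time

TARGET = "lcgdb05.example.ac.uk"   # placeholder, not the real host name
COUNT = 5                          # echo requests per sample
INTERVAL = 10                      # seconds between samples

while True:
    # Parse the summary lines of the standard Linux ping output.
    out = subprocess.run(["ping", "-c", str(COUNT), TARGET],
                         capture_output=True, text=True).stdout
    loss = re.search(r"([\d.]+)% packet loss", out)
    rtt = re.search(r"= [\d.]+/([\d.]+)/", out)   # avg from the min/avg/max line
    print(time.strftime("%Y-%m-%d %H:%M:%S"),
          "loss=" + (loss.group(1) if loss else "?") + "%",
          "avg_rtt=" + (rtt.group(1) if rtt else "?") + "ms",
          flush=True)
    time.sleep(INTERVAL)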

The following plot shows the rate of events being signalled by the main router situated between the Office system being used to display the GUI for the update procedure and the target system:

File:Screenshot at 2012-06-22 11-13-47.png

Follow Up

Issue: A problem on the network led to the loss of the GUI being used to carry out the upgrade.
Response: Investigate, and implement, an alternative method of connecting to the system being upgraded so as to allow for a reconnection in the event of a network break.
Done: No

Issue: A lack of logging information meant that it was not possible to see how far the upgrade had proceeded, and what (if any) problems had been reported by the updating process.
Response: Check, and enable where possible, logging for upgrade processes.
Done: No

Issue: Significant time was lost building another compatible (Oracle 10) system to validate the backups.
Response: A system should be available for the validation of backups, and not require a special build. This system should be available for regular validation of the backups of the Oracle databases.
Done: No

Reported by: Gareth Smith 22nd June 2012

Summary Table

Start Date 13 June 2012
Impact >80%
Duration of Outage 49.5 hours
Status Open
Root Cause Network
Data Loss Partial