RAL Tier1 Incident 20110510 LFC Outage After DB Update


RAL Tier1 Incident 10th May 2011: LFC Outage After Database Update

Description:

Following a planned update of the Oracle databases behind the LFC and FTS services, the client applications were unable to connect to the database. This was traced to a problem with Oracle ACLs. The ACLs in question were removed and the services resumed after an outage of about one hour.

Impact

The LFC services lfc-atlas.gridpp.rl.ac.uk and lfc.gridpp.rl.ac.uk and the FTS service lcgfts.gridpp.rl.ac.uk were unavailable for about an hour.

Timeline of the Incident

Times are local time (BST).

When What
10th May 12:00 Start of scheduled Warning/At Risk in GOC DB
10th May 12:05 Start of updating Oracle database.
10th May 12:30 First Nagios failure (indicating problem with FTS).
10th May 13:00 End of updating Oracle database.
10th May 13:12 Notification received of arrival of GGUS Alarm ticket from Atlas.
10th May 13:12 Oracle ACL list removed.
10th May 13:23 Nagios tests clearing.
10th May 13:23 GGUS Alarm ticket responded to and marked as solved.
10th May 14:00 End of scheduled Warning/At Risk in GOC DB

Incident details

The application of regular patches to the Oracle databases behind services at the RAL Tier1 is a standard procedure. This patch had been applied to the Castor Oracle databases the day before and to the 3D Oracle databases earlier that day. No problems were encountered in either case.

The application of the patch to the Oracle database behind the LFC and FTS services (called SOMNUS) proceeded as planned. However, after the Oracle instances were restarted the client applications (LFC & FTS) were unable to connect to the database. This was traced to a problem with some ACLs on the database. Once these were removed the applications were again able to connect and the services resumed.

A GGUS ticket (#70435) was received from Atlas.

Analysis

The application of the patch proceeded satisfactorily. The problems occurred afterwards when the database was restarted and the client applications could not connect.

The first Nagios alerts were not acted upon quickly because it was known that work was proceeding on the database updates, and the problem was initially assumed to be transitory.

The problem was resolved by removing the Oracle ACLs from the database. The ACLs were used to control which nodes could connect to the database. These had been introduced in July 2010 when a separate FTS system was being tested. Although each database has a unique password which is used to control access, the ACLs provide an additional check to ensure that test systems cannot connect to the production database in error.

However, when the Oracle database starts, it validates the list of node names in the ACL 'allow' list. If any of the nodes in the list cannot be resolved to an IP address the Oracle listener does not start. If the listener is not running, connections to the database cannot be made.
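
This start-up requirement can be checked ahead of a restart with a simple name-resolution test. The sketch below (Python, illustrative only: the hostnames are a mixture of the service aliases named in this report and a hypothetical stale test entry, not the real contents of the ACL) mimics the listener's behaviour by attempting to resolve every name in the 'allow' list and reporting any that fail.

 import socket
 
 # Hypothetical ACL 'allow' list. The real entries used at RAL are not
 # recorded in this report; a stale test entry is included to show the
 # failure mode.
 allowed_nodes = [
     "lfc-atlas.gridpp.rl.ac.uk",
     "lfc.gridpp.rl.ac.uk",
     "lcgfts.gridpp.rl.ac.uk",
     "old-test-fts.example.invalid",  # left over from a decommissioned test system
 ]
 
 def unresolvable(nodes):
     """Return the entries that do not resolve to an IP address."""
     bad = []
     for name in nodes:
         try:
             socket.gethostbyname(name)
         except socket.gaierror:
             bad.append(name)
     return bad
 
 failures = unresolvable(allowed_nodes)
 if failures:
     # Mirrors the observed Oracle behaviour: a single unresolvable name
     # in the 'allow' list is enough to stop the listener starting.
     print("Listener start would be blocked by:", failures)
 else:
     print("All ACL entries resolve; the listener start would not be blocked.")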

At the time of a planned network intervention in March there had been a problem with connections to this database. A check was made of which nodes were in the ACL list at that time and any inaccessible nodes (as determined by a 'ping' test) were removed. However, it was not realised that some valid nodes were only temporarily inaccessible because of the network intervention, and these were removed from the ACL list in error. As the databases were up at this time the mistake had no immediate impact on operations. Following this change to the ACLs a validation of the revised ACL list was started, but it had not been completed by the time of this incident.

When the database was restarted for the update reviewed here, this modified ACL list came into effect. As a number of the client nodes were missing from the list, they were unable to connect to the database. The problem was resolved by removing the ACL list completely.

Changing the ACL list to be based on IP addresses rather than DNS names removes the requirement that Oracle validate all the DNS names before allowing any connections.
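
For comparison, an entirely illustrative check of an IP-based list (the addresses below are placeholders from the 192.0.2.0/24 documentation range, not RAL's real ones) involves no DNS lookup at all, so a stale DNS record can no longer prevent the listener from starting.

 import ipaddress
 
 # Hypothetical IP-based ACL entries; validating them needs no name resolution.
 allowed_addresses = ["192.0.2.10", "192.0.2.11"]
 
 for entry in allowed_addresses:
     try:
         ipaddress.ip_address(entry)  # raises ValueError for anything that is not an IP literal
     except ValueError:
         print("Not an IP literal, would still need resolving:", entry)
     else:
         print("Valid IP literal, nothing to resolve:", entry)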

Follow Up

Issue Response Done
Root Cause Root Cause understood. yes
Out of date Oracle ACLs containing DNS names blocked database connections. The ACLs using DNS names have been removed. Although the databases are protected by unique passwords, the ACLs should be restored in this case, but using IP addresses rather than DNS names. yes
Check whether any Oracle ACLs applied elsewhere could cause this problem. There are no ACLs applied on other databases. yes
Problems created by the changes to the Oracle ACL list in March were not picked up and checked. Update procedures such that relevant changes made by the Database Team are added to the Tier1 ELOGger so that they become visible to the wider team. yes

Reported by: Gareth Smith. 12th May 2011

Summary Table

Start Date 10 May 2011
Impact >80%
Duration of Outage 1 hour
Status Closed
Root Cause Configuration Error
Data Loss No