RAL Tier1 Incident 20091004 Castor database disk failure

Failure of Disk Subsystem Underneath Castor Databases

Site: RAL-LCG2

Incident Date: 2009-10-04

Severity: Tier 1 Disaster Management Process Triggered.

Service: Storage (Castor)

Impacted: All local VOs

Incident Summary: A failure of the disk system used to host the Castor Oracle databases. The system is fully redundant, with the databases mirrored across two disk subsystems. Following problems with the disk subsystem that hosted the second (mirrored) copy of the databases, a similar problem manifested itself in the other (production) disk subsystem.

Type of Impact: Down

Incident duration: 44 hours

Report date: 2009-10-05

Reported by: Gareth Smith, Tiju Idiculla

Status of this Report: Open

Related URLs:

Incident Overview:

There have been intermittent problems with one of the dual redundant disk subsystems that host the Castor Oracle databases. Both disk subsystems are connected to the Oracle RAC nodes by a redundant Storage Area Network. The first of these errors occurred on 10th September 2009, when the loss of access to the mirror set of disks caused Oracle to hang and interrupted the Castor service. Since then Oracle has been updated to fix the bug that caused it to hang rather than failing over to the one remaining accessible disk subsystem.
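
Although the report does not describe the database team's monitoring tools, the following is a minimal sketch of the kind of check that shows whether an Oracle ASM disk group still has both mirror copies available, i.e. whether one of the two disk subsystems (failure groups) has dropped out. The connection string, credentials and the use of the cx_Oracle Python module are assumptions for illustration only.

 # Illustrative sketch only: check ASM mirror status from a database instance.
 # The DSN and credentials are placeholders, not the real RAL configuration.
 import cx_Oracle

 conn = cx_Oracle.connect("monitor/secret@db-host/ORCL")  # placeholder connect string
 cur = conn.cursor()

 # V$ASM_DISKGROUP reports the redundancy type and any offline disks;
 # with NORMAL redundancy an offline failure group means the mirror is lost.
 cur.execute("""
     SELECT name, type, state, offline_disks
     FROM   v$asm_diskgroup
 """)
 for name, dg_type, state, offline in cur:
     if offline > 0:
         print(f"WARNING: disk group {name} ({dg_type}) has {offline} offline disk(s)")

 # V$ASM_DISK shows which failure group (i.e. which disk subsystem) is affected.
 cur.execute("""
     SELECT failgroup, COUNT(*)
     FROM   v$asm_disk
     WHERE  mode_status <> 'ONLINE'
     GROUP  BY failgroup
 """)
 for failgroup, n in cur:
     print(f"{n} disk(s) offline in failure group {failgroup}")

 conn.close()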

Since that date there had been recurring problems on the faulty disk subsystem, which were being investigated with the hardware engineer. As a result of these ongoing problems Oracle was making use of only one of the pair of disk subsystems, so the disk systems were no longer redundant. However, on Sunday 4th October this second disk subsystem failed with similar symptoms to the first, resulting in a complete loss of access to the Castor databases.

On Tuesday 6th October there was a failure of both disk subsystems that host the Oracle databases for the LFC, FTS and 3D services. These disk subsystems are identical hardware to that hosting the Castor databases.

Following the failures of all of these identical disk subsystems the decision was taken to host the critical databases elsewhere while the problem is understood. Alternative hardware was identified, with different solutions for the Castor databases and the LFC/FTS databases. Initially it was planned to leave the 3D databases in situ. However, it was subsequently decided to move these elsewhere as well, so that the diagnosis of the underlying hardware problems could be followed up more easily.

The LFC/FTS databases were restored and these services restarted at 16:00 on Wednesday 7th October.

Difficulties were encountered restoring the Castor databases to their state just before the failure on the Sunday. Following consultation with the relevant experts, a decision was taken to restore the 'Neptune' database (the Castor stager databases for Atlas and LHCb) to its last available restore point (note: need to confirm this time). The remaining database ('Pluto' - the nameserver and the stagers for the CMS and GEN instances) was restored to the moment before the failure on 4th October. Castor was restarted at 17:00 on Friday 9th October, following significant checking of the state of the databases, in particular for problems that may have arisen owing to the differences between the restore points of the Neptune and Pluto databases.
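
The report does not say how the consistency between the differently restored databases was checked. Purely as an illustration of the kind of cross-check implied, the sketch below flags nameserver entries created after the (earlier) Neptune restore point, since such entries may refer to files the restored stager no longer knows about. The file name, CSV layout, column names and timestamps are all hypothetical.

 # Illustrative sketch only: flag nameserver entries newer than the stager's
 # restore point. The export file, its columns and the cut-off time are
 # hypothetical; the report does not describe the actual checks performed.
 import csv
 from datetime import datetime

 NEPTUNE_RESTORE_POINT = datetime(2009, 10, 4, 0, 0)  # placeholder time

 suspect = []
 with open("pluto_nameserver_export.csv", newline="") as f:
     for row in csv.DictReader(f):  # expects columns: fileid, created
         created = datetime.strptime(row["created"], "%Y-%m-%d %H:%M:%S")
         if created > NEPTUNE_RESTORE_POINT:
             suspect.append(row["fileid"])

 print(f"{len(suspect)} nameserver entries newer than the Neptune restore point")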

The 3D databases were restored at 14:45 on Monday 12th October.

The Castor services, restored on Friday 9th October, appeared to run without problems over the weekend. However, at the start of the week reports were received (initially from CMS) of missing files. It was subsequently realised that the restore of the Pluto database was not correct and was only valid up to 00:15:56 on Thursday 24th September local time (23:15:56 on Wednesday 23rd September UTC). All data from that point until the crash on 4th October was missing. This issue is documented in a separate incident report (see https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20091009).

Further tests found significant noise on the electrical supply provided by the building UPS. A test bypassing the UPS, made on 5th January 2010, confirmed the UPS as the source of this noise. Since then work has been undertaken to reduce the noise, with partial success.

The installation of a longer power feed from the UPS reduced the measured electrical noise. However, tests carried out from the beginning of July 2010 indicated that the noise reduction was not sufficient and that the disk arrays still reported power supply problems.
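
The report does not describe how the rate of power supply errors was measured. The sketch below shows one way such an error rate (noted later in this document as roughly once per week) might be derived from a disk array log, assuming a hypothetical log file name and a hypothetical line format with a timestamp at the start of each line.

 # Illustrative sketch only: count power-supply error events per ISO week from
 # a disk array log. The file name and line format are assumptions; the actual
 # array logs at RAL are not described in this report.
 import re
 from collections import Counter
 from datetime import datetime

 events_per_week = Counter()
 pattern = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).*power supply", re.IGNORECASE)

 with open("oraclearray4.log") as log:  # placeholder file name
     for line in log:
         match = pattern.match(line)
         if match:
             when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
             year, week, _ = when.isocalendar()
             events_per_week[(year, week)] += 1

 for (year, week), count in sorted(events_per_week.items()):
     print(f"{year}-W{week:02d}: {count} power supply error(s)")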

Future mitigation:

Issue: Multiple failures across several independent systems. The cause needs to be understood.
Response: The root cause of the problem has been understood: it comes from noise on the electrical supply from the UPS.

Issue: Resolve or mitigate the noisy power supply from the UPS.
Response: Work to reduce the electrical noise has taken place. A longer power feed cable was installed, increasing the electrical inductance. This reduced the electrical noise, but not sufficiently, as the disk array power supplies still reported errors. Further work is ongoing in this area, both to investigate the practicality of powering the disk arrays from a combination of UPS and non-UPS power and to reduce the electrical noise further. Note added following review on 28/06/11: Isolating transformers have been placed in the power feed to the disk arrays and this has resolved the problem.

Related issues: None.

Timeline

Actually Started:
First Realisation of a problem: 2009-10-04 13:54:30 - Email from Adrian - "The tape database appears to be down"
First announcement of Problem: 2009-10-04 15:45:57 - Ian announces to experiments and puts site in downtime
Problem resolved:
Announced as Fixed:

Incident details Timeline:

Date Time Who/What Entry
2009-09-10 First instance of Database problems
2009-09-21 12:00 - 14:00 Database Team AT_RISK to apply ASM patch
2009-09-24 0:22 Database failover with callout
2009-09-29 07:14 Database failover without callout
2009-10-01 13:30 - 16:30 Database Team AT_RISK to apply ASM and Kernel patch
2009-10-03 18:20:10 OracleArray4 Logs First sign of hardware issue
2009-10-04 Database hardware failure
2009-10-04 13:54:56 Adrian Sheppard Sent alarm mail resulting in callout to Primary on-call
2009-10-04 14:08:41 Nagios First callout to Primary-on-Call from nagger
2009-10-04 15:10:16 Nagios Callout due to SAM test failure
2009-10-05 13:40 Database Team Started to roll back database
2009-10-05 15:45 Tier1 Disaster Management Team Meeting to invoke disaster management process.
2009-10-06 12:00:06 OracleArray3 Logs First sign of hardware issue
2009-10-06 12:13:08 OracleArray3 Logs Array failure
2009-10-06 12:36:23 OracleArray4 Logs Array failure
2009-10-06 12:20 (approx) Problem seen on the other RAC (SOMNUS) that hosts the LFC and FTS databases
2009-10-06 13:00 Tier1 Disaster Management Team 2nd Meeting in disaster management process.
2009-10-07 11:30 Tier1 Disaster Management Team 3rd Meeting in disaster management process.
2009-10-07 16:00 LFC and FTS services back (downtime in GOC DB ends)
2009-10-08 11:00 Tier1 Disaster Management Team including GridPP representative 4th Meeting in disaster management process.
2009-10-09 17:00 Castor Team Castor services restored.
2009-10-12 14:45 3D services restored (downtime in GOC DB ends)


2010-01-05 Test of UPS Bypass confirms noise on electrical current caused by UPS.
2010-08-12 Some months ago the electrical noise level had been improved by the installation of a longer cable between the UPS and its load. However, tests over the last month, with one of the disk array units having one power supply fed from the UPS, indicate that the noise has not been reduced sufficiently: the unit reported errors at an average rate of about once per week.


2010-12-02 All disk array units are being powered from both UPS and non-UPS power to mitigate the effects of both noise on the UPS supply and a power outage.