RAL Tier1 Incident 20091004 Castor database disk failure
Failure of Disk Subsystem Underneath Castor Databases
Incident Date: 2009-10-04
Severity: Tier 1 Disaster Management Process Not Triggered.
Service: Storage (Castor)
Impacted: All local VOs
Incident Summary: The disk system used to host the Castor Oracle databases failed. This is a fully redundant system. Following problems with the disk subsystem that hosted the second (mirrored) copy of the databases, a similar problem manifested itself in the other (production) disk subsystem.
Type of Impact: Down
Incident duration: 44 hours
Report date: 2009-10-05
Reported by: Gareth Smith, Tiju Idiculla
Status of this Report: Open
There have been some intermittent problems with one of the dual redundant disk subsystems that host the Castor Oracle databases. Both disk subsystems are connected by a redundant Storage Area Network to the Oracle RAC nodes. The first of these errors occurred on 10th September 2009, when the loss of access to the mirror set of disks caused Oracle to hang and interrupted the Castor service. Since then Oracle has been updated to fix the bug that caused it to hang rather than fail over to the one remaining accessible disk subsystem.
Since that date we have seen some recurring problems on the faulty disk subsystem. These were being investigated with the hardware engineer. As a result of these ongoing problems Oracle was only making use of one of the pair of disk subsystems. The disk systems were therefore no longer redundant. On Sunday 4th October this second disk subsystem failed with similar symptoms to the first, resulting in a complete loss of access to the Castor databases.
On Tuesday 6th October there was a failure of both disk subsystems that host the Oracle databases for the LFC, FTS and 3D services. These disk subsystems are identical hardware to those that host the Castor databases.
Following the failures of all of these identical disk subsystems the decision was taken to host the critical databases elsewhere while the problem is understood. Alternative hardware was identified, with different solutions for the Castor databases and the LFC/FTS databases. Initially it was planned to leave the 3D databases in situ. However, it was subsequently decided to move these elsewhere as well, so that the underlying hardware problems could be diagnosed more easily.
The LFC/FTS databases were restored and these services restarted at 16:00 on Wednesday 7th October.
Difficulties were encountered restoring the Castor databases to their final state just before the failure on the Sunday. Following consultation with relevant experts a decision was taken to restore the 'Neptune' database (Castor stager databases for Atlas and LHCb) to its final available point (note: need to confirm this time.) The remaining databases ('Pluto' - nameserver and stager for CMS and GEN instances) were to be restored to the moment before the failure on the 4th October. Castor was restarted at 17:00 on Friday 9th October, following significant checking of the state of the databases, in particular for problems that may have arisen owing to the differences between the restore points of the Neptune and Pluto databases.
The 3D databases were restored at 14:45 on Monday 12th October.
After the Castor services were restored on Friday 9th October they appeared to run without problems over the weekend. However, at the start of the week reports were received (initially from CMS) of missing files. It was subsequently realized that the restore of the Pluto database was not correct and was only valid up to 00:15:56 on Thursday 24th September local time (23:15:56 on Wednesday 23rd September UTC). All data from that point until the crash on 4th October was missing. This issue is being documented in a separate incident report (see https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20091009).
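The local and UTC times quoted for the Pluto restore point can be cross-checked with a short script. This is a minimal sketch, assuming that "local time" in this report means British Summer Time (UTC+1), which was in effect in the UK in late September 2009. The names `BST` and `local_to_utc` are illustrative only and do not come from any Tier1 tooling.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "local time" is British Summer Time (UTC+1).
BST = timezone(timedelta(hours=1), name="BST")


def local_to_utc(local_naive: datetime) -> datetime:
    """Attach the assumed BST offset to a naive timestamp and convert to UTC."""
    return local_naive.replace(tzinfo=BST).astimezone(timezone.utc)


# Restore point of the Pluto database as quoted in the report (local time).
restore_point_local = datetime(2009, 9, 24, 0, 15, 56)
restore_point_utc = local_to_utc(restore_point_local)

print(restore_point_utc.isoformat())  # 2009-09-23T23:15:56+00:00
```

Under that assumption the result agrees with the UTC time given above: the restore point falls late on Wednesday 23rd September in UTC even though it is early on Thursday 24th September local time.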
Further tests have found significant noise on the electric current provided by the building UPS. A test of the bypass of the UPS made on 5th January 2010 confirmed the UPS as the source of this noise. Since then work has been undertaken to reduce the noise, with partial success.
The installation of a longer power feed from the UPS reduced the measured electrical noise. However, tests carried out from the beginning of July 2010 have indicated the noise reduction is not sufficient and the disk arrays still report power supply problems.
|Multiple failures across several independent systems. Cause needs to be understood.||The root cause of the problem has been understood and comes from noise on the electrical current from the UPS.|
|Resolve or mitigate noisy power supply from the UPS.||Work to reduce the electrical noise has taken place. A longer power feed cable was installed, increasing the electrical inductance. This reduced the electrical noise, but not sufficiently, as the disk array power supplies have still reported errors. Further work is ongoing in this area, both to investigate the practicality of powering the disk arrays from a combination of UPS and non-UPS power and to further reduce the electrical noise. Note added following review on 28/06/11: Isolating transformers have been placed in the power feed to the disk arrays and this has resolved the problem.|
Related issues: None.
|First Realisation of a problem||2009-10-04||13:54:30||Email from Adrian - "The tape database appears to be down"|
|First announcement of Problem||2009-10-04||15:45:57||Ian announces to experiments and puts site in downtime|
|Announced as Fixed|
Incident details Timeline:
|2009-09-10||First instance of Database problems|
|2009-09-21||12:00 - 14:00||Database Team||AT_RISK to apply ASM patch|
|2009-09-24||0:22||Database failover with callout|
|2009-09-29||07:14||Database failover without callout|
|2009-10-01||13:30 - 16:30||Database Team||AT_RISK to apply ASM and Kernel patch|
|2009-10-03||18:20:10||OracleArray4 Logs||First sign of hardware issue|
|2009-10-04||Database hardware failure|
|2009-10-04||13:54:56||Adrian Sheppard||Sent alarm mail resulting in callout to Primary on-call|
|2009-10-04||14:08:41||Nagios||First callout to Primary-on-Call from nagger|
|2009-10-04||15:10:16||Nagios||Callout due to SAM test failure|
|2009-10-05||13:40||Database Team||Started to roll back database|
|2009-10-05||15:45||Tier1 Disaster Management Team||Meeting to invoke disaster management process.|
|2009-10-06||12:00:06||OracleArray3 Logs||First sign of hardware issue|
|2009-10-06||12:13:08||OracleArray3 Logs||Array failure|
|2009-10-06||12:36:23||OracleArray4 Logs||Array failure|
|2009-10-06||12:20 (approx)||Problem on other RAC (SOMNUS) that hosts the LFC and FTS databases seen|
|2009-10-06||13:00||Tier1 Disaster Management Team||2nd Meeting in disaster management process.|
|2009-10-07||11:30||Tier1 Disaster Management Team||3rd Meeting in disaster management process.|
|2009-10-07||16:00||LFC and FTS services back (downtime in GOC DB ends)|
|2009-10-08||11:00||Tier1 Disaster Management Team including GridPP representative||4th Meeting in disaster management process.|
|2009-10-09||17:00||Castor Team||Castor services restored.|
|2009-10-12||14:45||3D services restored (downtime in GOC DB ends)|
|2010-01-05||Test of UPS Bypass confirms noise on electrical current caused by UPS.|
|2010-08-12||Some months ago the electrical noise level was improved by the installation of a longer cable between the UPS and its load. However, tests over the last month with one of the disk array units having one power supply fed from the UPS indicate the noise has not been reduced sufficiently. The unit is reporting errors at an average rate of about once per week.|
|2010-12-02||All disk array units are being powered from both UPS and non-UPS power to mitigate the effects of both noise on the UPS supply and a power outage.|