RAL Tier1 Incident 20101231 PDU Problem


RAL Tier1 Incident 31st December 2010: PDU Shutdown and Short Service Outage.

Description

Staff attending the site on 31st December noticed that one of the Power Distribution Units (PDUs) was overheating. The unit was turned off. Tier1 equipment is dual powered and carried on running. Concern about the disk arrays hosting the Oracle databases led to a short outage the following morning.

Impact

Castor, LFC, FTS and 3D services were unavailable for around 3.5 hours on Saturday 1st January 2011.

Timeline of the Incident

When | What
31st December 2010 16:52 | Formal notification (pager) from Operations Staff to Tier1 (Primary On-Call) that an emergency shutdown had been carried out on PDU G6 in the LPD room. A telephone notification had also been received shortly before this.
31st December 2010 17:15 | Short halt of the Castor SRMs while the state of the power was assessed.
31st December 2010 17:25 | SRMs resumed; Primary and Castor On-Call agreed it was OK to carry on.
1st January 2011 00:38 | E-mail from the Fabric Manager alerting Primary On-Call to possible risks to the disk arrays hosting the Oracle databases.
1st January 2011 09:00 | E-mail exchange between the Fabric Manager and Primary On-Call, followed by a phone conversation.
1st January 2011 11:30 | Start of unscheduled outage in the GOC DB.
1st January 2011 11:55 | Shut-down of the services that use the Oracle databases (Castor, LFC, FTS, 3D), followed by shut-down of the databases themselves.
1st January 2011 12:15 | Cabling of the disk arrays behind the Oracle databases checked. Dual powering re-established for the arrays hosting the LFC, FTS & 3D databases.
1st January 2011 14:30 | Services coming back up.
1st January 2011 15:00 | End of unscheduled outage in the GOC DB.
10th January 2011 | Re-balancing (re-synchronization) of the disk arrays hosting the LFC, FTS & 3D databases completed.

Incident details

On 31st December a member of staff attending the site to make a planned check of the computer room noticed that one of the Power Distribution Units was overheating. The unit was turned off. Tier1 On-Call staff were contacted. A check on the Tier1 equipment, which is dual powered in many cases, indicated that all relevant services had carried on running. Primary On-Call decided that systems could be left up, although an 'At Risk' was declared in the GOC DB until the Tuesday (the first day back at work following the holiday).

On the morning of Saturday 1st January 2011 the Tier1 Fabric Manager, who had not been involved in the incident the previous day, expressed concern about the state of the disk arrays that host the Oracle databases, as in some cases these were no longer dual powered. The Tier1 Fabric Manager and Primary On-Call agreed that services would be stopped until an assessment had been made on site. This was done. It was found that the disk arrays hosting the Oracle databases behind the LFC, FTS and 3D services were no longer dual powered. This was rectified and the services were then re-started. (The arrays hosting the Castor databases still had dual power.) A total of 3.5 hours of outage was declared.

During the incident the Oracle system for the LFC, FTS & 3D databases had lost contact with one of the pair of disk arrays. It was subsequently necessary to re-balance (re-synchronize) these databases across the pair of arrays.
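
For illustration only, the sketch below (Python with the cx_Oracle driver, assuming an Oracle ASM configuration; the disk group name and connection details are placeholders rather than the Tier1 settings) shows the kind of command used to start such a rebalance once both arrays are visible again, and how its progress can be followed.

    import cx_Oracle

    def rebalance_diskgroup(dsn, diskgroup="DATA", power=4):
        """Ask ASM to redistribute data across all disks visible in the group."""
        # Credentials, DSN and disk group name are placeholders for this sketch.
        conn = cx_Oracle.connect("sys", "change_me", dsn, mode=cx_Oracle.SYSASM)
        try:
            cur = conn.cursor()
            # Start the rebalance; ASM carries it out in the background.
            cur.execute(f"ALTER DISKGROUP {diskgroup} REBALANCE POWER {power}")
            # Progress can be followed in V$ASM_OPERATION until no rows remain.
            cur.execute("SELECT operation, est_minutes FROM v$asm_operation")
            for operation, est_minutes in cur:
                print(f"ASM {operation} running, about {est_minutes} minutes left")
        finally:
            conn.close()

    if __name__ == "__main__":
        rebalance_diskgroup("dbhost/+ASM")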

Analysis

This problem occurred during the Christmas and New Year holiday. During this time the Tier1 runs its normal out-of-hours on-call process, supplemented by regular daily checks of systems. The machine room operations team also run their own call-out system and attended site on some days to check systems.

The problem found was overheating of a power conditioning unit in one of the Power Distribution Units in the 'LPD' room. When the affected PDU was turned off, the Tier1 equipment in that room, which is dual-powered, carried on running, so services continued uninterrupted. Monitoring displays showed that only a handful of non-production Tier1 systems were no longer running.

The disk arrays that host the Oracle databases are dual-powered from the UPS and from mains power. This arrangement is a workaround for a problem whereby the disk arrays are sensitive to electrical noise from the UPS. The disk arrays have two power units; should one be affected by electrical noise from the UPS (which is seen to happen roughly weekly), the array stays functioning on the other power supply. Following the switching off of the PDU, the disk arrays hosting the LFC, FTS & 3D databases were no longer dual powered and were left sensitive to the electrical noise from the UPS. On the morning of 1st January the power status of the disk arrays was not known with certainty. All services that use the databases on these arrays were stopped. Staff attending on site verified, and resolved, the power problems for the arrays, and services were restarted.
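
A minimal sketch of the kind of check that would remove this uncertainty is given below; get_power_feeds() is a hypothetical stand-in for whatever query the arrays' management interface supports (SNMP or a vendor tool), and the array names are illustrative only.

    def get_power_feeds(array_name):
        """Placeholder: return the power feeds the named array reports as healthy."""
        # A real check would query the array's management interface here.
        return ["UPS"]  # example reading: running on the UPS feed only

    def report_single_powered(arrays):
        """Warn about any array that is not currently dual powered."""
        for name in arrays:
            feeds = get_power_feeds(name)
            if len(feeds) < 2:
                print(f"WARNING: {name} has {len(feeds)} healthy power feed(s): "
                      f"{', '.join(feeds) or 'none'}")

    if __name__ == "__main__":
        report_single_powered(["oracle-array-a", "oracle-array-b"])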

Since the problem, temperature sensors have been installed to monitor these PDUs. Checks on all the systems revealed a fault in the wiring to the cooling fans, which has since been resolved.
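
A sketch of what such automated monitoring amounts to is given below (it is not the monitoring actually deployed); the sensor read-out, the alarm threshold and the pager address are all assumptions standing in for the site's own tools.

    import smtplib
    from email.message import EmailMessage

    THRESHOLD_C = 45.0                      # assumed alarm threshold in Celsius
    PAGER = "tier1-oncall@example.org"      # placeholder pager/e-mail gateway

    def read_pdu_temperature(pdu_name):
        """Placeholder: return the latest reading from the PDU's temperature sensor."""
        # A real check would read the newly installed sensors here.
        return 22.0

    def check_pdu(pdu_name):
        """Raise an alarm if the PDU's temperature exceeds the threshold."""
        temperature = read_pdu_temperature(pdu_name)
        if temperature > THRESHOLD_C:
            msg = EmailMessage()
            msg["Subject"] = f"PDU {pdu_name} overheating: {temperature:.1f} C"
            msg["From"] = "pdu-monitor@example.org"
            msg["To"] = PAGER
            msg.set_content(
                f"Temperature {temperature:.1f} C exceeds the {THRESHOLD_C:.1f} C threshold.")
            with smtplib.SMTP("localhost") as smtp:
                smtp.send_message(msg)

    if __name__ == "__main__":
        check_pdu("G6")                     # PDU G6 is the unit named in the timeline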

Follow Up

Issue | Response | Done
Reliance on a manual check to uncover the overheating of the PDU. | Introduce automated monitoring of these temperatures. | Yes
Could other PDUs fail in this way? | Assess whether the other PDUs are vulnerable to the same failure. | Yes
Systems left running with disk arrays in a vulnerable state. | Ensure the Tier1 Fabric Manager is always consulted when there are power problems. | Yes
Disk arrays at risk when running only on UPS power. | Resolve the issue of the disk arrays' sensitivity to electrical noise from the UPS. | Yes

Reported by: Gareth Smith, Wednesday 5th January 2011

Summary Table

Start Date 31 December 2010
Impact >80%
Duration of Outage 3.5 hours
Status Closed
Root Cause Hardware
Data Loss No