RAL Tier1 Incident 20101201 Power Outage

RAL Tier1 Incident 1st December 2010: Short Power Outage led to loss of service.

Description

The RAL Tier1 suffered a very short power outage at around 13:30 on Wednesday 1 December. Most services continued uninterrupted but a number of disk servers and batch worker nodes tripped off. Owing to the time taken to recover some disk servers, most Castor end points were in outage until the following morning.

Impact

The power interruption affected some disk servers and batch worker nodes. Core grid services are powered by a UPS and stayed up. Around 20% of batch capacity (and therefore of running jobs) was lost. The main impact was that most of Castor was declared to be in an Outage for around 20 hours. During this time Castor was running, but a significant number of disk servers (initially around 25 to 30%) were unavailable. The exception was LHCb, for which all disk servers were available within three hours of the power dip.

Timeline of the Incident

1st December 13:30 - Short power outage noticed by Tier1 staff.
1st December 13:40 - Site set in Outage in the GOC DB while assessments were made. Dashboard updated.
1st December 14:30 - Assessment showed most services, including the tape systems, up. A number of disk servers known to be down, many running a file system check (fsck).
1st December 16:10 - Ended the Outage of the site in the GOC DB. Added GOC DB Outage entries for all SRM endpoints (except LHCb) until 10:00 the next morning; srm-lhcb put into Warning for the same period as all its disk servers were already back up.
2nd December 09:10 - Eight disk servers affected by the power cut still out of production. These being checked out individually.
2nd December 09:40 - Two of the remaining four Atlas disk servers back in production.
2nd December 10:00 - GOC DB Outages and Warning allowed to expire.
2nd December 16:30 - All bar one of the affected disk servers back in production.
9th December 10:00 - Final disk server (gdss77 - CmsFarmRead) returned to production.

Incident details

At 13:30 on the 1st December Tier1 staff noticed a brief power dip. An initial investigation showed that a number of systems had tripped off. The site was declared to be in an Outage while investigations took place. It became clear that all services had stayed up with the exception of some disk servers and batch worker nodes.

The batch worker nodes were quickly brought back online. However, many of the disk servers ran a file system check (fsck), which took many hours; in a significant number of cases these servers did not complete their checks until the evening. A small number (eight) of disk servers had specific problems that required further investigation.

There was no loss of hardware (power supplies etc.).

The details of the incident were tracked internally within the Tier1 in RT ticket #69959.


Analysis

The cause was the failure of an 11kV underground cable joint on the RAL site. Only systems on Phase A were affected; systems powered from the other phases or from the building UPS were unaffected, including core systems such as the LFC front end, BDIIs etc.

A significant number of Castor disk servers were affected. These do not have UPS power and, on reboot, those with the ext3 file system carried out a file system check (fsck), which caused a significant delay in returning the servers to use. Newer servers use the XFS file system, which does not require this regular check on boot. However, as this was the first such power incident since the deployment of servers with XFS, some checksums were validated on an XFS server as a precautionary measure.
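
As a rough illustration (a sketch, not the Tier1 tooling), the Python snippet below reads the periodic-fsck settings that tune2fs -l reports for an ext2/3 device: the "Maximum mount count" and "Check interval" fields that determine whether a boot-time fsck is forced. It assumes e2fsprogs is installed and typically needs root; the default device path is purely illustrative.

 #!/usr/bin/env python3
 # Sketch (assumption, not Tier1 tooling): report whether an ext2/3 file
 # system is due a forced periodic fsck, by parsing "Maximum mount count"
 # and "Check interval" from the output of tune2fs -l.
 import subprocess
 import sys

 def fsck_settings(device):
     """Return (max_mount_count, check_interval_seconds) for an ext2/3 device."""
     out = subprocess.check_output(["tune2fs", "-l", device], text=True)
     fields = {}
     for line in out.splitlines():
         key, sep, value = line.partition(":")
         if sep:
             fields[key.strip()] = value.strip()
     max_mounts = int(fields.get("Maximum mount count", "0"))
     # "Check interval" is printed as e.g. "15552000 (6 months)" or "0 (<none>)"
     interval = int(fields.get("Check interval", "0").split()[0])
     return max_mounts, interval

 if __name__ == "__main__":
     device = sys.argv[1] if len(sys.argv) > 1 else "/dev/sda1"  # illustrative default
     mounts, seconds = fsck_settings(device)
     if mounts <= 0 and seconds == 0:
         print("%s: periodic fsck disabled" % device)
     else:
         print("%s: fsck forced after %d mounts or %d seconds" % (device, mounts, seconds))

If the settings need adjusting, tune2fs -c and tune2fs -i are the usual options for the mount-count and time-based limits; the Follow Up below notes that the Tier1 servers had been set to the maximum configurable interval, which had expired shortly before this incident.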

A portion of the batch capacity (around 20%), along with the jobs running on those nodes, was lost as the worker nodes restarted.

Only one phase (A) of the power supply was affected. Consideration should be given to the effect of the failure of either of the other phases on Tier1 operations.

Follow Up

Issue: Disk servers susceptible to power outages.
Response: Implement the planned dual-powering of disk servers from both UPS and non-UPS power so that the servers can either remain up during a short power outage or execute a controlled shut-down on power failure.
Done: No

Issue: File servers with ext3 file systems checking (fsck) on boot.
Response: At the start of the 2010 LHC run the disk servers had been set not to routinely fsck on boot for the maximum configurable time; however, that setting had expired a few days before this power outage. This will no longer be necessary once systems have moved to XFS, but those systems remaining on ext3 need to have the parameter checked to ensure the problem does not recur during 2011 running.
Done: No

Issue: Consider the effect of such an incident affecting other power phases.
Response: A review has been carried out of the effect of a failure of any of the power phases on the Tier1 systems, with consideration given to mitigation by moving to UPS (or dual power with one side on UPS) where appropriate. The disk arrays hosting the Oracle databases have all been moved to dual (UPS and non-UPS) power since this incident. The only other category of system identified by the review is the Castor head nodes; a subsequent review has verified that these are powered as expected.
Done: Yes


Reported by: Gareth Smith. Wednesday 8th December.

Summary Table

Start Date: 1st December 2010
Impact: >50%
Duration of Outage: 20 hours
Status: Open
Root Cause: External Power Supply Failure
Data Loss: No