RAL Tier1 Incident 20121107 Site Wide Power Failure


RAL Tier1 Incident 07th November 2012: Site Wide Power Failure

Description

A power outage affected the whole of the RAL site, including the building housing the Tier1 equipment. Core equipment was maintained temporarily on UPS power, but the diesel generator that would have provided longer-term temporary power failed to cut in. All systems were therefore down.

Impact

National critical services (FTS, BDIIs, LFC, WMS) unavailable for 4-6 hours. Main capacity services (Castor storage and batch) unavailable for about 27 hours.

Timeline of the Incident

When What
7th November 11:23 Site wide power outage, confirmed by R18. Operations emergency procedures in use.
7th November 11:23 R89 generator fired up. Approximately 30 seconds after starting, the generator stops. R18 informed.
7th November 11:24 Fabric staff already in the machine room are alerted by other team members that the generator went silent shortly after they had left the office for the machine room.
7th November 11:25 Shutdown of ADS/DMF/Castor tape servers started. Not all were complete when UPS power was lost.
7th November 11:30 Operations decided to hit all EPOs in the LPD and HPD to safeguard against any power surge/damage. PDUs D2/1 and D2/2 left untouched in case the generator is restored. In the HPD, 2011-generation disk servers remain up as their racks have one power strip on UPS.
7th November 11:30 R18 in attendance to start up the generator. R18 unable to clear the generator alarm and in contact with the company.
7th November 11:30 (Approx) Tier1 staff learnt that the power outage affects the whole site (Andrew leant out of a window and asked electrical staff passing by).
7th November 11:34 CIC broadcast sent notifying about the power outage. Site put into a 24-hour outage in the GOC DB (until midday on 8th Nov).
7th November 11:35 2011-generation disk servers properly shut down by pressing the power button to initiate a software shutdown.
7th November 11:37 Tier-1 Manager notifies (by text) the GridPP PMB of the site-wide power failure.
7th November 11:40 Due to the low remaining battery charge and problems with the generator, decided to hit the EPOs to D2/2 and D2/1. Not all Tier1 systems were shut down (e.g. not all databases).
7th November 11:45 Generator started up but unstable. HP decided to leave D2/1 & D2/2 off.
7th November 12:15 R26 A5L computer room distribution powered off and Platform Managers (PMs) informed.
7th November 12:20 Mike Ashworth (Officer in Charge Scientific Computing) phones to request status update
7th November 13:15 Site-wide power available but unstable. Advised PMs and decided to leave all PDUs off. Shortly (a minute or so) afterwards the diesel generator stops.
7th November 13:15 Peter Gronbech sends Broadcast "RAL Tier 1 Power Outage update" on our behalf.
7th November 13:26 Tier-1 Manager notifies (text) PMB power is restored
7th November 13:45 Castor manager sends tweet from @RALTier1 stating all services down.
7th November 14:10 R18 BPG confirms power is stable. UPS room PDUs D2/1 & D2/2 powered up. Some racks in the HPD/LPD and the network racks now powered up.
7th November 14:10 R18 Estates concerned about a problem with the chillers/pumps. HP decided to leave PDUs in the HPD/LPD off until further advice.
7th November 14:15 R18 Estates confirms chillers and 3 out of 4 pumps working. Safe to restore power.
7th November 14:20 Risks reviewed and decided it was safe to advise Platform Managers to power up systems in the UPS room.
7th November 14:20 All PDUs in the HPD/LPD powered up.
7th November 14:25 Managers informed that power was restored and safe to power up their systems.
7th November 14:55 Problems getting some parts of the network back; looks like Router A problems.
7th November 15:00 All Hyper-V hypervisors in UPS room running. Network problems mean not all VMs properly connected to network. Also two hypervisors not manageable through SCVMM.
7th November 15:10 Andrew Taylor (Executive Director of the National Laboratories) visited to check status of Tier-1 and provide update from site Disaster Control meeting.
7th November 15:35 Tier1 manager speaks with site networking (Nick Moore) who re-enables Tier1 connections to Router A and UKLight router. Network connectivity restored.
7th November 16:05 LFC service up.
7th November 16:10 A number of services now up. ATLAS Conditions database and Frontier launchpads start working again.
7th November 16:20 External e-mails start arriving.
7th November 16:30 Whole team assessment meeting. Agreed to start disk servers ready for restarting Castor in the morning. Only one of the Site BDIIs working; agreed to request an alias change.
7th November 16:45 IPC confirms all Hyper-V hypervisors are up and manageable through SCVMM. (Should have been followed by checking any production services.)
7th November 17:07 Disk servers started
7th November 17:30 Robot controller (acsls) brought online.
7th November 19:00 After an engineer had attended site the SL8500 was back online. Handbot replaced (though this was not the cause of the problems).
7th November 19:12 Production Manager puts out dashboard update (with tweet) and sends an e-mail to many recipients detailing the current situation.
7th November 19:30 ADS and DMF systems were brought back online and made available to users for restoration of system backups if needed.
7th November 19:44 Production Manager sends Broadcast with status update.
8th November 09:00 Whole team assessment meeting.
8th November 09:35 Found many (up to ~3000) batch jobs running. Reservation set on batch farm to stop any more job starts.
8th November 10:30 WMS service manager confirms WMSs running.
8th November 10:45 Castor for Facilities up.
8th November 10:50 Assessment of problematic disk servers completed and passed to Fabric team with a priority order for investigation. (11 servers on list).
8th November 11:05 Extended the outage for Castor & batch to 16:00. Updated dashboard.
8th November 11:15 Helpdesk (RT) back in operation.
8th November 11:30 E-mail sent with status update to users. Nagios up and running but not yet reporting anything (not sending any e-mails out).
8th November 11:30 FTS down (although it was up at the end of yesterday).
8th November 11:40 Database team completed Castor DB checks - hand over to Castor team for them to start testing castor.
8th November 11:40 Killed off the batch jobs that had been started accidentally.
8th November 12:03 Fabric Team report that all the disk servers in the list provided to them are working again.
8th November 12:35 Restart of FTS front end VMs, along with Tomcat restart on FTS agents. (Fixes FTS problem).
8th November 13:20 Castor just undergoing final checks.
8th November 14:00 Ended the remaining outage in the GOC DB (Castor & Batch).
8th November 14:00 Opened up FTS to/from RAL Tier1 (channels set to 50% to start).
8th November 14:20 Re-started batch system.
8th November 14:25 Four of the previously problematic disk servers have given further problems (on different partitions). Requested removal from production.
8th November 14:25 Atlas Frontier Squids working.
8th November 19:00 Last of the batch nodes that had developed problems (file system, BMC, PSUs) brought up

Incident details

At 11:23 on Wednesday 7th November there was a power outage that affected the RAL site. Core services that were powered by the UPS remained up. The diesel generator that should provide temporary power over a longer period started but then cut out. As the UPS batteries could only keep systems running for a short time (tens of minutes), a start was made on shutting down those systems on UPS power. However, with limited power available the UPS power was manually cut at 11:30, by which time most systems on UPS had not been shut down.

Power was restored to the site at 13:15 but no Tier1 systems were restarted until it was confirmed the power was stable at 14:25. Core services (Top BDII etc) were brought up during the afternoon. Storage (Castor) and Batch services were restored the following morning - with the final outage being ended in the GOC DB at 14:00 on the 8th November.

Analysis

The root cause of the site power outage is not the focus of this review. However, it is noted that work was ongoing on the switchgear for the power onto site, with one set (out of two) being maintained at the time. The Tier1 was aware of a higher chance of a power outage during this period (which spanned many months).

The failure of the diesel generator to pick up the power generation led to the site power loss having a far greater impact on services than expected. Had the diesel generator run correctly, the core services would have stayed up. Furthermore, database systems were not shut down cleanly before the UPS power was cut. This risked data loss from those systems, which would have necessitated recovery from backup. Fortunately the databases restarted correctly. Although the recovery took a long time, it could easily have taken significantly longer had any of the databases been corrupted.
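
The shutdown shortfall described above is essentially a prioritisation problem under a fixed battery budget: the services whose unclean loss is most costly (the databases) need to be stopped first, within the time the UPS can provide. The sketch below is a minimal illustration of that idea only; the group names, commands and timings are hypothetical placeholders, not the Tier1's actual emergency procedure.

 #!/usr/bin/env python3
 """Illustrative emergency-shutdown ordering under a UPS time budget.

 Group names, commands and timings are hypothetical placeholders, not the
 actual RAL Tier1 procedure. The point is the ordering: the services whose
 unclean loss is most costly (databases) are stopped first."""
 import subprocess
 import time

 UPS_RUNTIME_ESTIMATE = 15 * 60   # seconds of battery expected (assumption)

 # (description, command, rough time needed in seconds) in priority order
 SHUTDOWN_PLAN = [
     ("Oracle/Castor databases", ["echo", "stop databases cleanly"], 300),
     ("ADS/DMF/Castor tape servers", ["echo", "stop tape servers"], 240),
     ("remaining UPS-fed services", ["echo", "stop other services"], 300),
 ]

 def run_plan(budget=UPS_RUNTIME_ESTIMATE):
     start = time.monotonic()
     for name, cmd, needed in SHUTDOWN_PLAN:
         remaining = budget - (time.monotonic() - start)
         if remaining < needed:
             print(f"skipping '{name}': only {remaining:.0f}s of battery left")
             continue
         print(f"stopping {name} ...")
         subprocess.run(cmd, check=False)

 if __name__ == "__main__":
     run_plan()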

The work to restart the Tier1 went reasonably well. However, there was significant room for improvement. This was the first 'cold start' of the complete Tier1 for about three years. Procedures, whilst in place, required some updating on the fly and there was room for improving the start sequence. Examples of delays included the time taken to re-establish the network link from the Tier1 core to the rest of site and having batch work start before the Castor storage was available.
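
One of the start-sequence issues noted above, batch work starting before the Castor storage was available, is the kind of dependency that a cold-start procedure can enforce with a simple gate. The following is a minimal sketch of the idea only; the probe and enable commands are placeholders (assumptions), not the actual RAL Tier1 tooling.

 #!/usr/bin/env python3
 """Illustrative cold-start gate: hold batch job starts until storage answers.

 All commands here are placeholders, not the actual RAL Tier1 scripts. The
 point is the ordering: batch is only enabled once a storage probe succeeds."""
 import subprocess
 import sys
 import time

 STORAGE_PROBE = ["true"]                            # placeholder storage health check
 ENABLE_BATCH = ["echo", "lift batch reservation"]   # placeholder enable command
 RETRY_SECONDS = 60
 MAX_WAIT_SECONDS = 4 * 3600                         # give up after four hours

 def probe_ok(cmd):
     """Return True if the probe command exits cleanly."""
     return subprocess.run(cmd, capture_output=True).returncode == 0

 def main():
     waited = 0
     while not probe_ok(STORAGE_PROBE):
         if waited >= MAX_WAIT_SECONDS:
             sys.exit("storage still unavailable; keeping batch held")
         time.sleep(RETRY_SECONDS)
         waited += RETRY_SECONDS
     subprocess.run(ENABLE_BATCH, check=True)        # storage is up: allow job starts

 if __name__ == "__main__":
     main()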

Follow Up

No specific follow-up actions are being generated from this Post Mortem review. This incident was followed by another power incident only two weeks later. This second incident provided a second run-through of the start sequence and the actions from that review (see link below) superseded any from this one.

An analysis of the failure of the diesel generator to take over the load after the power cut has already taken place. The generator had not been tested frequently enough, and in particular it had not been tested following a change that added more equipment requiring its power. The cause of the cut-out has been understood and fixed, and the diesel generator has subsequently been tested under load.

Related issues

Around two weeks after this incident the RAL Tier1 suffered another power incident. Many of the issues that arose from the incident reported here were picked up in that following incident, which can be seen at RAL Tier1 Incident 20121120 UPS Over Voltage.

Reported by: Gareth Smith. 4th February 2013

Summary Table

Start Date 7th November 2012
Impact 100%
Duration of Outage 27 hours
Status Open
Root Cause Power Supply
Data Loss 166 ATLAS files, 1 ILC file. All in transit when power was lost.