RAL Tier1 Incident 20121120 UPS Over Voltage

RAL Tier1 Incident 20th November 2012: Error During Electrical Intervention led to Over-Voltage to Equipment on UPS Power.


During scheduled work on the UPS electrical system an error (neutral disconnected) led to an over-voltage being delivered to equipment connected to the UPS supply. Significant damage was sustained, mainly to PDUs and power supplies.


All Tier1 services were unavailable for around 24 hours. After this time core services (BDII, FTS, LFC) were restored. Castor storage and batch services were unavailable for around 50 hours. Batch services were reduced for around a week. The tape capacity was reduced (to around 50% of drives available) for the following week. There was no significant data loss although a few files that were in the process of being written at the time of the incident were lost.

Timeline of the Incident

When What
20th November 12:20 Major electrical incident during UPS work. Power lost in operations, first stage fire alarm activated at 12:22.
20th November 12:24 Text sent by Fabric Team staff in computer room to Fabric Team leader "PDUs in UPS room blown..."
20th November 12:25 Smoke in LPD and UPS rooms. Operations hit Emergency Power Off in both rooms
20th November 12:27 Text sent by Fabric Team staff in computer room to Fabric Team leader "Everything off. Smoke is coming out of pdus in ups."
20th November 12:27 Text sent by Fabric Team staff in computer room to Fabric Team leader "Evacuating building."
20th November 12:30 Security in attendance and activate building fire alarm to clear the building
20th November 12:49 Tier-1 Manager responds to Fabric Team Leader's text alert, returning from lunchtime run.
20th November 13:06 Tier-1 Manager texts PMB "Major power incident with smoke and pdu damage. May need to phone meet once we know state of play"
20th November 13:15 Tier-1 Manager receives callback from GRIDPP Project Leader. Provides status update.
20th November 13:15 John Gordon emails Peter Gronbech "We've had a fire incident in the UPS room so the T1 network and possibly the databases are down. Please broadcast if none yet sent"
20th November 13:30 Access to building restored but no building network and power off in LPD and HPD rooms
20th November 13:45 Tier-1 Manager sets team priorities: careful recertification of hardware rather than rapid restart. Fabric Manager to be responsible for recertification and for keeping the startup tightly controlled.
20th November 13:58 Tiju sends EGI broadcast "Due to problems with the electrical supply to our data centre, all services at RAL Tier1 are currently unavailable."
20th November 14:00 RAL (Tiju) reports at the WLCG daily meeting "emergency shutdown due to electrical problem. No time estimate for the end of the downtime." This was in response to a statement by ATLAS that there had been a fire at RAL.
20th November 14:21 Pete Gronbech sends EGI broadcast "The RAL Tier 1 centre in the UK has had a fire incident in the UPS room so the T1 network and possibly the databases are down"
20th November 14:30 Tier1 staff not involved in the immediate incident review the start-up sequence.
20th November 15:20 Power down of disk servers.
20th November 15:50 UPS Power restored although no air conditioning in the UPS room. (Rack PDUs still off.)
20th November 16:00 Power down of batch worker nodes.
20th November 16:30 Tier1 Core network switch (C300) in LPD room back up.
20th November 17:36 John Kelly sends EGI broadcast "As reported earlier, there has been a power failure at RAL. We have just stabilized the power situation and we are evaluating the state ..."
20th November 16:00 With no air conditioning in the UPS room, staff leave.
20th November 18:37 Tier-1 Manager sends status update to PMB and TB-SUPPORT
20th November 22:06 Tier-1 Manager provides update to Directors (SCD and PPD)
21st November 08:50 2 of the 3 CRACs now working in the UPS room; 7 of the 20 APC PDUs in the UPS room are blown; Tape robot has suffered a blown board (one out of four); C300 has some failed PSUs but is working
21st November 10:21 Production Manager e-mails status update to wlcg-operations.
21st November 10:36 Still replacing PDUs; UK CA is reported as up and running from Daresbury; GOCDB is still down.
21st November 12:00 Disaster management assessment Meeting 1 (and update to PMB and directors)
21st November 12:38 Top-BDII up on 2 machines (and DNS changed); Maia database data seems to be OK; the Fibre Channel switches for the non-Castor databases (i.e. Somnus, etc.) have both failed; the Somnus EMC data array seems to be OK at first glance.

21st November 12:55 Maia database up and seems OK; DMF is up but not yet talking to the tape system; ADS rack is up but one PDU blown. The ADS catalogue disk array has a memory error. All the tape robot FC switches are OK.
21st November 13:09 Hypervisor manager machine is up and OK. Starting to power on the hypervisor machines. Following machines are up: NIS, NFS, Touch, quattor01
21st November 13:52 nagger, postie, lcgwww and helpdesk front-end are all up. (Note helpdesk is not actually working as the database backend is still not up.)
21st November 14:24 FTS machines being started; myproxy systems up.
21st November 14:52 GOC DB available - Outages declared: For the Tier1 site except storage & batch: start back-dated to 20-11-2012 at 12:18, ending 21-11-2012 at 16:00. For storage & batch: start back-dated to 20-11-2012 at 12:18, ending 22-11-2012 late afternoon.

21st November 15:15 Castor primary database. (It has 2 Fibre Channel switches in the system, each with 2 PSUs. At the moment only one of the four PSUs is working.) Database seems OK; Castor secondary database is in a similar state; Helpdesk up; ganglia03 coming up; AFS service (all 3 machines) up; install02 up.

21st November 15:32 Castor databases all confirmed OK but not yet opened up for connections.
21st November 15:53 FTS database is running on Somnus - so FTS should start now. LFC service available. DMF service in production.

21st November 16:00 Team Review Meeting: Focus work on OGMA (Atlas 3D); Will now restart the disk servers; Shaun to generate a start order for castor head nodes; LPD room networking still being worked on; Working to start turning on power in LPD room.
22nd November 09:00 Review Meeting with whole team.
22nd November 10:16 Status: UIs up, ADS up; Networking in LPD room available; Tape Robot has around half its drives available; All disk servers up except the 2007 AMD Viglen 11 disk servers, which need special treatment as they were powered from the UPS. Network issues being investigated. OGMA not up.

22nd November 10:33 Alice & LHCb VO boxes up.
22nd November 11:13 Tier1 Castor headnodes not yet up; Facilities Castor headnodes up; Most disk servers available (all Facilities servers, and all but 44 Tier1 servers).
22nd November 11:18 Network stacks 10 and 11 are still causing networking problems to CASTOR disk servers.
22nd November 11:30 Team Review Meeting: Engineer is sourcing parts for the tape system; Castor head nodes can be turned on. Network stacks have been worked on (Stack 10 reset; Stack 11 has had one switch reset; Stack 4 has problems but appears to be working; Stack 13 is not resilient - one interlink is down.) OGMA still not available.

22nd November 13:10 All 4 tier1 Castor instances check out OK.
22nd November 13:45 Tape servers undergoing final checks. Three Tier1 Castor disk servers not yet available.
22nd November 14:10 All Tier1 disk servers ready.
22nd November 14:40 Castor downtime ended in GOC DB. Outage declared for batch until 17:00. At risk ('warning') added for whole site until 23rd Nov at 17:00.
22nd November 16:00 DM review meeting 2. Service coming up. Remaining at level 2 with 30% risk of escalation to level 3 owing to lack of resilience and potential for further complications.
22nd November 16:17 Starting to enable worker nodes. Re-installing 2010 and Viglen 2011 machines as EMI-2.
22nd November 17:01 Three clusters of batch workers back online. Passing SUM tests.
22nd November 17:09 Update to PMB and senior SCD staff
23rd November 09:00 Team Review Meeting: Number of problems with batch workers (AFS errors); Failing Atlas SRM tests (later found to be an Atlas problem) but not yet updating gridmap files (need castoradm1). OGMA up with only one node. Network packet loss problems from yesterday investigated and seem to be fixed. Viglen 2011 disk servers still have problems with the UPS supply. Suspect that one disk server has a failed PSU and is blowing fuses in the UPS supply when turned on. One disk server turned itself off last night.

28th November Batch service returned to full capacity.
4th December Replacement power supplies for the Fibre Channel SANs arrive and are put into service, restoring resilience to the Castor database infrastructure.

Incident details

On Tuesday 20th November planned work was undertaken on a power distribution board fed from the UPS supply to the machine room in building R89, which houses the Tier1. To avoid powering down all equipment connected to the UPS, an alternative power supply was cabled in ahead of the work. An error was made during the establishment of this alternative supply, with some neutral cables not being connected. This resulted in an over-voltage being experienced by equipment connected to the UPS supply, and many components suffered an immediate failure. This occurred around 12:20. The sound of the failures could be heard, and some smoke (from the failure of electrical components) was seen. Site security personnel arrived rapidly and activated the fire alarms to evacuate the building. Local machine room operations staff used the Emergency Power Off buttons to shut down both the UPS and LPD rooms (both of which contain equipment connected to the UPS supply).
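The report gives no electrical detail beyond the disconnected neutral, but the over-voltage mechanism is standard for three-phase supplies. As an illustrative sketch (assuming a UK 230/400 V three-phase system; these figures do not appear in the report): with the neutral lost, the star point of an unbalanced load floats to the potential given by Millman's theorem,

```latex
V_N = \frac{V_a/Z_a + V_b/Z_b + V_c/Z_c}{1/Z_a + 1/Z_b + 1/Z_c}
```

where $V_a, V_b, V_c$ are the phase voltages and $Z_a, Z_b, Z_c$ the per-phase load impedances. Each load then sees $V_{\text{phase}} - V_N$ instead of the nominal 230 V; in the worst case a lightly loaded phase approaches the full line-to-line voltage, $\sqrt{3}\times 230 \approx 400\,\mathrm{V}$, which is consistent with the immediate failure of PDUs and power supplies.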

After the incident, rack-level circuit breakers were switched off and other equipment in the HPD room was powered down. Following investigation, power was restored in the UPS & LPD rooms at 15:50. However, the air conditioning units in the UPS room had suffered damage and did not start; air-conditioning in the room was not re-established until around 18:30. The Tier1 Fabric Team concentrated on re-establishing the core Tier1 network (the C300 switch) located in the HPD room. This was achieved at around 16:30 using power supplies from the standby unit. In parallel the maintenance engineer for the tape robot came on site and during the afternoon succeeded in getting the robot working again. However, by the end of the day the amount of damage to the Tier1 equipment powered from the UPS supply remained unknown.

On Wednesday (21st November) the process of powering on and assessing the state of equipment in the UPS room continued. Work progressed cautiously and it was necessary to replace many Power Distribution Units. The network stack in the UPS room was brought up and core services could start to be re-established. The Top-BDII was the first service back in production, at 12:38. The restarting of database systems was undertaken during the day. Although there had been failures in many of the power supplies for the Fibre Channel SAN switches in the database infrastructure, the main databases (Castor, LFC, FTS) were found to be intact. During the afternoon the infrastructure supporting virtual machines was started, along with key internal services (e.g. nagios) and a further set of published services (FTS, MyProxy, LFC). Work started on re-establishing missing services in the LPD room, which contains some equipment also powered via the UPS. Following the loss of most of the switches in a network stack it was necessary to recreate the networking in this area.
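The restart followed a dependency order: network first, then databases, then the services built on them. The site's actual start-up cookbook is not reproduced in this report, but the idea of deriving a safe start order from declared dependencies can be sketched as a topological sort (the service names and edges below are illustrative, loosely reconstructed from the narrative, not the Tier1's real configuration):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical dependency map: each service lists what must be up first.
DEPENDS_ON = {
    "network": [],
    "databases": ["network"],
    "top-bdii": ["network"],
    "vm-infrastructure": ["network"],
    "nagios": ["vm-infrastructure"],
    "lfc": ["databases"],
    "fts": ["databases"],
    "castor": ["databases", "network"],
    "batch": ["castor"],
}

def start_order(depends_on):
    """Return one valid service start order, dependencies first."""
    return list(TopologicalSorter(depends_on).static_order())
```

Encoding the order this way means a cookbook can be regenerated mechanically after services are added or re-homed, rather than being maintained by hand.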

On Thursday (22nd) work continued starting the services in the LPD room including the Castor headnodes. Castor services were restarted at around 14:30. Batch services were also restarted during the afternoon.

Following the restoration of services the Tier1 ran with a reduced batch service as nodes were progressively brought back on line. The failure of numerous power supplies in the tape robot required that the number of tape drives in use at any time be reduced by around 50%. Work then focused on obtaining replacements for the equipment (mainly power supplies and Power Distribution Units) that had been lost, in order to regain resilience.


The UPS supplies the Tier-1's most critical systems, the ones required to keep running even during a major incident. Widespread damage to this critical infrastructure was a major blow to the service and presented a considerable challenge for the team to restore service promptly. Considering the extent of the damage, it was remarkable that critical services were restarted within about 24 hours and the full service was operational (at reduced capacity) within 48 hours. The ingenuity of the team, the considerable level of resilience built into the critical core services, and the availability of spare parts scavenged from a variety of sources allowed the team to piece the service back together about as quickly as might be considered possible.

It should be noted that two weeks earlier the Tier1 had suffered a power outage (which affected the whole of RAL). During that incident the diesel generator failed to pick up the load and all services, including those on the UPS, were unavailable for up to a day. Although different in nature, the previous incident served as a useful training exercise; nevertheless, a second major incident in such quick succession was a blow to team morale.

The intervention on the 20th November was planned. Early planning for this work proposed a shutdown of all services connected to the UPS. However, following a re-evaluation of the work it was recognised that, by providing a temporary feed, all (UPS) services could remain on and that the operation would be a low-risk intervention. However, as described in the details above, an error led to an over-voltage being applied to equipment connected to the UPS, causing significant damage. Dual-power-fed systems (2011 disk servers) survived both this incident and the previous one, allowing enough time for a clean shutdown.

Damage was mainly sustained by power distribution units (PDUs), power supplies and network switches. The restart was cautious, with an emphasis on validating the infrastructure rather than risking a quick restart. This approach was vindicated: although in many cases the PDUs had protected equipment, there was one instance where an undetected faulty power supply broke a replacement PDU. The significant resources available to the Tier1 enabled alternative equipment to be brought into use (e.g. replacing network components), as well as various configuration changes that facilitated the restart. Whilst in some areas, particularly the power supplies for the Fibre Channel SAN switches, the number of remaining available components was pushed to the limit, in many areas (PDUs, network switches) resources remained available to tackle the problems, albeit with some creative solutions.

The GOC DB is also hosted at RAL and failed at the same time; as a result it was not possible to declare downtimes in it. However, the 'broadcast' mechanism was successfully used to inform the VOs and the wider project of the problems. A mis-communication led to some initial reports of there having been a fire, a message that was quickly corrected. During the first part of the incident the Tier1 team did not yet know the extent of the damage, a situation which only became clearer during the morning of the second day (21st Nov). Whilst communications about the services' operational state took place by a number of routes, subsequent feedback indicates that a full picture of what had happened was not distributed early or widely enough. In particular, some indication of the likely timescales for a return of services and of the likely retention of existing data would have been helpful to VOs, even if only a 'best guess' early on.

The work of evaluating the damage and restoring networking and power to individual systems fell primarily on the Tier1 Fabric Team. Whilst the overall direction of the recovery was guided, there was insufficient focus on machine room priorities. Additional support could have been provided to the Fabric Team, for example by having others deal with some of the tasks less dependent on specific knowledge of our systems (e.g. network re-cabling). Although this may not have made much difference to the overall recovery time, it would have eased pressure on a small number of individuals. The occurrence of the problem during a working week when key staff were present was fortunate, although the incident highlighted some areas where expertise is concentrated in only two (or in a few cases only one) people. In contrast, the overall start-up went well, with staff having the experience of the power failure only two weeks earlier fresh in their minds. Service owners were ready to check out their services and make them quickly ready for production once the hardware issues had been resolved.

The Tier1 Disaster Management process was triggered. As the recovery of services was already proceeding, this focused on re-establishing resilience and ensuring financial resources were available if required. Although the service was restarted with a lack of resilience in some areas, the wider team still had resources available that could have been used had the damage been more extensive. These included borrowing networking equipment, PDUs, etc. from other parts of the organisation, and calling on staff elsewhere.

The ramp-up of the batch service to full capacity took an extended time. This incident occurred while the worker nodes were being upgraded to a new (EMI-2) software version. Two of the ten batches of machines had been upgraded the day before the incident. As part of the progressive ramp-up of the batch system, the opportunity was taken to apply the upgrade to more of the worker nodes. This delayed the re-establishment of full batch capacity but avoided having to drain jobs out again in the following days. Following the power outage two weeks earlier, a local 'cron' job had been set up to shut down systems if they detected an over-temperature. For some nodes this triggered erroneously, which took time to investigate before being resolved by a firmware update on the systems concerned.
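The report does not describe how the over-temperature cron job was implemented. As a minimal sketch of the kind of check involved (the sensor path, threshold, and shutdown command below are all assumptions, not details from the report):

```python
"""Sketch of an over-temperature shutdown check, run periodically from cron."""
import subprocess
from pathlib import Path

# Hypothetical hwmon sensor file (values in millidegrees C) and limit.
SENSOR_FILE = Path("/sys/class/hwmon/hwmon0/temp1_input")
THRESHOLD_C = 85.0

def read_temp_c(sensor_file=SENSOR_FILE):
    """Read a temperature in degrees Celsius from an hwmon-style file."""
    return int(sensor_file.read_text().strip()) / 1000.0

def should_shutdown(temp_c, threshold_c=THRESHOLD_C):
    """True if the node has exceeded its over-temperature limit.

    A firmware fault that reports a bogus sensor reading will trip this
    check erroneously, which is the failure mode seen on some nodes.
    """
    return temp_c >= threshold_c

def check_and_shutdown():
    """Cron entry point: power the node off cleanly if too hot."""
    if SENSOR_FILE.exists() and should_shutdown(read_temp_c()):
        subprocess.run(["/sbin/shutdown", "-h", "now"], check=False)
```

A check like this is only as trustworthy as its sensor: as the incident showed, a spurious reading shuts healthy nodes down, so sanity-checking the raw value (or requiring two consecutive high readings) would be a natural hardening step.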

Follow Up

Issue Response Done
Start Sequence Documentation (cookbook) out of date and needs updating. Capture the lessons learnt in the start-up after this incident and update this documentation. Yes
Some systems critically dependent on too few staff. Review systems to identify areas with too few staff having expertise. No
No call-out was generated automatically to notify Tier1 staff of the incident. Review the call-out system for the case where UPS power is lost (either momentarily or for a longer time). No
The optimisation of staff effort at critical points in the start-up can be improved. Create a checklist/agenda for review meetings that includes managing staff effort. Yes

Related issues

The Tier1 suffered an outage as a result of a site-wide power outage two weeks earlier. Post Mortem of this earlier incident at RAL Tier1 Incident 20121107 Site Wide Power Failure

Reported by: Gareth Smith. 27th November 2012

Summary Table

Start Date 20 November 2012
Impact >80%
Duration of Outage 24 hours for core services. Around 50 hours for Castor and batch.
Status Open
Root Cause Human Error
Data Loss No