RAL Tier1 Incident 20090324 Power dip caused Tier1 outage


Site: RAL-LCG2

Incident Date: 2009-03-24 & 25

Severity: Field not defined yet

Service: Entire Site

Impacted: All VOs

Incident Summary: Two large power dips across the RAL site on the morning of Tuesday 24th March led to a complete outage of the Tier1. These were caused by a failure in an underground power cable on the site.

Type of Impact: All Services Unavailable

Incident duration: 33 hours

Report date: 2009-03-25

Reported by: Gareth Smith

Related URLs: None

Status of this report: Incident and comment logged. Appropriate actions will be decided at a review to be held in the coming days and recorded here.

Incident details:

Date Time Who/What Entry
2009-03-24 05:20 Power glitch across RAL site. Various services at the Tier1 (including the GOC DB) fail. However, no pager call.
2009-03-24 05:30 Operations Called to Site
2009-03-24 06:30 Operations on-call on Site
2009-03-24 07:00 Operations team leader on Site
2009-03-24 07:20? M. Bly Received notification of power problems (from Atlas operations team).
2009-03-24 07:15 A. Sansum Unrelated to the power outage (not yet advised of it), checked Nagios state, which seemed OK. Matt Viljoen also confirmed that Nagios was supplying old (out of date) information from before the power failure.
2009-03-24 07:30 M. Bly Contacted Primary Tier-1 on-call (Andrew Sansum)
2009-03-24 07:45 A. Sansum Primary Tier-1 on-call contacted CASTOR on-call and was contacted by GRID on-call. Could not access the wiki to confirm who the database on-call was, so did not call them. Not very worried as the Nagios state looked OK (incorrectly).
2009-03-24 08:20 M.Viljoen Contact Gilles Mathieu (at CHEP09) to inform of GOCDB outage and initiate a failover of GOCDB to ITWM in Germany
2009-03-24 08:22 M.Hodges Sent EGEE broadcast. (Could not put the site into outage as the GOC DB was down.)
2009-03-24 08:28 AoD (G.Smith) E-mail sent to UK experiment reps, GridPP_users and Atlas_uk_comp_operations lists alerting them of problem.
2009-03-24 08:40 Another power glitch across site. Decision taken to shut down all (? or just Castor) rebooted servers until a stable power supply is achieved. All running servers are shut down cleanly.
2009-03-24 09:30 Tier1 team Meeting to discuss restart and implement the planned sequence once stable power is confirmed.
2009-03-24 09:39 E-mail received stating power now OK.
2009-03-24 09:57 E-mail received stating power NOT OK.
2009-03-24 10:10 E-mail received stating power now OK.
2009-03-24 10:17 AoD (G.Smith) E-mail sent to UK experiment reps, GridPP_users and Atlas_uk_comp_operations giving more details and stating "It is hoped that essential services will be restored by early afternoon and full service by the end of the day."
2009-03-24 10:20 First pager message received indicating a problem.
2009-03-24 10:30 Confirmed site networking OK. Tape robotics OK.
2009-03-24 10:38 Failover GOCDB up and running, publishing the downtime for the Tier1 outage.
2009-03-24 10:45 NIS and home file servers OK.
2009-03-24 11:00 M. Bly Local network (176) OK.
2009-03-24 11:15 J. Wheeler mailer (PAT) OK.
2009-03-24 11:40 M. Viljoen Castor Database servers and LSF license server up.
2009-03-24 11:45 Network problems discovered internal to the Tier1.
2009-03-24 11:45 Team meeting to plan.
2009-03-24 13:00 M. Viljoen Castor LSF servers up
2009-03-24 13:00 Team meeting to plan. The Tier1 network has a loop and doesn't work. That is being given highest priority.

We know we have the following failures: 3 PSUs or PDUs and one network switch.

2009-03-24 13:45 M. Bly Network problems resolved (loop found).
2009-03-24 13:50 Team meeting to plan.
2009-03-24 14:15 Site BDII OK, Castor ADM up
2009-03-24 14:30 Top Level (UK) BDII up, Castor LSF server up.
2009-03-24 14:40 Puppet master up.
2009-03-24 14:50 AoD (G.Smith) E-mail sent to UK experiment reps, GridPP_users and Atlas_uk_comp_operations stating BDII and LFC up (or up soon).
2009-03-24 15:00 Castor Nameserver up, LFC/FTS Database up (LFC up)
2009-03-24 15:30 J. Wheeler Nagios up (alarms disabled on it).
2009-03-24 15:40 3D Databases up (But note: problems subsequently found on Atlas 3D database - ogma)
2009-03-24 16:00 C.Condurache VO Boxes up.
2009-03-24 16:00 Team meeting to plan.
2009-03-24 16:30 M.Bly Critical filesystems up.
2009-03-24 17:00 G. Smith Ended whole site outage in GOC DB. New outage entered covering only the SRMs (Castor), UIs and CEs, along with OGMA (Atlas 3D database), until midday tomorrow (25th).
2009-03-24 17:14 AoD (G.Smith) E-mail sent to UK experiment reps, GridPP_users and Atlas_uk_comp_operations. Stating site up except Castor (all SRM end points); All CEs; Atlas 3D database ('ogma'); WMS02.
2009-03-24 17:37 Delayed message in system received (RT ticket, e-mail) from H.Patel announcing power problems!
2009-03-24 18:00 Fabric team All Castor disk servers powered on and started to 'fsck' the disks where appropriate.
2009-03-25 ??:?? M.Bly Kernel updates applied to disk servers. Start investigating some remaining disk server issues.
2009-03-25 09:36 C.Cioffi Report Atlas 3D database back up. (09:58 - AoD sends e-mail to Atlas UK Comp Operations informing them of this).
2009-03-25 10:55 AoD (G.Smith) Status update sent to UK experiment reps, GridPP_users and Atlas_uk_comp_operations lists.
2009-03-25 12:00 Batch scheduler running.
2009-03-25 12:00 2nd pass of checking disk servers completed.
2009-03-25 12:00 AoD (G.Smith) GOC DB updated. Downtimes extended for Castor and CEs (batch system).
2009-03-25 13:23 Castor available. Outage ended in GOC DB.
2009-03-25 14:30 Batch system available. Outage on CEs ended.
2009-03-25 15:30 J.Wheeler Nagios functioning (following database problem).
2009-03-25 15:55 AoD (G.Smith) Notification of all services up sent to UK experiment reps, GridPP_users and Atlas_uk_comp_operations lists.


Future mitigation:


  • Determine exact cause of power outage and obtain assurance that measures have been taken to avoid a repeat incident.
    • Staff were about to restart some services, before confirmation that the root cause of the power problem had been resolved, when the second glitch struck. Review the procedure for verifying that power issues are fully resolved.
  • There was no automatic callout to the pager system following the first power glitch.
    • Understand this and take appropriate corrective action.
  • There was a planned 'At Risk' time for a networking change. This change was carried out while most systems were down and may have contributed to the network-related delays.
    • Understand this and improve the process by which planned work may (or may not) be carried out during an ongoing downtime for another reason.
  • Review the startup sequence and look where this can be optimised.
    • Ensure nothing has been overlooked in the plan.
    • Consider different cases depending on out-of-hours/in-hours, when more (or fewer) people are available.
    • Update CASTOR cold startup sequence documentation (Castor team, Cheney Ketley)
    • Nagios was not functioning until after all externally visible services were restored. Review where this appears in the priority list. Much time was spent trying to recover the Nagios database, although this is not critical to its operation. The suspension of Nagios alarms did help during the early phases of the outage.
  • Other issues
    • Determine location of all CASTOR servers and make this information available to others (Fabric team)
    • Review labels of central CASTOR servers to avoid ambiguity in identification (Fabric team)
    • Investigate whether we would have coped better had the Tier1 been located in R89
    • Lack of trolleys for consoles in the machine rooms caused some delays.
    • As the situation changed, updates to the GOC DB were made too late, shortly before the existing outage(s) expired. Need to ensure that these are made earlier.
  • Subsequent problems (probably) provoked by the power glitches.
    • The failover of MyProxy and the FTS outage on Friday were both probably related to the power failure. E-mails about the RAID 1 failures were sent to farm root mail early Tuesday afternoon, and an opportunity existed on Tuesday to fix the FTS problems while CASTOR remained down. (A Nagios check to alarm on such RAID 1 failures is being implemented; this would require Nagios to be up. A minimal illustrative sketch of such a check is given after this list.)
    • There were two separate failures of the tape system in the days following the outage.
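
As an illustration of the RAID 1 check mentioned above, here is a minimal sketch of a Nagios-style plugin. It assumes the disk servers expose Linux software RAID status via /proc/mdstat (hardware RAID controllers would need the vendor's query tool instead); the names and details are illustrative and this is not the actual check being deployed at RAL.

 #!/usr/bin/env python
 # Sketch of a Nagios-style plugin that raises CRITICAL when a RAID 1 array
 # is degraded. Assumes Linux software RAID reported via /proc/mdstat; the
 # deployment details (check name, host list) are illustrative only.
 import re
 import sys

 # Standard Nagios plugin exit codes.
 OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

 def degraded_arrays(mdstat_path="/proc/mdstat"):
     """Return the md devices whose status shows a missing member ('_')."""
     degraded = []
     current = None
     with open(mdstat_path) as mdstat:
         for line in mdstat:
             name = re.match(r"^(md\d+)\s*:", line)
             if name:
                 current = name.group(1)
                 continue
             # Status lines end like "[2/2] [UU]"; '_' marks a failed member.
             status = re.search(r"\[\d+/\d+\]\s*\[([U_]+)\]", line)
             if current and status and "_" in status.group(1):
                 degraded.append(current)
     return degraded

 if __name__ == "__main__":
     try:
         bad = degraded_arrays()
     except IOError as err:
         print("RAID UNKNOWN - %s" % err)
         sys.exit(UNKNOWN)
     if bad:
         print("RAID CRITICAL - degraded array(s): %s" % ", ".join(bad))
         sys.exit(CRITICAL)
     print("RAID OK - all arrays healthy")
     sys.exit(OK)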

Update 2009-05-06: Two meetings have been held to review the issues raised. Actions from these are detailed below:

  • 2009-04-08: Review all issues other than the startup sequence. Actions following this:
    • Emphasize to the whole team that in the event of an incident (such as the power glitches), the Primary On-Call is the first point of contact.
    • Review how site security makes contact with Tier1 staff.
    • Update restart sequence to REQUIRE we are certain the power is OK before restarting.
    • Define protocol for how we verify if there has been a power outage and if power now OK.
    • Understand why there was no pager callout.
    • Modify procedures such that no unnecessary changes are made during the restart after a power outage.
    • Add verification of the network to the startup sequence.
    • Verify labelling of machines and make the MIMIC display more accurately represent the physical layout.
    • Verify availability of more console trolleys in new building.
    • Improve timeliness of updating of GOC DB entries in such an incident.
  • 2009-04-23: Review of startup sequence. Actions following this:
    • MS Project and dependency diagram to be updated in light of discussions held (a minimal illustrative sketch of a dependency-ordered startup list is given after this list).
    • Gather further updates and corrections to the above documents.
    • Review the startup of the Castor disk servers and how that can be speeded up.
    • Use information to revise startup document.
    • Use the move to the new computer building to test components of the plan and familiarise staff.
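
The startup-sequence actions above revolve around a dependency diagram. Purely as an illustration, the sketch below shows how a machine-readable dependency map could be used to derive a restart order; the service names and dependencies are assumptions loosely taken from the timeline in this report, not the actual RAL dependency diagram or MS Project plan.

 # Minimal sketch: derive a restart order from a service dependency map.
 # The service names and dependencies below are illustrative assumptions
 # loosely based on the timeline above, not the actual RAL plan.

 # service -> services that must already be up before it is started
 DEPENDS_ON = {
     "site_network":      set(),
     "tier1_network":     {"site_network"},
     "nis_home_fs":       {"tier1_network"},
     "nagios":            {"tier1_network"},
     "castor_databases":  {"tier1_network"},
     "lsf_license":       {"tier1_network"},
     "castor_lsf":        {"castor_databases", "lsf_license"},
     "castor_nameserver": {"castor_databases"},
     "site_bdii":         {"tier1_network"},
     "lfc_fts":           {"castor_databases", "site_bdii"},
     "castor_srm":        {"castor_nameserver", "castor_lsf"},
     "batch_ces":         {"castor_srm", "site_bdii"},
 }

 def startup_order(depends_on):
     """Kahn's algorithm: order services so dependencies always come first."""
     remaining = {svc: set(deps) for svc, deps in depends_on.items()}
     order = []
     while remaining:
         ready = sorted(svc for svc, deps in remaining.items() if not deps)
         if not ready:
             raise ValueError("circular dependency among %s" % sorted(remaining))
         for svc in ready:
             order.append(svc)
             del remaining[svc]
         for deps in remaining.values():
             deps.difference_update(ready)
     return order

 if __name__ == "__main__":
     for step, service in enumerate(startup_order(DEPENDS_ON), start=1):
         print("%2d. start %s" % (step, service))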

Related issues:

There was an (unrelated?) issue with the CIC system not sending downtime notifications that coincided with this incident. This hampered notifications.

The root cause was the failure of a junction in an underground power cable at RAL.

Timeline

Event Date Time Comment
Actually Started 2009-03-24 05:20 First power glitch
Fault first detected 2009-03-24 M.Bly received notification by phone
First Advisory Issued 2009-03-24 08:22 EGEE Broadcast
First Intervention 2009-03-24 10:10 Confirmation that the power was OK. (Preparatory work had started before this, but we were unable to start the recovery.)
Fault Fixed 2009-03-25 14:30 Batch systems re-opened.
Announced as Fixed 2009-03-25 14:30 End last part of outage in GOCDB
Downtime(s) Logged in GOCDB 2009-03-24 & 25 All the following as Unscheduled Outages
whole site: 2009-03-24, 05:20 to 12:00
whole site: 2009-03-24, 12:00 to 17:00
whole site: 2009-03-24, 17:00 to 18:00
WMS02: 2009-03-24, 17:00 to 18:00
SRMs; UIs; CEs; 3D: 2009-03-24, 17:00 to 2009-03-25, 12:00
Atlas Castor 2009-03-25, 12:00 to 13:23
Castor (all except Atlas) 2009-03-25, 12:00 to 13:23
CEs 2009-03-25, 12:00 to 14:30
Other Advisories Issued 2009-03-24 & 25 E-mail to UK experiment reps, GridPP_users and Atlas_uk_comp_operations lists.