RAL Tier1 Incident 20150408 network intervention preceding Castor upgrade
RAL-LCG2 Incident 20150408 network intervention preceding Castor upgrade
Description
Change control for Castor upgrade: https://helpdesk.gridpp.rl.ac.uk/Ticket/Display.html?id=148453
Castor upgrade procedure: https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/CastorUpgradeTo211415
RT ticket tracking upgrade: https://helpdesk.gridpp.rl.ac.uk/Ticket/Display.html?id=149684
The Castor team were aware of the network intervention but did not include it in their plan as it was believed to be minor.
Impact
The entire Tier 1 and all services were declared in downtime until the following day.
Timeline of the Incident
When | What |
---|---|
8/04/2015 10:00 | Downtime starts in the GOCDB. Very shortly afterwards the network dropped (the offices could not contact the machine room). The Castor team (Rob A) spoke to Fabric. Martin Bly was surprised that the network had gone down, but he also believed Castor was already down. |
8/04/2015 10:15 - 10:40 | Network is restored and Castor is brought down cleanly. DB team asked to start work. Rob A commits quattor change for software upgrade (as planned). |
8/04/2015 11:45 | Castor change is abandoned and rollback starts. Note: the DB team never started the upgrade. |
Incident details
Analysis
This section to include a breakdown of what happened. Include any related issues.
Follow Up
This is what we used to call future mitigation. Include specific points to be done. It is not necessary to use the table below, but may be easier to do so.
Issue | Response | Done |
---|---|---|
Issue 1 | Mitigation for issue 1. | Done yes/no |
Issue 2 | Mitigation for issue 2. | Done yes/no |
Related issues
List any related issue and provide links if possible. If there are none then remove this section.
Reported by: Your Name at date/time
Summary Table
Start Date | 18th February 2014 |
Impact | Select one of: >80%, >50%, >20%, <20% |
Duration of Outage | 2 x 3-hour outages, 6 hours in total |
Status | Draft |
Root Cause | Select one from Unknown, Software Bug, Hardware, Configuration Error, Human Error, Network, User Load |
Data Loss | Yes/No |