RAL-LCG2 Incident 20150408 network intervention preceding Castor upgrade

Change control for Castor upgrade: https://helpdesk.gridpp.rl.ac.uk/Ticket/Display.html?id=148453

Castor upgrade procedure: https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/CastorUpgradeTo211415

RT ticket tracking upgrade: https://helpdesk.gridpp.rl.ac.uk/Ticket/Display.html?id=149684

The Castor team were aware of the network intervention but did not include it in their plan as it was believed to be minor.


Impact

The entire Tier 1 and all services were declared in downtime until the following day.
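
The downtime was declared and tracked in the GOCDB (see the timeline below). For reference, the state of a declared downtime can also be checked programmatically. The Python sketch below is illustrative only: it assumes the public GOCDB programmatic interface and its get_downtime method with the topentity and ongoing_only parameters, and it deliberately does not assume the field names inside each downtime entry, printing each entry field by field instead.

    #!/usr/bin/env python
    """Check GOCDB for ongoing downtimes at RAL-LCG2.

    Illustrative sketch only.  It assumes the public GOCDB programmatic
    interface (method=get_downtime) is reachable without a client
    certificate; verify the endpoint and parameters against the current
    GOCDB PI documentation before relying on it.
    """
    import sys
    import xml.etree.ElementTree as ET

    import requests

    GOCDB_PI = "https://goc.egi.eu/gocdbpi/public/"


    def ongoing_downtimes(site="RAL-LCG2"):
        # get_downtime with ongoing_only=yes should return only downtimes
        # currently in effect for the given top-level entity (the site).
        params = {"method": "get_downtime", "topentity": site, "ongoing_only": "yes"}
        reply = requests.get(GOCDB_PI, params=params, timeout=30)
        reply.raise_for_status()
        root = ET.fromstring(reply.content)
        # Each child of the results element is one downtime entry; collect
        # its fields generically rather than assuming particular tag names.
        return [{child.tag: (child.text or "").strip() for child in entry} for entry in root]


    if __name__ == "__main__":
        entries = ongoing_downtimes()
        if not entries:
            print("No ongoing downtime recorded in GOCDB for RAL-LCG2")
            sys.exit(0)
        for entry in entries:
            print(", ".join("%s=%s" % (key, value) for key, value in sorted(entry.items())))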

Timeline of the Incident

When What
8/04/2015 8:10 Condor set not to start jobs for non-LHC VOs. Reminder email sent to gridpp users mailing list.
8/04/2015 10:00 Downtime starts in the GOCDB. Very shortly afterwards network connectivity between the offices and the machine room dropped. Castor team (RA) spoke to Fabric. MB was surprised that the network had gone down but he also believed Castor was already down.
8/04/2015 10:31 GOC DB entry added "Networking problem at RAL, we are investigating"
8/04/2015 10:15 - 10:40 Network is restored and Castor is brought down cleanly. DB team asked to start work. RA commits quattor change for software upgrade (as planned).
8/04/2015 11:05 (Thousands of) alarms due to network problems.
8/04/2015 11:45 Castor change is abandoned and rollback starts. Note: the DB team never started the upgrade; the time was spent on DB backups.
8/04/2015 13:40 JK puts entire site into downtime until 18:00.
8/04/2015 14:00 MB, TF and JA spend several hours attempting to restore functionality and stability to routers and core switches.
8/04/2015 17:00 - 17:20 Review meeting with MB, JA, JK and RA. Network has been reverted to original state. We agreed to bring Castor back up.
8/04/2015 17:30 Castor is brought up; however, basic Castor tests do not run cleanly even with no load on the service. People have gone home, so Castor is left 'up' but the site is declared in downtime until 14:00 the following day.
8/04/2015 18:07 Next GOC DB entry "Network problems continue at RAL"
8/04/2015 17:30 - 21:30 AD speaks to RA and JK to understand the extent of the problems. We check whether there is any disaster recovery documentation on the twiki (there is, but it is from 2009). A plan for the following day is emailed out.
8/04/2015 18:57 Next GOC DB entry - for overnight "Network problems continue at RAL"
9/04/2015 09:15 CASTOR tests all reporting green; the errors seen last night are no longer visible.
9/04/2015 09:30 Meeting in production area. There was concern about high CPU load on 2 switches and unbalanced network links. Decided to run a few more Castor tests, run ping tests (a scripted version of such a check is sketched below, after the timeline) and then reboot the Z9000 #2.
9/04/2015 09:58 Update to Tier1 Dashboard "There are problems on the Tier1 network. We are working on it and expect a further update later today."
9/04/2015 10:00 ping tests OK, MB reboots the second Z9000. After a settling down period and further tests showing green, it was decided to end the downtime at 11:00. VOs informed.
10/04/2015 17:00 Final problems with disk servers resolved (Castor team, following notification of an LHCb problem from RN).
13/04/2015 12:00 In response to a ticket from t2k, the non-LHC VOs were re-enabled on the batch farm.
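
The ping tests on the morning of 9/04 (before the second Z9000 was rebooted) were run by hand. The sketch below shows how such a check could be scripted; the host names are hypothetical placeholders rather than the actual Tier1 device list, and the pass criterion is simply that every listed host answers ICMP echo within a short timeout.

    #!/usr/bin/env python
    """Ping sweep used as a sanity check before/after a network change.

    Sketch only: replace HOSTS with the real routers, core switches and a
    sample of disk servers to be checked.
    """
    import subprocess

    # Hypothetical placeholders for the devices that were checked by hand.
    HOSTS = [
        "router-a.tier1.example",
        "z9000-1.tier1.example",
        "z9000-2.tier1.example",
        "castor-disk-001.tier1.example",
    ]


    def answers_ping(host, count=3, timeout=5):
        """Return True if the host replies to ICMP echo within the timeout."""
        result = subprocess.run(
            ["ping", "-c", str(count), "-W", str(timeout), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0


    if __name__ == "__main__":
        failures = [host for host in HOSTS if not answers_ping(host)]
        if failures:
            print("Ping FAILED for: %s" % ", ".join(failures))
            raise SystemExit(1)
        print("All %d hosts answered ping." % len(HOSTS))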

Incident details

Analysis

This section should include a breakdown of what happened. Include any related issues.


Follow Up

This is what we used to call future mitigation. Include specific points to be done. It is not necessary to use the table below, but may be easier to do so.


Issue Response Done
Issue 1 Mitigation for issue 1. Done yes/no
Issue 2 Mitigation for issue 2. Done yes/no

Related issues

List any related issue and provide links if possible. If there are none then remove this section.


Reported by: Your Name at date/time

Summary Table

Start Date 8th April 2015
Impact Select one of: >80%, >50%, >20%, <20%
Duration of Outage Approximately 25 hours (from the start of the scheduled downtime at 10:00 on 8/04/2015 until the downtime was ended at 11:00 on 9/04/2015)
Status Draft
Root Cause Select one from Unknown, Software Bug, Hardware, Configuration Error, Human Error, Network, User Load
Data Loss Yes/No