Difference between revisions of "RAL Tier1 Incident 20150408 network intervention preceding Castor upgrade"

From GridPP Wiki

Revision as of 11:39, 1 May 2015

RAL-LCG2 Incident 20150408 network intervention preceding Castor upgrade

Change control for Castor upgrade: https://helpdesk.gridpp.rl.ac.uk/Ticket/Display.html?id=148453

Castor upgrade procedure: https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/CastorUpgradeTo211415

RT ticket tracking upgrade: https://helpdesk.gridpp.rl.ac.uk/Ticket/Display.html?id=149684

The Castor team were aware of the network intervention but did not include it in their plan as it was believed to be minor.

Description

Impact

The entire Tier 1 and all services were declared in downtime until the following day.

Timeline of the Incident

When What
8/04/2015 8:10 Condor set not to start jobs for non-LHC VOs. Reminder email sent to gridpp users mailing list.
8/04/2015 10:00 Downtime starts in the GOCDB. Very shortly afterwards network connectivity between the offices and the machine room dropped. Castor team (RA) spoke to Fabric. MB was surprised that the network had gone down but he also believed Castor was already down.
8/04/2015 10:31 GOC DB entry added "Networking problem at RAL, we are investigating"
8/04/2015 10:15 - 10:40 Network is restored and Castor is brought down cleanly. DB team asked to start work. RA commits quattor change for software upgrade (as planned).
8/04/2015 11:05 (Thousands of) alarms due to network problems.
8/04/2015 11:45 Castor change is abandoned and rollback starts. Note: DB team never started upgrade, the time was spent on DB backups.
8/04/2015 13:40 JK puts entire site into downtime until 18:00.
8/04/2015 14:00 MB, TF and JA spend several hours attempting to restore functionality and stability to routers and core switches.
8/04/2015 17:00 - 17:20 Review meeting with MB, JA, JK and RA. Network has been reverted to original state. We agreed to bring Castor back up.
8/04/2015 17:30 Castor is brought up, however basic Castor tests do not run cleanly even with no load on service. People have gone home so Castor is left 'up' but site is declared in downtime until 14:00 the following day.
8/04/2015 18:07 Next GOC DB entry "Network problems continue at RAL"
8/04/2015 17:30 - 21:30 AD speaks to RA and JK to understand extent of problems. We check if there is any disaster recovery documentation on twiki (there is but it is from 2009). Email out plan for following day.
8/04/2015 18:57 Next GOC DB entry - for overnight "Network problems continue at RAL"
9/04/2015 09:15 CASTOR tests all reporting green, errors seen last night no longer visible.
9/04/2015 09:30 Meeting in production area. There was concern about high CPU load on 2 switches and unbalanced network links. Decided to run a few more Castor tests, run ping tests and then reboot the Z9000 #2.
9/04/2015 09:58 Update to Tier1 Dashboard "There are problems on the Tier1 network. We are working on it and expect a further update later today."
9/04/2015 10:00 Ping tests OK, MB reboots the second Z9000. After a settling-down period and further tests showing green, it was decided to end the downtime at 11:00. VOs informed.
10/04/2015 17:00 Final problems with Disk servers resolved. (Castor Team - following notification of LHCb problem from RN).
13/04/2015 12:00 In response to a ticket from t2k, the non-LHC VOs were re-enabled on the batch farm.

Incident details

A planned outage of Castor was taken as an opportunity to make a change to the underlying network. This network change was planned to take place at the start of the Castor intervention (i.e. once Castor was down but before any Castor changes were made).

The pre-existing plan was to change the connection to the network switch connecting the Castor headnodes from a single link to a resilient link (to a pair of 4810 switches). This change was lined up to be done ahead of the Castor upgrade (but after Castor was shut down). This simple change was not the subject of a separate change control, but was considered a 'standard change'. However, it required some preparatory work (laying in cables etc.) which had not been completed by the end of the previous day. It was therefore agreed on the previous day that this change would not happen.

However, staff arrived early on the morning of 8th April and the necessary preparations for the network change were completed before the planned Castor outage. Following a discussion that morning involving the person running the Castor change, the Admin on Duty and a Fabric Team staff member, it was agreed the change would now go ahead.

It was also agreed that morning to make a change to the network connectivity of the High Availability CVMFS stratum 1 servers. This involved moving their connections from one switch to another. This change was independent of the previous change, and did not directly involve the Castor system. However, it did involve moving cables from the switch used in the non-resilient connection to the Castor headnodes. This change was agreed between the Fabric Team and the CVMFS service owner.

The first network change started early, before Castor services were stopped. This initial change took place at around (or even slightly before) the announced start time for the whole intervention. The first steps in the network change were to disconnect the existing link cable to the switch hosting the Castor headnodes and reconnect that switch to the network mesh. Around this time network problems were found by staff working in offices, and there was a loss of access to the Castor headnodes. It is not now known whether network problems were more widespread at this point. It is most likely that this loss of access was caused by the network intervention.

Within around 5 minutes(??) the next step of the network change was undertaken. However, an error (re-plugging the wrong end of the cable) caused a network loop. This immediately caused a packet storm. In response to this the UKLight router cut the link to the Tier1 network. (Time to be confirmed). After around 30 seconds staff realised there was a problem and the cable was again disconnected. At the time of the post mortem discussion it was not clear whether the cabling was then correctly changed to the new configuration for the second change. A member of staff arrived in the machine room to inform the person working on the network change that there were network problems.
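To illustrate why a single mis-plugged cable can close a loop, the following is a minimal Python sketch rather than a model of the actual Tier1 network. The switch names are hypothetical and are not taken from the incident. It treats the network as a plain undirected graph and ignores spanning tree and link aggregation, which in a real network are what make deliberately redundant links safe.

```python
def _find(parent, node):
    """Return the representative of the group containing node (with path halving)."""
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node


def links_closing_a_loop(links):
    """Return the links that join two switches which are already connected.

    links is a list of (switch_a, switch_b) cable connections, processed in
    order. Any link whose two ends are already reachable from each other
    closes a loop in the (undirected) topology.
    """
    parent = {}
    looping = []
    for a, b in links:
        for switch in (a, b):
            parent.setdefault(switch, switch)
        root_a, root_b = _find(parent, a), _find(parent, b)
        if root_a == root_b:
            # Both ends are already in the same connected group.
            looping.append((a, b))
        else:
            parent[root_a] = root_b
    return looping


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    cabling = [("headnode-switch", "core-a"), ("core-a", "core-b")]
    print(links_closing_a_loop(cabling))
    # A cable plugged between two switches that are already connected:
    print(links_closing_a_loop(cabling + [("core-b", "headnode-switch")]))
```

A check of this kind against the intended cabling plan would only flag candidate loops; whether a given redundant link is safe depends on the switch configuration, which this sketch does not model.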

From this point there were significant problems with the Tier1 network, with packet loss being observed. High load was also seen on the two 4810 switches (involved in the Castor headnode change), and the network change for the Castor headnodes was reverted.

The problems persisted for some hours. During the afternoon, although there was still packet loss, the network was more stable. Efforts were focused on fixing the 'bypass' link to the OPN (to CERN) - as it was noted that this link had died. The paired link from the Tier1 to the UKLight router wouldn't come up. Time was spent during the afternoon trying to fix that problem, as it was not clear to the person working on it that the rest of the Tier1 network was in trouble.

It became clear that one of the links to the UKLight router had failed some days earlier (on the 2nd April) and this was confusing the investigation of the problems on that link. Furthermore, the exact pattern of network breaks to services hosted on the Tier1 network during the day is not known, although there appear to have been two extended outages affecting access to the Hermes database. It is not understood why this was the case.

During the afternoon effort focused on the main Tier1 network - and it was realised that the two Z9000 switches were having problems. Around 4pm the first was rebooted (time TBC). This cleared up the network sufficiently to test Castor.

The following morning traffic patterns through the second Z9000 were still anomalous. This unit was then restarted and that seemed to clear the problems completely.

In the morning the upgrade of Castor had been started. As a result of the network problems it was decided that the Castor update should be abandoned and the steps taken thus far (rolling out some software updates) reverted. (No changes had been made to the Castor databases). During the afternoon problems were encountered with the Castor downgrade. The Quattor rollback reverted the RPMs but for some reason some inappropriate symlinks were left in place.
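As a hedged illustration of how leftover symlinks might be audited after a rollback, the sketch below lists every symlink under a directory together with its target and whether that target still exists. It is not part of Quattor or Castor, and the directory to scan is left to the caller because the source does not say where the inappropriate symlinks were found.

```python
import os


def list_symlinks(root):
    """Return sorted (path, target, target_exists) tuples for symlinks under root.

    os.walk does not descend into symlinked directories by default, so each
    symlink is reported once, whether it points at a file, a directory, or
    nothing at all.
    """
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                # os.path.exists follows the link, so it is False for a dangling link.
                found.append((path, os.readlink(path), os.path.exists(path)))
    return sorted(found)
```

Comparing the output with the same listing from a known-good node would show links that exist only on the rolled-back host; which links count as inappropriate is a judgement this sketch cannot make.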

There was some confusion about leaving Castor down overnight between the 8th and 9th April. Rather than switching Castor off, it was left running but declared down in the GOC DB.

Analysis

This section to include a breakdown of what happened. Include any related issues.


Follow Up

This is what we used to call future mitigation. Include specific points to be done. It is not necessary to use the table below, but may be easier to do so.


Issue Response Done
Issue 1 Mitigation for issue 1. Done yes/no
Issue 2 Mitigation for issue 2. Done yes/no

Related issues

List any related issue and provide links if possible. If there are none then remove this section.


Reported by: Your Name at date/time

Summary Table

Start Date 18th February 2014
Impact Select one of: >80%, >50%, >20%, <20%
Duration of Outage 2 X 3hour outages, 6hours in total
Status Draft
Root Cause Select one from Unknown, Software Bug, Hardware, Configuration Error, Human Error, Network, User Load
Data Loss Yes/No