RAL Tier1 Incident 20120316 Network Packet Storm


RAL Tier1 Incident 16th March 2012: Network Packet Storm

Description

Whilst adding a system to the Tier1 network, a packet storm was triggered. This caused the routers connecting the Tier1 to the rest of the RAL site to drop their links, and disrupted a number of network switch stacks within the Tier1 network. It took some time to resolve all the switch stack problems; a network switch had to be removed from one of the stacks in order to get it functioning again.

Impact

The incident caused an outage of the whole Tier1 for 3.5 hours, followed by a further outage of Castor (and batch) for Atlas, CMS & LHCb lasting just under another three hours.

Timeline of the Incident

When What
2012-03-15 Additional network switch added to the Tier1 network via four bonded links. This switch was configured ready to receive connections from the new hypervisors (4 bonded links for each of the five hypervisors).
2012-03-16 09:30 (approx) AK - Connected cloud-hv01 to the ‘cloud switch’ via the primary link of 4 bonded 1G links. Both cloud-hv01 and the switch were configured for bonding but only the primary link (cloud switch port 9 to eth2) was brought up (i.e. "ifup eth3" etc. was not run, although all 4 cables were connected). The cloud switch was already connected to stack 9 by 4 bonded 1G links, and cloud-hv01 was already connected to stack 9 via eth0 (1G).
2012-03-16 09:31 Log entries in the routers (UKLight & Router A) state that ports were shut down owing to excessive control frames.
2012-03-16 09:30 (approx) JRHA - Started investigating the problem; AK removed the connections into the cloud switch.
2012-03-16 09:40 NM (RAL Networking) informed Tier1 staff that he had seen a packet storm and noted that the UKLight Router had dropped the connections to the Tier1 (C300). Discussions between Tier1 staff (JRHA, GRS) and NM agreed to re-enable ports on Router A and the UKLR. Following this NM re-enabled the links to the C300 from both the UKLight Router and Router A, which was also found to have dropped its link to the Tier1 network.
2012-03-16 09:44 Tiju - Declared an outage in the GOCDB until 11:00 while investigations took place.
2012-03-16 10:00 JRHA - Verified external links back up; waited for monitoring systems to settle.
2012-03-16 10:30 JRHA - Four switch stacks were in a non-functional state; started rebooting them one-by-one, with stacks 4 & 6 proving troublesome.
2012-03-16 10:50 Gareth - Extended the outage in the GOCDB to 17:00.
2012-03-16 12:00 (approx) Gareth - Asked for FTS to be enabled with only the non-RALLCG2 channels running.
2012-03-16 13:00 (approx) Gareth - Site outage ended in the GOCDB at 13:05. Added an outage for Castor (Atlas, CMS, LHCb, as some disk servers were not available) and batch starting at 13:00 until 17:00.
2012-03-16 13:10 JRHA - Continued work restarting stack 4.
2012-03-16 14:00 (approx) Andrew L. - Batch services restarted (not for Atlas, CMS & LHCb as Castor was still down owing to missing disk servers for these instances).
2012-03-16 15:10 JRHA - Stack 4 restored. After much trial and error one unit (out of six) was found to be faulty and was removed from the stack.
2012-03-16 15:20 Chris (Castor On Duty) - Checking out Castor; restarting LSF on disk servers.
2012-03-16 15:45 Castor up. FTS started on the RAL channels. Around 10 minutes later batch was opened up for Atlas, CMS & LHCb.
2012-03-16 15:50 Final outage in the GOCDB ended.

Incident details

At about 09:30 on Friday 16th March, hypervisor cloud-hv01 was connected to the ‘cloud switch’ via the primary link of 4 bonded 1G links. Both cloud-hv01 and the switch were configured for bonding but only the primary link (cloud switch port 9 to eth2) was brought up (i.e. "ifup eth3" etc. was not run, although all 4 cables were connected). The cloud switch was already connected to stack 9 by 4 bonded 1G links, and cloud-hv01 was already connected to stack 9 via eth0 (1G).

What most likely happened is that a mis-configuration on cloud-hv01 caused eth0 and bond0 to both be enslaved to a virtual bridge device, br0, creating a network loop. The original configuration had br0 on eth0 only, and the intended final configuration was br0 on bond0 only.
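
As an illustration of this suspected state (a sketch only, using the interface names given in this report; the actual configuration on cloud-hv01 was not captured here), the bridge would have had two active paths onto the Tier1 network, and the loop could in principle have been inspected and broken from the host side with standard bridge-utils commands:

# Suspected faulty state on cloud-hv01:
#   br0 <- eth0           (existing 1G link to stack 9)
#   br0 <- bond0 <- eth2  (primary of the 4 bonded links to the cloud switch)
# i.e. two paths between the bridge and the Tier1 network = a loop.
#
# Illustrative commands only; in the event the loop was broken by unplugging the cables:
brctl show br0        # list bridge members; eth0 and bond0 both listed indicates the loop
brctl delif br0 eth0  # remove the boot/management interface, leaving br0 on bond0 only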

It was initially thought that the problem was confined to the hypervisor connection, and the links to the cloud switch were disconnected, before it was realised that there was a more widespread problem.

The Router logs showed the following:

Router-A Log:
CPU5 [03/16/12 09:49:31] SNMP INFO Port 8/3 is an access port
CPU5 [03/16/12 09:49:31] SNMP INFO Link Up(8/3)
CPU5 [03/16/12 09:49:31] IP ERROR rcIpAddRoute: addIpRoute failed
CPU5 [03/16/12 09:31:57] CPU WARNING Shutdown port 8/3 due to excessive control frames multicast 0, broadcast 10030 packet per second
CPU5 [03/16/12 09:31:54] SNMP INFO Port 8/3 is an access port
CPU5 [03/16/12 09:31:54] SNMP INFO Link Down(8/3) due to excessive control frames

UKLR Log:
CPU6 [03/16/12 09:30:03] SNMP INFO Link Down(3/1) due to excessive control frames
CPU6 [03/16/12 09:30:03] SNMP INFO Port 3/1 is an access port
CPU6 [03/16/12 09:30:03] CPU WARNING Shutdown port 3/1 due to excessive control frames multicast 0, broadcast 10536 packet per second
CPU6 [03/16/12 09:31:44] SNMP INFO Link Down(7/1) due to excessive control frames
CPU6 [03/16/12 09:31:44] SNMP INFO Port 7/1 is an access port
CPU6 [03/16/12 09:31:44] CPU WARNING Shutdown port 7/1 due to excessive control frames multicast 0, broadcast 12222 packet per second
CPU6 [03/16/12 09:38:11] SW INFO user rwa connected from 192.168.51.101 via telnet
CPU6 [03/16/12 09:47:42] SNMP INFO Link Up(3/1)
CPU6 [03/16/12 09:47:42] SNMP INFO Port 3/1 is an access port
CPU6 [03/16/12 09:47:45] SNMP INFO Link Up(7/1)
CPU6 [03/16/12 09:47:45] SNMP INFO Port 7/1 is an access port

Analysis

New Linux-based hypervisors were being added to the Tier1 network. Each hypervisor was to be connected via a single 1 Gigabit ethernet link plus four 1 Gigabit links bonded together. The single connection is needed for the hypervisor systems to boot; the four bonded links provide greater bandwidth for guest virtual machines.
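
For context, a minimal sketch of the intended per-hypervisor setup is given below. This is an illustration only, assuming the standard Linux bonding driver and bridge-utils, with an active-backup style bond suggested by the report's reference to a "primary link"; the actual bonding mode, configuration files, and interface names beyond eth0/eth2 are not recorded here.

# eth0 (single 1G link): hypervisor boot/management traffic, not bridged.
# eth2-eth5 (assumed names): four 1G links aggregated as bond0 for guest VM traffic.
modprobe bonding mode=active-backup miimon=100   # bonding mode is an assumption
ifenslave bond0 eth2 eth3 eth4 eth5              # enslave the four 1G links to bond0
ip link set bond0 up
brctl addbr br0                                  # software switch for the guest VMs
brctl addif br0 bond0                            # guests reach the network via the bond only
ip link set br0 up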

Before this incident each of the hypervisors was connected to the Tier1 network via a singe Gigabit ethernet cable. Within each hypervisor there is a software network switch which handles the network traffic to the virtual machines. The presence of this switch effectively means that when the second connection (the four bonded channels) were connected a network loop was created through the virtual switch in the hypervisor. This triggered a packet storm that disrupted a number of switch stacks within the Tier1 network and also caused the site routers to drop links. Staff had not realised the additional risk introduced by the virtual network switch within the hypervisor.
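
One host-side precaution, noted here purely as an illustration and not as something that was in place or planned at the time, is that the Linux software switch can itself run spanning tree, so that a loop through the hypervisor would result in a blocked bridge port rather than a broadcast storm:

brctl stp br0 on      # enable spanning tree on the hypervisor's virtual switch
brctl showstp br0     # check port states; a looping port should move to 'blocking'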

The problem was recognised quickly and the loop broken. This was facilitated by a quick notification from central networking staff to Tier1 staff that a router had dropped its links in response to a packet storm. The mis-configuration (i.e. the network loop) was backed out within a few minutes, stopping the packet storm, and the links to the site routers were quickly re-enabled. However, the problems triggered within the switch stacks took considerable time to resolve. In particular one stack was very problematic and could not be successfully restarted until one of the switches within the stack had been removed.

The new equipment was added to a network switch that was itself connected into the switch stack that feeds Tier1 systems in the UPS room, and which therefore feeds services that are crucial for Tier1 operations. Depending on the nature of a networking problem, the ability to isolate it away from important services could be important; however, in this case the packet storm caused problems across the whole of the Tier1 network and isolation after the packet storm was not a factor in the recovery. It was noted that the Tier1 network does not currently have a system in place to prevent such a packet storm, and that spanning tree has not been enabled because it has created problems for system installation in the past.

Follow Up

Issue: A packet storm triggered by an intervention on the Tier1 network led to a significant outage. The network does not have any protection against this.
Response: Investigate adding some form of storm suppression within the Tier1 network. Also investigate the use of Spanning Tree with Fast Learning Mode as a way of preventing network mis-configurations within the Tier1 network.
Done: No

Issue: The addition of the new infrastructure, initially for test, was made to a key part of the Tier1 network and could not be isolated.
Response: Change procedures such that new installations of test/development equipment are placed at the edges of the Tier1 network where possible, and (even if spanning tree is not enabled everywhere) consider adding spanning tree to the individual switches being used for the test as a precaution against mis-configuration.
Done: No

Issue: Although the cause of the problem was quickly found and network links re-enabled, the outage took a long time to be fully resolved. A critical review of progress, at around midday on 16th March in this case, could have led to a faster final resolution.
Response: Ensure reviews of incidents are more rigorously carried out, gathering relevant people together for an overall assessment at key points as the incident unfolds.
Done: Yes

Issue: The risks associated with a virtual network switch in the Linux-based hypervisors had not been appreciated.
Response: In subsequent discussion it became clear that this is part of a general risk when making multiple connections to systems, and that this is covered as part of general networking/LAN skills and awareness. No specific action is appropriate here.
Done: N/A

Reported by: Gareth Smith, 28th March 2012

Summary Table

Start Date: 16 March 2012
Impact: >80%
Duration of Outage: 6.5 hours
Status: Open
Root Cause: Configuration Error
Data Loss: No