RAL Tier1 Incident 20140830 Network Related Problems


RAL-LCG2 Incident 20140830 Network Related Problems

The on-call team received a number of separate callouts. These were traced to a failed network switch. Staff attended on site and worked around the failure. However, a number of other problems then became apparent which were affecting VMs on the Microsoft Hyper-V virtualisation platform. A second switch stack was found to also have problems and was fixed (the stack was reset). Following the resolution of both these problems, some remaining systems were found to have a configuration error that meant they required a DHCP server, which was not initially available.

Impact

The problem initially appeared at around half past midnight on the morning of Saturday 30th August. A range of services were affected, including the CEs and the MyProxy server. Storage (Castor) services were unaffected. The Tier1 site (apart from Castor storage) was declared down in the GOCDB for 5.5 hours from 09:00 on Saturday 30th August, although the problem had been present for some time (possibly 8 or 9 hours) beforehand.

Timeline of the Incident

When What
Saturday 30 August 00:00 Last recorded data by Cacti from many stacks with 172.16.180.0/21 addresses before problem started. No contact with Cacti till 11:00 Mon 1st Sept. Assume switch stack unit 5b died somewhere after 00:00.
Saturday 30 August 00:15:08 Cacti host NIC down. Probably time of switch 5b death. Cacti last previous NTP sync at 23:03 on 29th so time likely to be accurate.
Sat. 30th August 00:29 First callout "Check_logging_on_system_loggers_on_host_logger2" (Connection refused or timed out). This was followed by another callout on logger2 and one for host gdss142 (also Connection refused or timed out).
Sat. 30th August 00:57 Primary On-Call (PoC) starts working. Concludes the systems notified (one of the loggers and the soon-to-be-retired software server gdss142) can wait until morning.
Sat. 30th August 01:10 PoC suspects a network problem, probably affecting only a part of the services.
Sat. 30th August 01:45 Callout on argus02 (Socket timeout after 90 seconds). PoC subsequently reboots that node.
Sat. 30th August 02:10 Callout for "Tier1_service_db_system_load_-_night_on_host_lcgdb05".
Sat. 30th August 02:12 PoC records tests clearing after argus02 reboot.
Sat. 30th August 02:29 PoC contacts DB Team on-call. PoC tries unsuccessfully to access Tier-1 Status page.
Sat. 30th August 03:12 DB on-call assessed that the lcgdb05 node seemed to have lost the backup partition.
Sat. 30th August 03:29 Load slowly increases on lcgdb05. Database good so far (no errors or performance issues) and next backup postponed until midday Saturday (2 hour delay).
Sat. 30th August 03:40 Callout for lcgargus02 again (CHECK_NRPE: Socket timeout after 30 seconds). PoC put it in downtime until midday Saturday (this is the Argus standby host)
Sat. 30th August 09:00 Start of unscheduled 'outage' in the GOCDB for the whole Tier1 apart from Castor.
Sat. 30th August 09:49 Next PoC on duty concludes the most likely cause is a network hardware problem and contacts Fabric Team On-call. List of machines given by PoC: lcgwww, lcgsquid0688, gdss142, logger2, cacti.
Sat. 30th August 11:25 FoC informs PoC that one unit in Stack 5 is down, and the machines mentioned above have been moved to different units in the stack. PoC informs FoC that some VMs are unresponsive and asks him to check if there are any problems on UPS room networking.
Sat. 30th August ~11:53 (FT-OC) Stack 9 reset. No record in Stack 9 logs of issues prior to this.
Sat. 30th August ~13:00 Primary On-Call carries on investigating as some machines still not available. A move of the DHCP server (a VM that was moved to another hypervisor) enabled most of the remaining machines to restart.
Sat. 30th August 14:24 End of unscheduled 'outage' in GOCDB. Converted to a 'warning'.
Mon. 1st September 09:43 Duty Admin ends 'warning' in GOCDB.
Mon. 1st September 10:56 (FT) Replacement of stack 5b switch unit (5510-48T) completed, stack complete, connectivity restored as systems connected back to correct ports. Cacti NIC reports up, resumes monitoring of all stacks with which it had lost contact. No loss of contact with management network (10.0.0.0/8) because it uses a different NIC.

Incident details

There was a series of callouts in the early hours of Saturday 30th August. These affected a set of machines, not all of which were critical for production services. However, the details of the problem were unclear during the night. Of note is that one of the systems that became unavailable was the Cacti network monitor. During the night the Primary On-Call dealt with the set of problems on a case-by-case basis, assessing the significance of the various system failures whilst suspecting some network problem. Also during the night the Database Team on-call was contacted and identified a problem accessing a disk area used for backups.

The following morning the next Primary On-Call reviewed the situation and identified the most likely cause as the failure of a network switch. The Fabric Team on-call person came onto site to investigate and found a faulty switch (switch 5b in the 'HPD' room). As it was not possible to quickly replace the switch at the time, important systems connected via this switch were reconnected to other switches in the same switch stack. This restored connectivity to the important machines on that switch.

However, problems were also appearing on other systems that run as virtual machines. These VMs are hosted on hypervisors that are not connected via the failed switch. A request to the Fabric Team on-call to check the switch stack supporting the hypervisors (located in the UPS room) revealed a further problem. A reset of that switch stack (stack 9) restored connectivity to some, but not all, VMs.

Many of the VMs that were not available across the network were contactable via the hypervisors, and restarting their network interfaces restored their connectivity (and hence the services they provided). Examples of such machines were the ARC CEs and the MyProxy server. However, the problems only affected some of the VMs; not all VMs on any given hypervisor were affected.
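
The report does not record the exact steps used to restart the VM network interfaces. As an illustration only, the following minimal sketch (assuming Linux guests with the iproute2 tools; the interface name eth0 is a placeholder) shows how an interface could be bounced from a root shell reached via the hypervisor console:

#!/usr/bin/env python3
"""Illustrative sketch: take a guest network interface down and bring it back
up. Assumes a Linux guest with the iproute2 'ip' command; 'eth0' is a
placeholder interface name."""

import subprocess
import time


def bounce_interface(iface="eth0"):
    # Take the interface down, pause briefly, then bring it back up.
    subprocess.run(["ip", "link", "set", iface, "down"], check=True)
    time.sleep(2)
    subprocess.run(["ip", "link", "set", iface, "up"], check=True)


if __name__ == "__main__":
    bounce_interface("eth0")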

One of the virtual machines for which connectivity had failed was quattor-deploy01, which provides a DHCP server. Restarting the network interface on this system did not restore its connectivity, but migrating it to another hypervisor enabled it to work. Once this machine was up, most of the remaining systems (both physical and virtual) that were down were able to restart.

Analysis

Subsequent analysis, partly based on the logs from the Cacti network monitor, shows that the failure of a switch (switch 5b, in network stack 5) some time shortly after midnight on the 30th August was the cause of the initial loss of connectivity to some systems.

However, connectivity to a number of other systems (mainly virtual machines) which are not connected to this network switch (or stack) was also lost. Investigation by staff on site showed a problem with a separate network stack, and resetting that stack restored connectivity to some of the affected systems connected via it.

There is no apparent connection between these two network switches/stacks. Other systems in the machine room did not report problems and it appears these were coincidental events.

Only some of the VMs were affected by these events; not all VMs on any given hypervisor were affected. The pattern of which machines were and were not affected remains to be understood, as does why a problem in the switch stack caused the problems seen on the VMs. [Note added later: A few weeks later a handful of VMs, again spread across several hypervisors, lost their network connectivity. In this case all of the affected VMs were configured to use the "emulated" network interface.]
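
As a starting point for the follow-up action on 'synthetic' versus 'emulated' interfaces, the sketch below classifies a Linux Hyper-V guest's network interfaces by the kernel driver bound to them. It assumes the synthetic (para-virtualised) adapter shows up as the hv_netvsc driver and simply flags anything else; that mapping, and the sysfs-based approach, are assumptions for illustration rather than the procedure actually used at RAL.

#!/usr/bin/env python3
"""Illustrative sketch: report which network interfaces on a Linux Hyper-V
guest use the synthetic (hv_netvsc) driver and which use something else
(e.g. an emulated legacy adapter). The driver mapping is an assumption."""

import os

SYS_NET = "/sys/class/net"


def interface_driver(iface):
    """Return the name of the kernel driver bound to an interface."""
    link = os.path.join(SYS_NET, iface, "device", "driver")
    if not os.path.exists(link):
        return "unknown"
    return os.path.basename(os.path.realpath(link))


def classify_interfaces():
    """Map each non-loopback interface to (driver, classification)."""
    result = {}
    for iface in os.listdir(SYS_NET):
        if iface == "lo":
            continue
        driver = interface_driver(iface)
        kind = "synthetic" if driver == "hv_netvsc" else "emulated/other"
        result[iface] = (driver, kind)
    return result


if __name__ == "__main__":
    for iface, (driver, kind) in sorted(classify_interfaces().items()):
        print("%s: driver=%s -> %s" % (iface, driver, kind))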

Even after the network switch/stack problems had been resolved, some machines were still not available. Once the DHCP server was restarted these systems became available again. The normal configuration of Tier1 systems is that DHCP is not required except when installing a system. However, a configuration error had arisen that affected a subset of systems, which then required DHCP when their networking was reset (or the system restarted). At the time of this review the problem is understood but the fix has not yet been rolled out everywhere.
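
The report does not give the exact form of the configuration error. Purely as an illustration, and assuming Red Hat-style ifcfg files, a host could be flagged as DHCP-dependent on a network restart with a check along these lines:

#!/usr/bin/env python3
"""Illustrative sketch: flag interface configuration files that would fall
back to DHCP when networking is restarted. Assumes Red Hat-style ifcfg files
under /etc/sysconfig/network-scripts; the real configuration error at RAL is
not described in the report."""

import glob
import re

IFCFG_GLOB = "/etc/sysconfig/network-scripts/ifcfg-*"
DHCP_RE = re.compile(r'^\s*BOOTPROTO\s*=\s*"?dhcp"?\s*$',
                     re.IGNORECASE | re.MULTILINE)


def dhcp_interfaces():
    """Return the ifcfg files that declare BOOTPROTO=dhcp."""
    flagged = []
    for path in glob.glob(IFCFG_GLOB):
        if path.endswith("ifcfg-lo"):
            continue
        with open(path) as fh:
            if DHCP_RE.search(fh.read()):
                flagged.append(path)
    return flagged


if __name__ == "__main__":
    for path in dhcp_interfaces():
        print("DHCP at (re)start: " + path)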

Follow Up

Issue Response Done
There was an unnecessary, and potentially confusing, callout for the standby Argus server (Argus02). Validate procedures for switching Argus servers to ensure callouts are correctly updated. No
It was not clear that the system used to host the Database Backups was unavailable. Check the monitoring of this system. No
Investigate the cause of some VMs having network problems (but not all VMs) Check which systems use the 'synthetic' and which 'emulated' network interfaces. No
It was not possible to quickly replace the failed switch in stack 5. This was worked around by re-connecting critical systems to spare ports in other switches in the stack. Check, and improve where necessary, the status of networking spares. No
Some systems required DHCP to recover from the problem. Ensure a fix to the configuration error that caused this is rolled out across all affected systems. No
There was no direct indication from the Nagios based monitoring that there was a problem with network switches. Review and improve the Nagios monitoring of network switches (see the sketch after this table). No
There is no time synchronization across network devices such as disk arrays and switches. This makes cross-correlation of their logs difficult. An NTP server exists on the management network. Review and configure NTP on these devices. No
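
As an illustration of the switch-monitoring action above, the sketch below is a Nagios-style reachability check. The stack names and management addresses are placeholders rather than real values from the report, and a production check would more likely poll the switches over SNMP as well as ICMP:

#!/usr/bin/env python3
"""Illustrative Nagios-style check: report CRITICAL if any monitored switch
(or stack unit) management address stops answering ICMP. The addresses below
are placeholders, not the real RAL management addresses."""

import subprocess
import sys

# Placeholder management addresses for the switch stacks.
SWITCHES = {
    "stack5-unit-b": "10.0.5.2",
    "stack9": "10.0.9.1",
}


def is_reachable(address):
    """Send one ICMP echo request with a short timeout (Linux iputils ping)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def main():
    down = [name for name, addr in SWITCHES.items() if not is_reachable(addr)]
    if down:
        print("CRITICAL: unreachable switches: " + ", ".join(down))
        return 2  # Nagios CRITICAL exit code
    print("OK: all monitored switches reachable")
    return 0  # Nagios OK exit code


if __name__ == "__main__":
    sys.exit(main())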

Reported by: Gareth Smith 14th September 2014

Summary Table

Start Date 30 August 2014
Impact >50%
Duration of Outage Approximately 14 hours
Status Open
Root Cause Hardware compounded by Configuration Error
Data Loss No