RAL Tier1 Incident 20140830 Network Related Problems


RAL-LCG2 Incident 20140830 Network Related Problems

The on-call team received a number of separate callouts. These were traced to a failed network switch. Staff attended on site and worked around the failure. However, a number of other problems then became apparent which were affecting VMs on the Microsoft Hyper-V virtualisation platform.

Impact


The problem initially appeared at around half past midnight on the morning of Saturday 30th August. .....

The Tier1 site (apart from Castor storage) was declared down in the GOCDB for 5.5 hours.
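For reference, the 5.5-hour figure corresponds to the GOCDB outage window recorded in the timeline below (09:00 to 14:24 on Saturday 30 August), i.e. 5 hours 24 minutes rounded to the nearest half hour. A minimal sketch of that calculation, assuming the GOCDB timestamps are local times on the same day:

 from datetime import datetime

 # GOCDB outage window taken from the timeline below (Sat 30 August 2014).
 outage_start = datetime(2014, 8, 30, 9, 0)
 outage_end = datetime(2014, 8, 30, 14, 24)

 duration = outage_end - outage_start
 print(duration)                                   # 5:24:00
 print(round(duration.total_seconds() / 3600, 1))  # 5.4, reported above as roughly 5.5 hours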

Timeline of the Incident

When What
Saturday 30 August 00:00 Last data recorded by Cacti from many stacks with 172.16.180.0/21 addresses before the problem started. No contact with Cacti until ~11:00 on Monday 1st September. Assume switch stack unit 5b died some time after 00:00.
Saturday 30 August 00:15:08 Cacti host NIC down; probably the time of switch 5b's death. Cacti's last previous NTP sync was at 23:03 on the 29th, so the time is likely to be accurate.
Sat. 30th August 00:29 First callout: "Check_logging_on_system_loggers_on_host_logger2" (connection refused or timed out). This was followed by another callout on logger2 and one for host gdss142 (also connection refused or timed out).
Sat. 30th August 00:57 Primary On-Call (PoC) starts working. Concludes the systems notified (one of the loggers and the soon-to-be-retired software server gdss142) can wait until morning.
Sat. 30th August 01:10 PoC suspects a network problem, probably affecting only a part of the services.
Sat. 30th August 01:45 Callout on argus02 (Socket timeout after 90 seconds). PoC subsequently reboots that node.
Sat. 30th August 02:10 Callout for "Tier1_service_db_system_load_-_night_on_host_lcgdb05".
Sat. 30th August 02:12 PoC records tests clearing after argus02 reboot.
Sat. 30th August 02:29 PoC contacts DB Team on-call.
Sat. 30th August 09:00 Start of unscheduled 'outage' in the GOCDB for the whole Tier1 apart from Castor.
Sat. 30th August 09:49 Next PoC on duty concludes this is most likely a network hardware problem and contacts Fabric Team On-call (FT-OC).
Sat. 30th August ~10:53 (FT-OC) Stack 9 reset. No record in Stack 9 logs of issues prior to this.
Sat. 30th August 14:24 End of unscheduled 'outage' in GOCDB. Converted to a 'warning'.
Mon. 1st September 09:43 Duty Admin ends the 'warning' in the GOCDB.
Mon. 1st September 10:56 (FT) Replacement of stack 5b switch unit (5510-48T) completed, stack complete, connectivity restored as systems connected back to correct ports. Cacti NIC reports up, resumes monitoring of all stacks with which it had lost contact. No loss of contact with management network (10.0.0.0/8) because it uses a different NIC.
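The split visible in the timeline (Cacti losing contact with the stacks on 172.16.180.0/21 while keeping contact over the 10.0.0.0/8 management network on a separate NIC) can be checked directly from the addresses involved. A minimal sketch using Python's ipaddress module; the host addresses below are hypothetical and only illustrate the two ranges:

 import ipaddress

 # Subnets named in the timeline: the affected stack range and the management network.
 affected_net = ipaddress.ip_network("172.16.180.0/21")   # 172.16.180.0 - 172.16.187.255
 management_net = ipaddress.ip_network("10.0.0.0/8")

 # Hypothetical example addresses, not the real Tier1 host IPs.
 for host in ["172.16.181.42", "172.16.185.7", "10.1.2.3"]:
     addr = ipaddress.ip_address(host)
     if addr in affected_net:
         print(f"{host}: in 172.16.180.0/21, monitoring lost when stack unit 5b failed")
     elif addr in management_net:
         print(f"{host}: on the 10.0.0.0/8 management network, still reachable")
     else:
         print(f"{host}: outside both ranges")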

Incident details


Notes: Stack 5 is in the HPD room. The logger that had problems (logger2) is on stack 5, as was/is Cacti (the latter connected into switch 5b).

Stack 9 is in the UPS room.

Analysis

This section to include a breakdown of what happened. Include any related issues.


Follow Up

This is what we used to call future mitigation. Include specific points to be done. It is not necessary to use the table below, but may be easier to do so.


Issue Response Done
Issue 1 Mitigation for issue 1. Done yes/no
Issue 2 Mitigation for issue 2. Done yes/no

Related issues

List any related issue and provide links if possible. If there are none then remove this section.


Reported by: Your Name at date/time

Summary Table

Start Date: 30 August 2014
Impact: Select one of: >80%, >50%, >20%, <20%
Duration of Outage: 5.5 hours
Status: Draft
Root Cause: Select one of: Unknown, Software Bug, Hardware, Configuration Error, Human Error, Network, User Load
Data Loss: No