Difference between revisions of "RAL Tier1 Incident 20140830 Network Related Problems"

From GridPP Wiki

Revision as of 12:54, 4 September 2014

RAL-LCG2 Incident 20140830 Network Related Problems

The on-call team received a number of separate callouts, which were traced to a failed network switch. Staff attended on site and worked around the failure. However, a number of other problems then became apparent, affecting VMs on the Microsoft Hyper-V virtualisation platform.

Impact

Describe the type of impact, including which services / VOs were affected, how long they were impacted for, and the dates. If there was data loss, ensure this is clearly flagged.

The problem initially appeared at around ???? The Tier1 site (apart from Castor storage) was declared down in the GOC DB for 5.5 hours.


Timeline of the Incident

When What
Date & maybe time e.g. 20th July 09:00 Blah Team did something
Sat. 30th August 00:29 First callout "Check_logging_on_system_loggers_on_host_logger2" (Connection refused or timed out). This was followed by another callout on logger2 and one for host gdss142 (also Connection refused or timed out).
Sat. 30th August 00:57 Primary On-Call (PoC) starts working. Concludes that the affected systems (one of the loggers and the soon-to-be-retired software server gdss142) can wait until morning.
Sat. 30th August 01:45 Callout on argus02 (Socket timeout after 90 seconds.). PoC subsequently reboots that node.
Sat. 30th August 02:10 Callout for "Tier1_service_db_system_load_-_night_on_host_lcgdb05".
Sat. 30th August 02:12 PoC records tests clearing after argus02 reboot.
Sat. 30th August 02:29 PoC contacts DB Team on-call.
Sat. 30th August 09:49 Next PoC on duty concludes that the most likely cause is a network hardware problem and contacts Fabric Team On-call.


Incident details

Put a reasonably detailed description of the incident here.


Analysis

This section to include a breakdown of what happened. Include any related issues.


Follow Up

This is what we used to call future mitigation. Include specific points to be done. It is not necessary to use the table below, but may be easier to do so.


Issue Response Done
Issue 1 Mitigation for issue 1. Done yes/no
Issue 2 Mitigation for issue 2. Done yes/no

Related issues

List any related issue and provide links if possible. If there are none then remove this section.


Reported by: Your Name at date/time

Summary Table

Start Date 30 August 2014
Impact Select one of: >80%, >50%, >20%, <20%
Duration of Outage Hours e.g. 3hours
Status Draft
Root Cause Select one from Unknown, Software Bug, Hardware, Configuration Error, Human Error, Network, User Load
Data Loss No