Difference between revisions of "RAL Tier1 Incident 20140830 Network Related Problems"
Revision as of 12:54, 4 September 2014
RAL-LCG2 Incident 20140830 Network Related Problems
The on-call team received a number of separate callouts, which were traced to a failed network switch. Staff attended on site and worked around the failure. However, a number of other problems then became apparent, affecting VMs on the Microsoft Hyper-V virtualisation platform.
Impact
Describe the type of impact. Include which services / VOs. How long they were impacted for and give the dates. If data loss ensure this is clearly flagged.
The problem initially appeared at around ???? The Tier1 site (apart from Castor storage) was declared down in the GOC DB for 5.5 hours.
Timeline of the Incident
When | What |
---|---|
Date & maybe time e.g. 20th July 09:00 | Blah Team did something |
Sat. 30th August 00:29 | First callout "Check_logging_on_system_loggers_on_host_logger2" (Connection refused or timed out). This was followed by another callout on logger2 and one for host gdss142 (also Connection refused or timed out). |
Sat. 30th August 00:57 | Primary On-Call (PoC) starts working. Concludes that the affected systems (one of the loggers and the soon-to-be-retired software server gdss142) can wait until morning. |
Sat. 30th August 01:45 | Callout on argus02 (Socket timeout after 90 seconds.). PoC subsequently reboots that node. |
Sat. 30th August 02:10 | Callout for "Tier1_service_db_system_load_-_night_on_host_lcgdb05". |
Sat. 30th August 02:12 | PoC records tests clearing after argus02 reboot. |
Sat. 30th August 02:29 | PoC contacts DB Team on-call. |
Sat. 30th August 09:49 | Next PoC on duty concludes that the cause is most likely a network hardware problem and contacts the Fabric Team on-call. |
Incident details
Put a reasonably detailed description of the incident here.
Analysis
This section to include a breakdown of what happened. Include any related issues.
Follow Up
This is what we used to call future mitigation. Include specific points to be done. It is not necessary to use the table below, but may be easier to do so.
Issue | Response | Done |
---|---|---|
Issue 1 | Mitigation for issue 1. | Done yes/no |
Issue 2 | Mitigation for issue 2. | Done yes/no |
Related issues
List any related issue and provide links if possible. If there are none then remove this section.
Reported by: Your Name at date/time
Summary Table
Start Date | 30 August 2014 |
Impact | Select one of: >80%, >50%, >20%, <20% |
Duration of Outage | Hours e.g. 3hours |
Status | Draft |
Root Cause | Select one from Unknown, Software Bug, Hardware, Configuration Error, Human Error, Network, User Load |
Data Loss | No |