Difference between revisions of "RAL Tier1 Incident 20140830 Network Related Problems"
Revision as of 12:51, 4 September 2014
RAL-LCG2 Incident 20140830 Network Related Problems
The on-call team received a number of separate callouts, which were traced to a failed network switch. Staff attended on site and worked around the failure. However, a number of other problems then became apparent, affecting VMs on the Microsoft Hyper-V virtualisation platform.
Impact
Describe the type of impact. Include which services / VOs. How long they were impacted for and give the dates. If data loss ensure this is clearly flagged.
The problem initially appeared at around ???? The Tier1 site (apart from Castor storage) was declared down in the GOC DB for 5.5 hours.
Timeline of the Incident
When | What |
---|---|
Date & maybe time e.g. 20th July 09:00 | Blah Team did something |
Sat. 30th August 00:29 | First callout "Check_logging_on_system_loggers_on_host_logger2" (Connection refused or timed out). This was followed by another callout on logger2 and one for host gdss142 (also Connection refused or timed out). |
Sat. 30th August 00:57 | Primary On-Call (PoC) starts working. Concludes that the systems flagged (one of the loggers and the soon-to-be-retired software server gdss142) can wait until morning. |
Sat. 30th August 01:45 | Callout on argus02 (Socket timeout after 90 seconds). PoC subsequently reboots that node. |
Sat. 30th August 02:12 | PoC records tests clearing after argus02 reboot. |
Sat. 30th August 09:49 | Next PoC on duty concludes that a network hardware problem is the most likely cause and contacts the Fabric Team On-call. |
Incident details
Put a reasonably detailed description of the incident here.
Analysis
This section to include a breakdown of what happened. Include any related issues.
Follow Up
This is what we used to call future mitigation. Include specific points to be done. It is not necessary to use the table below, but may be easier to do so.
Issue | Response | Done |
---|---|---|
Issue 1 | Mitigation for issue 1. | Done yes/no |
Issue 2 | Mitigation for issue 2. | Done yes/no |
Related issues
List any related issue and provide links if possible. If there are none then remove this section.
Reported by: Your Name at date/time
Summary Table
Start Date | 30 August 2014 |
Impact | Select one of: >80%, >50%, >20%, <20% |
Duration of Outage | Hours e.g. 3hours |
Status | Draft |
Root Cause | Select one from Unknown, Software Bug, Hardware, Configuration Error, Human Error, Network, User Load |
Data Loss | No |