RAL Tier1 Incident 20101207 Network Outage

From GridPP Wiki

RAL Tier1 Incident 7th December 2010: Six Hour Network Outage and loss of service.

Description:

An outage of the RAL Site Access Router (SAR) effectively took the RAL Tier1 off-air from 06:15 to 12:15 on Tuesday 7th December 2010.

Impact

The whole site was unavailable. Internal work carried on (batch jobs were able to access the local Castor storage) but external access was blocked.

Timeline of the Incident

When | What
7th Dec 2010 06:15 | Initial network problem breaks external connectivity.
7th Dec 2010 09:30 | Tier1 staff attempting to get message out about site unavailability.
7th Dec 2010 12:15 | Problem on SAR fixed.
7th Dec 2010 13:00 | Some resulting problems with DNS and BDII being resolved.
7th Dec 2010 14:30 | Broadcast message sent stating RAL Tier1 back online.
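
The durations implied by the timeline can be checked with a short calculation. The following is only an illustrative sketch: the timestamps are copied from the entries above, the variable and function names are invented for this example, and the times are treated as naive values because the report does not state a timezone.

```python
from datetime import datetime

# Timestamps copied from the timeline above (timezone not stated in the report).
FMT = "%Y-%m-%d %H:%M"
outage_start = datetime.strptime("2010-12-07 06:15", FMT)
sar_fixed = datetime.strptime("2010-12-07 12:15", FMT)
broadcast_sent = datetime.strptime("2010-12-07 14:30", FMT)


def hours_between(start, end):
    """Return the elapsed time between two datetimes, in hours."""
    return (end - start).total_seconds() / 3600


# Length of the network outage itself (loss of connectivity to SAR fix).
network_outage_hours = hours_between(outage_start, sar_fixed)

# Time from the start of the outage until the "back online" broadcast.
time_to_all_clear_hours = hours_between(outage_start, broadcast_sent)

print(f"Network outage: {network_outage_hours:.2f} h")
print(f"Outage start to 'back online' broadcast: {time_to_all_clear_hours:.2f} h")
```

The first figure corresponds to the six-hour outage quoted in this report. The second covers the additional period in which the DNS and BDII problems were resolved before the broadcast was sent.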

Incident details

A failure of the Site Access Router disconnected the RAL site from the internet. Connectivity over the OPN remained in place, but all other traffic was blocked. The blocking of control information etc. effectively took the RAL Tier1 site off air.

Informing the VOs of the outage was difficult because of problems obtaining mobile phone internet connections and incorrect (out-of-date) contact telephone numbers.

On restoring the connectivity there were some further problems on site with the DNS servers and locally with the Tier1 BDIIs.

Analysis

The failure of the Site Access Router blocked all traffic onto and off the RAL site. The exception was data traffic between Castor Disk Servers and other sites over the OPN. However, as all non-data-flow traffic (including control information) was blocked, the Tier1 was off-air for the duration of the network outage.

The network access to site is managed by a networking group that is separate from the Tier1 team.

Problems were encountered in getting information about the failure out to the VOs. The usual communications methods did not work:

  • Local RAL e-mail would not go off-site.
  • Mobile phone network connectivity was found to be very erratic (possibly because of high load at the time).
  • The web page for the EGI broadcasts (CIC portal) was accessed but it was not possible to send a broadcast. This is possibly an effect of the GOC DB, also located at RAL, being unreachable from the CIC portal during the network break.
  • E-mail lists were not available, as some are based at RAL and others do not accept mail from users' personal accounts.
  • Some locally documented phone numbers, such as the WLCG emergency contact number, were no longer valid.

The information was finally released by telephone, by asking colleagues elsewhere to pass on the message.

Follow Up

Issue | Response | Done
Difficulty getting message out to VOs about site off-air. | Review the communications plan and validate contact numbers. Instigate regular (six-monthly) review of the plan including validation of contact details. | Yes
Unable to send EGI Broadcast from portal. | Follow up with those responsible for the portal to confirm cause of the failure and request improvement. Note added following review on 28/06/11: There have been many changes to the portal. This item is now obsolete. | N/A

Reported by: Gareth Smith 23rd December 2010

Summary Table

Start Date | 7th December 2010
Impact | >80%
Duration of Outage | 6 hours
Status | Closed
Root Cause | Network
Data Loss | No