7th August - Oxford Nagios

From GridPP Wiki

UKI-SOUTHGRID-OX-HEP - Nagios outage

Description:

Network failure took gridppnagios.physics.ox.ac.uk offline

Impact

Regional monitoring for all UKI sites

Timeline of the Incident

When What
07 Aug 2010

Incident details

Failure of an Oxford University DNS server caused a failure of services within the Physics network

Analysis

One of several DNS servers run by Oxford University Computing Services (OUCS) stopped responding to queries on the morning of Saturday 7th August 2010. This slowed DNS resolution on multiple Oxford Physics machines that had this server configured as one of several available DNS servers, causing some DNS-dependent services to either run slowly or fail with timeouts. The effect was particularly acute on a pair of DHCP servers, both running ISC dhcpd 3.0.1-65 as supplied in SL4. For reasons that are not yet understood, the DHCP servers stopped responding to DHCP requests from some client machines, causing those clients to drop their network connections when their leases expired. This took several systems off the network, including:

   * gridppnagios - Regional Nagios server, running on hardware,
   * t2myproxy - MyProxy server used to provide long-lived certificates to gridppnagios, running on a VM,
   * t2hn03 - VM hosting box running the t2myproxy VM

Networking was restored to gridppnagios on the afternoon of Saturday 7th by logging in through its remote management card and configuring the network details statically, thereby removing the dependency on the DHCP service. It was not possible to get remote access to either the t2myproxy VM or the t2hn03 host system. On Sunday 8th, physical access to the grid systems was obtained and full service was restored by about six o'clock in the afternoon.
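The slow-resolution effect described in this analysis can be illustrated with a small sketch. It assumes the affected machines used a glibc-style stub resolver that tries the listed nameservers in order and waits for the default 5-second timeout on each unresponsive one. The actual resolver configuration is not given in this report, the addresses below are documentation placeholders, and retries and timeout scaling on later attempts are ignored.

```python
# Illustrative only: estimates the first-pass delay of a sequential stub resolver.
DEFAULT_TIMEOUT_S = 5  # glibc default for "options timeout:n" in resolv.conf


def parse_resolv_conf(text):
    """Return (nameservers, timeout_seconds) from resolv.conf-style text."""
    nameservers = []
    timeout = DEFAULT_TIMEOUT_S
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) >= 2:
            nameservers.append(fields[1])
        elif fields[0] == "options":
            for opt in fields[1:]:
                if opt.startswith("timeout:"):
                    timeout = int(opt.split(":", 1)[1])
    return nameservers, timeout


def first_pass_delay(nameservers, timeout, dead):
    """Seconds spent waiting on dead servers before the first live one is tried.

    Returns None if every listed server is dead.
    """
    waited = 0
    for server in nameservers:
        if server not in dead:
            return waited
        waited += timeout
    return None


# Placeholder addresses (RFC 5737 documentation range), not the real servers.
EXAMPLE = """\
nameserver 192.0.2.1
nameserver 192.0.2.2
nameserver 192.0.2.3
"""

servers, timeout = parse_resolv_conf(EXAMPLE)
print(first_pass_delay(servers, timeout, dead={"192.0.2.1"}))  # prints 5
```

Under these assumptions, a single dead server listed first adds about five seconds to every uncached lookup, which is enough to push DNS-dependent services past their own timeouts.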

Follow Up

To reduce the dependency on the Oxford site, the first step is to use an existing MyProxy server at a different site (preferably RAL) as a fallback option.

We already back up all databases, so in case of hardware failure it is possible to install and configure a new regional Nagios server in a few hours. The long-term plan is to have a failover regional Nagios installation at another site.
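The MyProxy fallback idea could be sketched roughly as follows. This is a minimal illustration, not the configuration actually deployed: the hostnames are hypothetical, port 7512 is the standard MyProxy port, and a successful TCP connection only shows that the server is reachable, not that it can issue a credential.

```python
import socket

MYPROXY_PORT = 7512  # standard MyProxy server port


def tcp_probe(host, port=MYPROXY_PORT, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False


def pick_myproxy(candidates, probe=tcp_probe):
    """Return the first candidate the probe reports as reachable, else None."""
    for host in candidates:
        if probe(host):
            return host
    return None


# Hypothetical hostnames; the lambda simulates the primary being unreachable.
CANDIDATES = ["myproxy.primary.example", "myproxy.fallback.example"]
print(pick_myproxy(CANDIDATES, probe=lambda h: h != "myproxy.primary.example"))
```

Passing the probe as a parameter keeps the selection logic testable without network access; in real use the default `tcp_probe` would be left in place.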





Reported by:

Ewan MacMahon at 10 Aug 2010

Kashif Mohammad at 11 Aug 2010
