7th August - Oxford Nagios
UKI-SOUTHGRID-OX-HEP - Nagios outage
Description:
Network failure took gridppnagios.physics.ox.ac.uk offline
Impact
Regional monitoring for all UKI sites
Timeline of the Incident
When | What |
---|---|
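07 Aug 2010, morning | An OUCS DNS server stopped responding to queries; DHCP servers in Physics stopped answering some clients |
07 Aug 2010, afternoon | Networking restored to gridppnagios via its remote management card, using a static network configuration |
08 Aug 2010, about 18:00 | Physical access to the grid systems; full service restored |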
Incident details
Failure of an Oxford University DNS server caused a failure of services within the Physics network
Analysis
One of several DNS servers run by Oxford University Computing Services (OUCS) stopped responding to queries on the morning of Saturday 7th August 2010. This slowed DNS resolution on multiple Oxford Physics machines that had this server configured as one of several available DNS servers, causing some DNS-dependent services to either run slowly or fail with timeouts. The effect was particularly acute on a pair of DHCP servers, both of which were running ISC dhcpd 3.0.1-65 as supplied in SL4. For reasons which are not yet understood, the DHCP servers stopped responding to DHCP requests from some client machines, causing those machines to drop their network connections when their leases expired. This took several systems off the network, including:
* gridppnagios - Regional Nagios server, running on hardware
* t2myproxy - MyProxy server used to provide long-lived certificates to gridppnagios, running on a VM
* t2hn03 - VM hosting box running the t2myproxy VM
Networking was restored to gridppnagios on the afternoon of Saturday 7th by logging in through its remote management card and configuring the network details statically, thereby removing the dependency on the DHCP service. It was not possible to get remote access to either the t2myproxy VM or the t2hn03 host system. On Sunday the 8th, physical access to the grid systems was obtained and full service was restored by about six o'clock in the afternoon.
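The failure mode described above, where one of several configured DNS servers silently stops answering, can be detected by querying each configured server individually rather than relying on the system resolver, which hides the problem behind retries. The following is a minimal sketch of such a check, not the tooling used at Oxford. It assumes IPv4 nameservers listed in resolv.conf-style text; the function names and the 2-second timeout are illustrative choices.

```python
import socket
import struct


def parse_nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        # Drop trailing comments, then split into whitespace-separated fields.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def build_query(hostname, query_id=0x1234):
    """Build a minimal DNS A-record query packet (RFC 1035 wire format)."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Name is encoded as length-prefixed labels terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def server_responds(server, hostname, timeout=2.0):
    """Return True if `server` replies to a query for `hostname` in time."""
    query = build_query(hostname)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # includes timeouts and unreachable-network errors
            return False
    # A genuine reply echoes the query id in its first two bytes.
    return reply[:2] == query[:2]


def check_all(resolv_conf_text, hostname, probe=server_responds):
    """Map each configured nameserver to whether it answered the probe."""
    return {
        server: probe(server, hostname)
        for server in parse_nameservers(resolv_conf_text)
    }
```

On a typical Linux host this could be run as `check_all(open("/etc/resolv.conf").read(), "gridppnagios.physics.ox.ac.uk")`; any server mapped to `False` is a candidate for removal from the resolver configuration. The `probe` argument exists only so the logic can be exercised without network access.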
Follow Up
To minimize the dependency on the Oxford site, the first step is to use an existing MyProxy server at a different site (preferably RAL) as a fallback option.
We are already backing up all databases, so in case of hardware failure it is possible to install and configure a new regional Nagios within a few hours. The long-term plan is to have a fail-over regional Nagios installation at another site.
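The MyProxy fallback idea can be sketched as an ordered list of candidate servers, with the first reachable one being used. This is only an illustration of the selection logic, not the configuration mechanism Nagios itself uses. The hostnames in the usage note are hypothetical placeholders, and port 7512 is the default MyProxy port, assumed here to be unchanged at both sites.

```python
import socket

MYPROXY_PORT = 7512  # default MyProxy server port (assumed unchanged)


def tcp_reachable(host, port=MYPROXY_PORT, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failures, refusals and timeouts
        return False


def pick_server(candidates, probe=tcp_reachable):
    """Return the first candidate that passes the probe, or None if none do."""
    for host in candidates:
        if probe(host):
            return host
    return None
```

For example, `pick_server(["myproxy.primary.example", "myproxy.fallback.example"])` would return the primary while it is reachable and the fallback otherwise. A successful TCP connection only shows that the port is open; it does not prove that the server can actually issue a credential, so a production check would need to go further.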
Reported by:
Ewan MacMahon at 10 Aug 2010
Kashif Mohammad at 11 Aug 2010
Summary Table
Start Date | Date e.g. 20 July 2010 |
Impact | Select one of: >80%, >50%, >20%, <20% |
Duration of Outage | Hours e.g. 3hours |
Status | select one from Open, Understood, Closed |
Root Cause | Select one from Unknown, Software Bug, Hardware, Configuration Error, Human Error, Network, User Load |
Data Loss | Yes/No |