7th August - Oxford Nagios.

Network failure took gridppnagios.physics.ox.ac.uk offline

Site: UKI-SOUTHGRID-OX-HEP

Incident Date: 2010-08-07

Severity: Bad

Service: UKI Regional Nagios

Impacted: Regional monitoring for all UKI sites

Incident Summary: Failure of an Oxford University DNS server caused a failure of services within the Physics network

Type of Impact: No access to monitoring results, no monitoring tests run

Incident duration: Two days

Report date: 2010-08-09

Incident Overview:

One of several DNS servers run by Oxford University Computing Services (OUCS) stopped responding to queries on the morning of Saturday 7th August 2010. This slowed DNS resolution on multiple Oxford Physics machines that had this server configured as one of several available DNS servers, causing some DNS-dependent services to run slowly or fail with timeouts. The effect was particularly acute on a pair of DHCP servers, both running ISC dhcpd 3.0.1-65 as supplied in SL4. For reasons that are not yet understood, the DHCP servers stopped responding to DHCP requests from some client machines, which then dropped their network connections when their leases expired. This took several systems off the network, including:

  • gridppnagios - Regional nagios server, running on hardware,
  • t2myproxy - MyProxy server used to provide long-lived certificates to gridppnagios, running on a VM,
  • t2hn03 - VM hosting box running the t2myproxy VM

Networking was restored to gridppnagios on the afternoon of Saturday 7th by logging in through its remote management card and configuring the network details statically, thereby removing the dependency on the DHCP service. It was not possible to get remote access to either the t2myproxy VM or the t2hn03 host system. On Sunday 8th, physical access to the grid systems was obtained and full service was restored by about six o'clock in the afternoon.
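
The initial fault, one unresponsive server among several configured resolvers, is easy to miss because the system resolver quietly falls back to the remaining servers after a timeout. The following is a minimal sketch, not part of the original incident response, of how each configured server could be queried directly to find the one that has stopped answering. The function names, the 2 second timeout and the use of a hand-built UDP query are all assumptions made for illustration; it only handles IPv4 server addresses.

```python
import socket
import struct


def build_query(hostname: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query packet for an A record (RFC 1035 layout)."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label is prefixed by its length; a zero byte ends it.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A record), QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def server_responds(server: str, hostname: str, timeout: float = 2.0) -> bool:
    """Return True if `server` answers a query for `hostname` within `timeout`."""
    query = build_query(hostname)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable-network errors
            return False
    # A genuine reply echoes the query ID and has the QR (response) bit set.
    return len(reply) >= 12 and reply[:2] == query[:2] and bool(reply[2] & 0x80)


def read_nameservers(resolv_conf_text: str) -> list:
    """Extract nameserver addresses from the text of a resolv.conf file."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers
```

In use, the contents of /etc/resolv.conf would be passed to `read_nameservers`, and `server_responds(server, "gridppnagios.physics.ox.ac.uk")` called for each address returned; any server for which it returns False is a candidate for the kind of failure described in the overview.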