Difference between revisions of "7th August - Oxford Nagios."
Latest revision as of 14:01, 9 August 2010
Network failure took gridppnagios.physics.ox.ac.uk offline
Site: UKI-SOUTHGRID-OX-HEP
Incident Date: 2010-08-07
Severity: Bad
Service: UKI Regional Nagios
Impacted: Regional monitoring for all UKI sites
Incident Summary: Failure of an Oxford University DNS server caused a failure of services within the Physics network
Type of Impact: No access to monitoring results, no monitoring tests run
Incident duration: Two days
Report date: 2010-08-09
Incident Overview:
One of several DNS servers run by Oxford University Computing Services (OUCS) stopped responding to queries on the morning of Saturday 7th August 2010. This caused slow DNS resolution on multiple Oxford Physics machines that had this server configured as one of several available DNS servers, so some DNS-dependent services either ran slowly or failed with timeouts. The effect was particularly acute on a pair of DHCP servers, both of which were running ISC dhcpd 3.0.1-65 as supplied in SL4. For reasons which are not yet understood, the DHCP servers stopped responding to DHCP requests from some client machines, causing those clients to drop their network connections when their leases expired. This took several systems off the network, including:
- gridppnagios - Regional Nagios server, running on hardware,
- t2myproxy - MyProxy server used to provide long-lived certificates to gridppnagios, running on a VM,
- t2hn03 - VM hosting box running the t2myproxy VM
Networking was restored to gridppnagios on the afternoon of Saturday 7th by logging in through its remote management card and configuring its network details statically, which removed its dependency on the DHCP service. It was not possible to get remote access to either the t2myproxy VM or the t2hn03 host system. On Sunday 8th the grid systems were accessed physically, and full service was restored by about six o'clock in the afternoon.
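
The root cause described above was a single unresponsive DNS server among several configured resolvers. As an illustration only (it is not part of the actions taken during this incident), the following Python sketch shows one way each configured nameserver could be probed individually, so that a single dead resolver is noticed before it causes timeouts. The function names, the two-second timeout, and the hostname used for the probe are all assumptions made for this example, and the probe handles IPv4 nameserver addresses only.

```python
import socket
import struct


def parse_nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        # Drop trailing comments, then split into whitespace-separated fields.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def build_query(hostname, query_id=0x1234):
    """Build a minimal DNS query packet asking for an A record."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label prefixed by its length, ended by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE 1 = A record, QCLASS 1 = Internet.
    return header + qname + struct.pack("!HH", 1, 1)


def server_responds(server, hostname, timeout=2.0):
    """Return True if `server` answers a UDP DNS query within `timeout` seconds."""
    query = build_query(hostname)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:
            # Covers timeouts, unreachable hosts and malformed addresses.
            return False
    # A genuine reply echoes the query ID in its first two bytes.
    return len(reply) >= 2 and reply[:2] == query[:2]


def check_all(servers, hostname, timeout=2.0):
    """Probe every server and return a mapping of address to True/False."""
    return {server: server_responds(server, hostname, timeout) for server in servers}
```

A check of this kind could be run periodically, for example as `check_all(parse_nameservers(open("/etc/resolv.conf").read()), "example.org")`, with any `False` entry raised as an alert. The hostname here is a placeholder; in practice a name the resolvers are expected to answer for would be used.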