Manchester Incident 20100227

From GridPP Wiki

Site: Manchester (UKI-NORTHGRID-MAN-HEP)

Incident Date: 27/02/2010

Severity: Field not defined yet

Service: gridpp.ac.uk DNS

Impacted: GridPP sites using gridpp.ac.uk; Users of Manchester VOMS

Incident Summary: On Saturday 27th February several sites noticed problems contacting the GridPP website and found that SAM tests were failing. It was quickly established that the gridpp.ac.uk DNS was down, and lookups against the authoritative DNS returned odd results. The matter was reported on TB-SUPPORT and via GGUS.

Type of Impact: Service down

Incident duration:

Report date: 05/03/2010

Reported by: Andrew McNab

Related URLs: N/A

Incident details: For historical reasons, a set of GridPP/NGS WWW and VOMS machines was operated in the Manchester Tier-2 centre as a standalone service with its own IP subnet and DNS nameservers. We have been in the final stages of merging these services with others in Manchester, and the GridPP web and DNS services were the last to be migrated (by Andrew McNab).

Before this was finished, the failure of one of the DNS servers reduced the number of authoritative nameservers for gridpp.ac.uk to one (the one on the WWW server itself). This remaining machine stopped responding to DNS requests sometime after the morning of Saturday 27th February, although it continued to respond to IP pings. When examined on Monday 1st March, it was found to have suffered a kernel panic.
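The distinction between answering pings and answering DNS queries matters for monitoring: a host can be reachable while its nameserver is dead. The following is a minimal sketch of a probe that tests the DNS service itself, using only the Python standard library. The function names, the query ID and the 3-second timeout are illustrative assumptions, not part of any monitoring that was in place at Manchester.

```python
import socket
import struct


def build_query(name: str, qtype: int = 2, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query packet (qtype 2 = NS, class IN)."""
    # Header: ID, flags (plain query, no recursion), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def is_valid_reply(query: bytes, reply: bytes) -> bool:
    """True if reply is a DNS response matching the query ID with RCODE 0."""
    if len(reply) < 12:
        return False
    reply_id, flags = struct.unpack("!HH", reply[:4])
    query_id = struct.unpack("!H", query[:2])[0]
    is_response = bool(flags & 0x8000)  # QR bit set means "response"
    rcode = flags & 0x000F              # 0 means NOERROR
    return reply_id == query_id and is_response and rcode == 0


def dns_answers(server_ip: str, name: str, timeout: float = 3.0) -> bool:
    """Send one UDP query to server_ip port 53; report whether it answered."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server_ip, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable hosts
            return False
    return is_valid_reply(query, reply)
```

A check such as `dns_answers(ip, "gridpp.ac.uk")`, run against each authoritative nameserver address in turn, would have flagged the remaining server as failed even though it still responded to pings.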

The major result of this was that the CNAME alias lcg-bdii.gridpp.ac.uk, which points to lcgbdii.gridpp.rl.ac.uk, began to expire from caching DNS servers at other sites, and jobs that needed to access the BDII in order to change state failed.

After the server was restarted on Monday, the expiration time of the lcg-bdii.gridpp.ac.uk entry was increased to 3 days as an immediate measure, and the Tier-2 DNS servers were configured to serve the gridpp.ac.uk domain instead. These are now acting as the authoritative name servers for gridpp.ac.uk (and for the associated gpp.hep.man.ac.uk domain that it depends on).
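To illustrate why a longer expiration time helps, here is a small sketch of the arithmetic, assuming that the expiration time refers to the record's TTL as seen by caching resolvers. The timestamps and the 1-day comparison value are hypothetical; this report does not state the previous expiration time or the exact times of the outage.

```python
from datetime import datetime, timedelta


def cache_expiry(cached_at: datetime, ttl: timedelta) -> datetime:
    """Time at which a caching resolver discards a record fetched at cached_at."""
    return cached_at + ttl


def survives_outage(cached_at: datetime, ttl: timedelta,
                    outage_end: datetime) -> bool:
    """True if the cached record is still valid when the authoritative
    server comes back at outage_end."""
    return cache_expiry(cached_at, ttl) >= outage_end


# Hypothetical times: a cache refreshed on Saturday morning, with the
# authoritative server back on Monday at midday.
cached_at = datetime(2010, 2, 27, 9, 0)
outage_end = datetime(2010, 3, 1, 12, 0)

for ttl in (timedelta(days=1), timedelta(days=3)):
    print(ttl, survives_outage(cached_at, ttl, outage_end))
```

Under these assumptions only the 3-day value carries the record through a weekend outage. The expiration time is an upper bound rather than a guarantee: a cache that last refreshed the record well before the failure has less time remaining, so a longer value narrows the exposure but does not replace a second authoritative server.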

Future mitigation:

We have contacted Manchester IT services about improving the resilience of the DNS service and we will also set up at least one other DNS secondary at another UK site outside of Net North West.
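A rough way to check that a set of nameservers is not concentrated on one network is to group their addresses by prefix. The sketch below does this with the standard `ipaddress` module. The addresses are taken from the reserved documentation ranges and are purely illustrative, and the /24 prefix length is an assumption. Sharing or not sharing a prefix is only a crude proxy for being at another site outside Net North West, since real independence also depends on routing, power and physical location.

```python
import ipaddress


def distinct_networks(addresses, prefix_len=24):
    """Return the set of networks (at the given prefix length) that the
    nameserver addresses fall into."""
    return {
        ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        for addr in addresses
    }


def is_network_diverse(addresses, prefix_len=24, minimum=2):
    """True if the nameservers span at least `minimum` distinct networks."""
    return len(distinct_networks(addresses, prefix_len)) >= minimum


# Hypothetical nameserver addresses (reserved documentation ranges).
same_site = ["192.0.2.1", "192.0.2.53"]
with_offsite_secondary = ["192.0.2.1", "198.51.100.1"]

print(is_network_diverse(same_site))
print(is_network_diverse(with_offsite_secondary))
```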

For the WWW service itself, we have gone ahead with a manual installation of the necessary software and data onto a replacement machine that supports remote hard reboots, rather than finishing, in the short term, the fully automated installation that we were preparing before last week's problems.

Related issues:

None.

Timeline

Event | Date | Time | Comment
Actually Started | | | When did the fault actually start
Fault first detected | | | Nagios/Admin/User ... etc
First Advisory Issued | | | How/To who
First Intervention | | | When you first tried to intervene
Fault Fixed | | | When was the problem resolved
Announced as Fixed | | | How, to who
Downtime(s) Logged in GOCDB | | | at risk/unscheduled down (what components/VOs), repeat as necessary
Other Advisories Issued | | | Where etc, repeat as necessary