RAL Tier1 Incident 20090309 DNS problems caused service outage

Site: RAL-LCG2

Incident Date: 2009-03-09 & 10

Severity: Field not defined yet

Service: All, but especially SRMs

Impacted: Mainly Atlas and LHCb

Incident Summary: A failure of one of the DNS servers, used as the primary by many nodes, caused time-outs and other problems across Tier1 services.

Type of Impact: Degraded

Incident duration: 24 hours (approx.)

Report date: 2009-03-11

Reported by: Gareth Smith

Related URLs: None

Incident details:


Date | Time | Who/What | Entry
2009-03-09 | Mid morning | S. De Witt | Problems noticed with Castor. High latency when logging into boxes.
2009-03-09 | Late morning | C. Kruk | Problems with Castor LSF after a reconfiguration; nothing would start up.
2009-03-09 | Late morning | M. Bly, S. De Witt, AoD (C. Condurache) | Shaun spoke with Martin to ask whether there was a problem with the network; Martin informed Shaun that the primary DNS had a problem. Gradual deterioration in SAM tests. Shaun reported the DNS issue to Gareth/John. AoD (Catalin) kept informed.
2009-03-09 | Early afternoon | G. Smith | Confirmed with local contact (N. Hill) that there was a known problem: on the DNS server with address 130.246.8.13, the daemon would not stay up for more than a couple of minutes.
2009-03-09 | 14:37 | G. Smith | Whole site put into an unscheduled 'At Risk' until 17:00. E-mail sent to gridpp_users and Atlas_UK_comp_operations.
2009-03-09 | 17:00 | G. Smith | No progress reported on the DNS problems. Spoke with the site network team; it became clear they had assumed we could use the secondary, and that they were having difficulty diagnosing the problem.
2009-03-09 | 17:15 | M. Bly | Spoke at length with Nick Moore from ~17:15; clear that they would be unable to diagnose the problem that evening, even with his assistance. DNS left with an auto-restart-if-down check running every 2 minutes.
2009-03-09 | 17:18 | J. Kelly, G. Smith | New 'At Risk' declared in the GOC DB for the whole site until 11:00 the next morning. However, it was mistakenly marked as scheduled.
2009-03-10 | 08:00 (approx.) | M. Bly | Martin began work with the Network group to diagnose the DNS issue. There were problems with the OS version and a lack of available diagnostic tools. Once diagnostics were working, it quickly became clear that a host elsewhere on site was flooding the DNS server. Steps taken to lock out the errant host proved successful, although it was not possible to positively identify and shut down the host at that time.
2009-03-10 | 08:53 | A. Sansum | Andrew proposed a dry run of the Disaster procedure, convening a team at 13:00 that day.
2009-03-10 | 09:18 | M. Bly | Problem on the DNS server resolved.
2009-03-10 | 09:35 | M. Bly | Fix to the DNS server confirmed.
2009-03-10 | 10:45 | S. De Witt, C. Kruk | Problems with the Atlas Castor instance resolved via a restart.
2009-03-10 | 10:48 | G. Smith | 'At Risk' ended in the GOC DB.
2009-03-10 | 11:19 | A. Sansum | Disaster Plan stood down.


Future mitigation:

  • Review the DNS configuration on the Tier1. This review to include:
    • Client configurations on systems, and the methods by which they may be updated: getting the resolver to use the multiple resilient DNS servers already configured (we had three, but failed to use them), e.g. via a more dynamic mechanism for removing a failed DNS server from our configuration (see the resolver sketch after this list).
    • Ensuring that we are doing local caching properly.
    • Considering running our own DNS service that we can fix faster (though if we do this we should be certain that we can do so out of hours), e.g. a Tier1-managed slave DNS (like chilton/130.246.8.13 et al.) in the Tier1 subnet, used as the primary (see the slave-zone sketch after this list).
  • Review the interaction with Networking. This review to include:
    • How we obtain information from them (and feed issues back). Networking have a formal system for broadcasting information about problems (the villages list), from which the departmental representatives should pass it on.
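
As an illustration of the first two points, a minimal sketch of a Linux resolver configuration that actually uses all of the resilient servers instead of pinning everything to one primary. Only 130.246.8.13 and 130.246.72.21 appear in this report; the third address is a hypothetical placeholder, and the option values are suggestions rather than tested settings:

  # /etc/resolv.conf (sketch only)
  nameserver 130.246.8.13    # 'official' site primary (the server that failed)
  nameserver 130.246.72.21   # second site server, used by the batch workers
  nameserver 130.246.0.53    # hypothetical third resilient server
  options timeout:2 attempts:2 rotate
  # timeout:2 - give up on an unresponsive server after 2s (default is 5s)
  # rotate    - spread queries across all listed servers rather than always
  #             sending everything to the first entry

For the local caching point, a caching daemon such as nscd on each node would additionally absorb short server outages before they turn into application time-outs.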
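
For the Tier1-managed slave option, a sketch of what the zone definition might look like in BIND 9, assuming the site masters permit zone transfers to it; the zone name, master address and file path are illustrative placeholders, not values from this incident:

  // named.conf fragment (sketch only)
  zone "example.rl.ac.uk" {
      type slave;
      masters { 192.0.2.1; };            // a site master (placeholder address)
      file "slaves/example.rl.ac.uk.db"; // local copy of the zone data
  };

Because a slave answers from its local copy of the zone, the Tier1 could keep resolving names (and restart the service itself, out of hours if necessary) while an upstream server is being repaired.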


Related issues:

  • The site master DNS servers were not affected by this problem.
  • No single primary DNS is used on the Tier1; it differs between classes of host. The 'official' primary is 130.246.8.13, but the batch workers and several other systems use 130.246.72.21, as (incorrectly) instructed some years back. The disadvantage of using the general site primary is that it receives a lot of requests from everyone else on site (outside the Tier1) and is thus more prone to issues.
  • No formal notification of the DNS problems was received.
  • Until 17:00 there was no direct contact with the Networking team, either to pass back the severity of our problems or to understand theirs.
  • Whilst some systems were showing only intermittent failures, the degradation of some services was possibly sufficient to warrant an Outage (rather than an At Risk) for those services.
  • The setting of the overnight "At Risk" as 'scheduled' (rather than unscheduled) was a mistake.
  • Are we particularly exposed to this type of DNS problem? DNS is fundamental to our operation, so any issue is a problem, not just flooding; a downed primary that went unnoticed for any length of time would cause similar disruption (see the health-check sketch after this list).
  • The root cause of the DNS server struggling was a mis-configured system in another department that was bombarding it with packets. The DNS overload issue has hit Networking before, and they partly suspected that this was the case on this occasion, though they were unable to verify it initially (see the capture one-liner after this list).
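
On the monitoring point, a minimal sketch of a periodic check that would catch a downed or unresponsive DNS server early. Only the two server addresses come from this report; the probe hostname and the alerting hook are hypothetical:

  #!/usr/bin/env python
  # Sketch of a periodic DNS health check (run from cron or similar).
  import socket

  SERVERS = ["130.246.8.13", "130.246.72.21"]  # servers named in this report
  PROBE_NAME = "lcgwww.gridpp.rl.ac.uk"        # hypothetical well-known host

  def dns_query_ok(server, name, timeout=2.0):
      """Send a minimal A-record query directly to `server` over UDP and
      report whether any answer comes back within `timeout` seconds."""
      # Hand-built DNS packet: 12-byte header (ID, flags=RD, QDCOUNT=1)...
      header = b"\x12\x34\x01\x00\x00\x01\x00\x00\x00\x00\x00\x00"
      # ...then the question: length-prefixed labels, QTYPE=A, QCLASS=IN.
      question = b"".join(
          bytes([len(part)]) + part.encode() for part in name.split(".")
      ) + b"\x00\x00\x01\x00\x01"
      sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
      sock.settimeout(timeout)
      try:
          sock.sendto(header + question, (server, 53))
          sock.recv(512)   # any reply within the timeout counts as healthy
          return True
      except OSError:      # timed out, or port unreachable
          return False
      finally:
          sock.close()

  for server in SERVERS:
      if not dns_query_ok(server, PROBE_NAME):
          print("ALERT: no DNS answer from %s" % server)  # hook alerting here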
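
And on the flooding diagnosis, a sketch of the kind of one-liner that ranks source addresses by query volume, assuming tcpdump can be run on or near the affected server (the interface name is a placeholder):

  # Capture 1000 packets aimed at the DNS port and rank the source addresses.
  tcpdump -n -i eth0 -c 1000 'udp dst port 53' 2>/dev/null \
    | awk '{print $3}' | cut -d. -f1-4 | sort | uniq -c | sort -rn | head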

Timeline

Milestone | Date | Time | Comment
Actually started | 2009-03-09 | Mid morning | First indications of something wrong.
Fault first detected | 2009-03-09 | Late morning | Received news of the state of the DNS server.
First advisory issued | 2009-03-09 | 14:37 | Entry in the GOC DB and e-mail to the GridPP_users and Atlas_UK_comp_operations lists.
First intervention | 2009-03-09 | 17:15 | First attempt to work with the Networks team (although they had been working on the problem beforehand).
Fault fixed | 2009-03-10 | 09:18 | DNS server problem resolved, although there were subsequent follow-on problems (triggered by the original fault) within Castor.
Announced as fixed | 2009-03-10 | 10:48 | 'At Risk' ended in the GOC DB and e-mail sent to the GridPP_users and Atlas_UK_comp_operations lists.
Downtime(s) logged in GOC DB | 2009-03-09 & 10 | - | At Risk: unscheduled from 2009-03-09 14:37 to 15:00, extended (although mistakenly marked as scheduled) until 2009-03-10 11:00; this was then terminated at 10:48.
Other advisories issued | 2009-03-09 | 17:22 | E-mail to the GridPP_users and Atlas_UK_comp_operations lists.