RAL Tier1 Incident 20090309 DNS problems caused service outage

Site: RAL-LCG2

Incident Date: 2009-03-09 & 10

Severity: Field not defined yet

Service: All, but especially SRMs

Impacted: Mainly Atlas and LHCb

Incident Summary: A failure of one of the DNS servers, used as the primary by many nodes, caused time-outs and other problems across Tier1 services.

Type of Impact: Degraded

Incident duration: 24 hours (approx.)

Report date: 2009-03-11

Reported by: Gareth Smith

Related URLs: None

Incident details:


Date | Time | Who/What | Entry
2009-03-09 | Mid morning | S. De Witt | Problems noticed with Castor. High latency when logging into boxes.
2009-03-09 | Late morning | C. Kruk | Problems with Castor LSF after a reconfiguration; nothing would start up.
2009-03-09 | Late morning | M. Bly, S. De Witt, AoD (C. Condurache) | Shaun spoke with Martin to ask whether there was a problem with the network; Martin informed Shaun that the primary DNS had a problem. Gradual deterioration in SAM tests. Shaun reported the DNS issue to Gareth/John. AoD (Catalin) kept informed.
2009-03-09 | Early afternoon | G. Smith | Confirmed with local contact (N. Hill) that there was a known problem: on the DNS server with address 130.246.8.13, the daemon would not stay up for more than a couple of minutes.
2009-03-09 | 14:37 | G. Smith | Whole site put into an unscheduled 'At Risk' until 17:00. E-mail sent to gridpp_users and Atlas_UK_comp_operations.
2009-03-09 | 17:00 | G. Smith | No progress reported on the DNS problems. Spoke with the site network team; it became clear they had assumed we could use the secondary, and that they were having difficulty diagnosing the problem.
2009-03-09 | 17:15 | M. Bly | Spoke at length with Nick Moore from ~17:15; clear that they would be unable to diagnose the problem that evening, even with his assistance. DNS left with an auto-restart-if-down check running every 2 minutes.
2009-03-09 | 17:18 | J. Kelly, G. Smith | New 'At Risk' declared in the GOC DB for the whole site until 11:00 the next morning. However, it was mistakenly marked as scheduled.
2009-03-10 | 08:00 (approx.) | M. Bly | Martin began work with the Network group to diagnose the DNS issue. There were problems with the OS version and a lack of available diagnostic tools. Once diagnostics were working, it quickly became clear that a host elsewhere on site was flooding the DNS server. Steps taken to lock out the errant host proved successful, although it was not possible to positively identify and shut down the host at that time.
2009-03-10 | 08:53 | A. Sansum | Andrew proposed a dry run of the Disaster procedure, convening a team at 13:00 that day.
2009-03-10 | 09:18 | M. Bly | Problem on the DNS server resolved.
2009-03-10 | 09:35 | M. Bly | Fix to the DNS server confirmed.
2009-03-10 | 10:45 | S. De Witt, C. Kruk | Problems with the Atlas Castor instance resolved via a restart.
2009-03-10 | 10:48 | G. Smith | 'At Risk' ended in the GOC DB.
2009-03-10 | 11:19 | A. Sansum | Disaster Plan stood down.


Future mitigation:

  • Review the DNS configuration on the Tier1. This review to include:
    • Client configurations on systems, and the methods by which they may be updated: getting the resolver to use the multiple resilient DNS servers already configured (we had three, but failed to use them), e.g. via a more dynamic mechanism for removing a failed DNS server from our configuration (see the resolver sketch after this list).
    • Ensuring that we are doing local caching properly.
    • Considering running our own DNS service that we can fix faster (though if we do this we should be certain that we can do so out of hours), e.g. a Tier1-managed slave DNS (like chilton/130.246.8.13 et al.) in the Tier1 subnet, used as the primary (see the slave-zone sketch after this list).
  • Review the interaction with Networking. This review to include:
    • How we obtain information from them (and feed issues back). Networking have a formal system for broadcasting information about problems (the villages list), from which the departmental representatives should pass it on.
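
As an illustration of the first two points, a minimal sketch of a Linux resolver configuration that actually uses all of the resilient servers instead of pinning everything to one primary. Only 130.246.8.13 and 130.246.72.21 appear in this report; the third address is a hypothetical placeholder, and the option values are suggestions rather than tested settings:

  # /etc/resolv.conf (sketch only)
  nameserver 130.246.8.13    # 'official' site primary (the server that failed)
  nameserver 130.246.72.21   # second site server, used by the batch workers
  nameserver 130.246.0.53    # hypothetical third resilient server
  options timeout:2 attempts:2 rotate
  # timeout:2 - give up on an unresponsive server after 2s (default is 5s)
  # rotate    - spread queries across all listed servers rather than always
  #             sending everything to the first entry

For the local caching point, a caching daemon such as nscd on each node would additionally absorb short server outages before they turn into application time-outs.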
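
For the Tier1-managed slave option, a sketch of what the zone definition might look like in BIND 9, assuming the site masters permit zone transfers to it; the zone name, master address and file path are illustrative placeholders, not values from this incident:

  // named.conf fragment (sketch only)
  zone "example.rl.ac.uk" {
      type slave;
      masters { 192.0.2.1; };            // a site master (placeholder address)
      file "slaves/example.rl.ac.uk.db"; // local copy of the zone data
  };

Because a slave answers from its local copy of the zone, the Tier1 could keep resolving names (and restart the service itself, out of hours if necessary) while an upstream server is being repaired.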


Related issues:

  • The site master DNS servers were not affected by this problem.
  • No single primary DNS is used on the Tier1; it differs between classes of host. The 'official' primary is 130.246.8.13, but the batch workers and several other systems use 130.246.72.21, as (incorrectly) instructed some years back. The disadvantage of using the general site primary is that it receives a lot of requests from everyone else on site (outside the Tier1) and is thus more prone to issues.
  • No formal notification of the DNS problems was received.
  • Until 17:00 there was no direct contact with the Networking team, either to pass back the severity of our problems or to understand theirs.
  • Whilst some systems were showing only intermittent failures, the degradation of some services was possibly sufficient to warrant an Outage (rather than an At Risk) for those services.
  • The setting of the overnight "At Risk" as 'scheduled' (rather than unscheduled) was a mistake.
  • Are we particularly exposed to this type of DNS problem? DNS is fundamental to our operation, so any issue is a problem, not just flooding; a downed primary that went unnoticed for any length of time would cause similar disruption (see the health-check sketch after this list).
  • The root cause of the DNS server struggling was a mis-configured system in another department that was bombarding it with packets. The DNS overload issue has hit Networking before, and they partly suspected that this was the case on this occasion, though they were unable to verify it initially (see the capture one-liner after this list).
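
On the monitoring point, a minimal sketch of a periodic check that would catch a downed or unresponsive DNS server early. Only the two server addresses come from this report; the probe hostname and the alerting hook are hypothetical:

  #!/usr/bin/env python
  # Sketch of a periodic DNS health check (run from cron or similar).
  import socket

  SERVERS = ["130.246.8.13", "130.246.72.21"]  # servers named in this report
  PROBE_NAME = "lcgwww.gridpp.rl.ac.uk"        # hypothetical well-known host

  def dns_query_ok(server, name, timeout=2.0):
      """Send a minimal A-record query directly to `server` over UDP and
      report whether any answer comes back within `timeout` seconds."""
      # Hand-built DNS packet: 12-byte header (ID, flags=RD, QDCOUNT=1)...
      header = b"\x12\x34\x01\x00\x00\x01\x00\x00\x00\x00\x00\x00"
      # ...then the question: length-prefixed labels, QTYPE=A, QCLASS=IN.
      question = b"".join(
          bytes([len(part)]) + part.encode() for part in name.split(".")
      ) + b"\x00\x00\x01\x00\x01"
      sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
      sock.settimeout(timeout)
      try:
          sock.sendto(header + question, (server, 53))
          sock.recv(512)   # any reply within the timeout counts as healthy
          return True
      except OSError:      # timed out, or port unreachable
          return False
      finally:
          sock.close()

  for server in SERVERS:
      if not dns_query_ok(server, PROBE_NAME):
          print("ALERT: no DNS answer from %s" % server)  # hook alerting here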
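
And on the flooding diagnosis, a sketch of the kind of one-liner that ranks source addresses by query volume, assuming tcpdump can be run on or near the affected server (the interface name is a placeholder):

  # Capture 1000 packets aimed at the DNS port and rank the source addresses.
  tcpdump -n -i eth0 -c 1000 'udp dst port 53' 2>/dev/null \
    | awk '{print $3}' | cut -d. -f1-4 | sort | uniq -c | sort -rn | head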

Timeline

Milestone | Date | Time | Comment
Actually started | 2009-03-09 | Mid morning | First indications of something wrong.
Fault first detected | 2009-03-09 | Late morning | Received news of the state of the DNS server.
First advisory issued | 2009-03-09 | 14:37 | Entry in the GOC DB and e-mail to the GridPP_users and Atlas_UK_comp_operations lists.
First intervention | 2009-03-09 | 17:15 | First attempt to work with the Networks team (although they had been working on the problem beforehand).
Fault fixed | 2009-03-10 | 09:18 | DNS server problem resolved, although there were subsequent follow-on problems (triggered by the original fault) within Castor.
Announced as fixed | 2009-03-10 | 10:48 | 'At Risk' ended in the GOC DB and e-mail sent to the GridPP_users and Atlas_UK_comp_operations lists.
Downtime(s) logged in GOC DB | 2009-03-09 & 10 | - | At Risk: unscheduled from 2009-03-09 14:37 to 15:00, extended (although mistakenly marked as scheduled) until 2009-03-10 11:00; this was then terminated at 10:48.
Other advisories issued | 2009-03-09 | 17:22 | E-mail to the GridPP_users and Atlas_UK_comp_operations lists.