The Hidden DNS Tax - Cascading Timeouts and Errors

This is a guest post by Nick Burling, VP of Product Management at BlueStripe.

Readers of High Scalability are well versed in performance optimization techniques. Reverse proxies, Varnish, Redis: you hear about them daily. But what you may not realize is that one of the oldest technologies in your stack can be one of your biggest bottlenecks: DNS.

People don't spend a lot of time thinking about DNS. It's not sexy. It's an infrastructure service, and it's just supposed to work.

At BlueStripe, we work with many teams running applications that support millions of web requests a day. We keep seeing DNS delays and errors that the platform operations team never knows about. It's so common we've started calling it the Hidden DNS Tax.

What is the Hidden DNS Tax?

The Hidden DNS Tax is a hard-to-see performance hit your users take from DNS timeouts and errors in your back-end architecture. We've seen it bring down the main web application for a Fortune 10 company.

A DNS lookup can fail, the user gets an error, and the application logs may never record it. Or a web server may do an IPv4 lookup that hits the timeout threshold after 5 seconds. This generates a new request, and if the DNS server still isn't available, another timeout. And so on, potentially creating delays that cascade through your applications.

The DNS Tax is particularly hard to identify, because the transaction requests eventually go through - so they don’t show up in the log files as errors.  They just take a long time.
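
To get a feel for how quickly these waits add up, here is a back-of-the-envelope sketch in Python. The model is a simplifying assumption, not a description of how any particular resolver schedules retries: every lookup times out, each tier makes a fixed number of attempts, and the tiers wait on each other in sequence. The function name and the three-tier example are hypothetical.

```python
def worst_case_dns_delay(timeout_s: float, attempts: int, tiers: int) -> float:
    """Upper bound on added latency when every lookup attempt times out.

    Simplified model (an assumption for illustration): each tier performs one
    blocking lookup, makes `attempts` tries in total, and the tiers run one
    after another.
    """
    if timeout_s < 0 or attempts < 1 or tiers < 1:
        raise ValueError("timeout_s must be >= 0; attempts and tiers must be >= 1")
    return timeout_s * attempts * tiers


# A single web server with a 5-second timeout and two attempts.
print(worst_case_dns_delay(5.0, 2, 1))

# Hypothetical chain: web tier -> app tier -> database tier,
# each resolving its own back end before it can continue.
print(worst_case_dns_delay(5.0, 2, 3))
```

Under these assumptions the single server adds 10 seconds and the three-tier chain adds 30, with no error ever reaching a log file.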

What causes the Hidden DNS Tax?

The good news is that the Hidden DNS Tax is largely caused by configuration errors – human mistakes that can be quickly corrected.  Here’s how they happen:

1. DNS Errors: non-existent domain lookups.

In a web-scale architecture with rapid release cycles, human error will happen. The code pushed to production may still be looking up the pre-production database and getting back response code 3 (NXDOMAIN, non-existent domain). Or the database location wasn't updated in the DNS records during the release.

The effect of these DNS errors depends on how the application handles them. We've seen application requests that are blocked without successful DNS resolution. Others (fairly) gracefully handle 10,000 DNS errors every few hours. In either case, the errors need to be fixed.
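
As a rough illustration of handling such errors gracefully while still making them visible, the Python sketch below catches a failed lookup, counts it, and returns an empty result instead of crashing. The function name, the in-memory counter, and the "not_found"/"other" labels are assumptions made for this example; a real application would send the counts to its own logging or metrics system.

```python
import socket
from collections import Counter

# Failure counts keyed by (hostname, reason). This is an in-memory stand-in
# for a real metrics or logging system.
dns_failures = Counter()


def resolve_or_report(hostname, port=0, resolver=socket.getaddrinfo):
    """Return a sorted list of IP addresses, or [] after recording the failure."""
    try:
        results = resolver(hostname, port)
    except socket.gaierror as exc:
        # EAI_NONAME is how getaddrinfo typically reports a name that does
        # not resolve, which includes non-existent domain answers.
        reason = "not_found" if exc.errno == socket.EAI_NONAME else "other"
        dns_failures[(hostname, reason)] += 1
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address as a string.
    return sorted({item[4][0] for item in results})
```

The `resolver` parameter exists only so the lookup can be swapped out in tests. The point of the counter is that a stale pre-production hostname shows up as a steadily growing number rather than as silence.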

2. DNS Timeouts: busy or unavailable DNS servers.

The most pernicious kind of DNS Tax is even harder to spot. If a DNS server is overwhelmed or unavailable, the application doing a lookup waits for a timeout window before giving up. This window is often five seconds or more, and the user is left waiting the whole time.
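
One way an application can limit how long a user waits on a slow lookup is to put its own deadline around the call. The Python sketch below is a minimal example of that idea, not a recommendation from the article: it runs the blocking lookup in a worker thread and stops waiting after a caller-chosen deadline. The names `resolve_with_deadline` and `LookupDeadlineExceeded` and the pool size of 4 are assumptions for illustration.

```python
import concurrent.futures
import socket

# Shared worker pool; the size is an arbitrary choice for this sketch.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


class LookupDeadlineExceeded(Exception):
    """Raised when a lookup does not finish within the caller's deadline."""


def resolve_with_deadline(hostname, deadline_s, port=0, resolver=socket.getaddrinfo):
    """Run a blocking lookup in a worker thread; stop waiting after deadline_s."""
    future = _pool.submit(resolver, hostname, port)
    try:
        return future.result(timeout=deadline_s)
    except concurrent.futures.TimeoutError:
        # Only the caller stops waiting. The worker thread stays busy until
        # the underlying resolver gives up on its own.
        raise LookupDeadlineExceeded(
            f"{hostname} not resolved within {deadline_s}s"
        ) from None
```

This bounds the delay seen by one request, but it does not remove the cause. If the DNS server stays unavailable, the worker threads fill up with stalled lookups, so the deadline errors themselves need to be counted and investigated.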

How Do I Detect the Hidden DNS Tax?

Option 1: Logs.

Many DNS systems will log error codes. The challenge is that timeouts won't be logged, since the server wasn't available to handle the request. It's also hard to correlate errors in the logs of a busy DNS server with the application component that was affected, and by how much.
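
Because the server cannot log a request it never handled, the calling side is one place where slow and failed lookups can be captured and tied to a component. The Python sketch below shows the general idea under stated assumptions: the function name, the record fields, and the one-second "slow" threshold are all hypothetical, and this is not how any particular monitoring product works.

```python
import socket
import time


def timed_lookup(component, hostname, port=0,
                 resolver=socket.getaddrinfo, slow_after_s=1.0):
    """Resolve hostname and return a record describing how the lookup went."""
    started = time.monotonic()
    try:
        resolver(hostname, port)
        outcome = "ok"
    except socket.gaierror:
        # Covers both "name not found" and lookups the resolver abandoned.
        outcome = "error"
    elapsed = time.monotonic() - started
    return {
        "component": component,   # which part of the application asked
        "hostname": hostname,     # what it tried to resolve
        "seconds": elapsed,       # how long the caller was blocked
        "outcome": outcome,
        "slow": elapsed >= slow_after_s,
    }
```

A record with `outcome` equal to "ok" and `slow` set to true is exactly the case described above: the request eventually went through, so nothing was logged as an error, yet the user still waited.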

Option 2: Transaction monitoring.

There are tools available that can trace the execution of a user request across your architecture. The better ones show when a component made DNS requests, how long they took, and whether there were errors. Shameless plug: we've got one you can try out at http://bluestripe.com/express.


Building a scalable website is always a challenge. In many cases there are DNS-related performance gains available, if you can detect the errors. Keep in mind that:

  • DNS impacts performance
  • Many teams don't think to watch for DNS problems, or don't have a way to monitor them
  • DNS errors are caused by mistakes; timeouts come from DNS availability issues
  • You need to be monitoring how DNS impacts the application and its users

Avoid making your users pay the Hidden DNS Tax, and help improve your overall user experience.

Reader Comments (3)

How is this service better than Amazon Route 53?

September 17, 2013 | Unregistered Commenter Kenneth

There's another DNS problem at the Hadoop cluster scale, which is simply "DNS overload". If 1000 servers come up, each reading in their hadoop-site.xml and doing an nslookup of the various servers in there, central enterprise DNS servers can get quite a surprise. Having the in-cluster IP Addresses stored in /etc/hosts avoids this.

That may appear to increase the cost of changing IP address on those central services, but as Java processes cache IP addresses forever by default, you need to always bounce all the worker nodes when the master nodes change IP.

If your Hadoop cluster is doing a lot more external DNS lookups, consider a caching DNS server per node, or at the very least, a set for the entire cluster's benefit.

September 23, 2013 | Unregistered Commenter SteveL


Shameless plug: we've got one you can try out at http://bluestripe.com/express.

I signed up for the trial.

Install the FactFinder Console on your workstation. The Console must be installed on a Windows machine.

Questionable product decisions aside, that's really something you want to mention anywhere on the Express website, before registering and reaching the download page.

October 3, 2013 | Unregistered Commenter Michael
