The Hidden DNS Tax - Cascading Timeouts and Errors

This is a guest post by Nick Burling, VP of Product Management at BlueStripe.

Readers of High Scalability are well versed in performance optimization techniques. Reverse proxies, Varnish, Redis: you hear about them daily. But what you may not realize is that one of the oldest technologies in your stack can also be one of your biggest bottlenecks: DNS.

People don't spend a lot of time thinking about DNS. It's not sexy. It's an infrastructure service, and it's just supposed to work.

At BlueStripe, we work with many teams running applications that support millions of web requests a day. We keep seeing DNS delays and errors that the platform operations team never knows about. It's so common we've started calling it the Hidden DNS Tax.

What is the Hidden DNS Tax?

The Hidden DNS Tax is a hard-to-see performance hit your users take from DNS timeouts and errors in your back-end architecture. We've seen it bring down the main web application for a Fortune 10 company.

A DNS lookup can fail and the user gets an error, yet the application logs may never record it. Or a web server may do an IPv4 lookup that hits the timeout threshold after 5 seconds. The timeout triggers a retry, and if the DNS server still isn't available, the retry times out too. And so on, potentially creating delays that cascade through your applications.

The Hidden DNS Tax is particularly hard to identify because the slow transaction requests eventually go through, so they don't show up in the log files as errors. They just take a long time.
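To make the cascade concrete, here is a back-of-the-envelope sketch in Python. It assumes a resolver that waits a fixed timeout for each configured DNS server and repeats the whole list a fixed number of times. The 5-second timeout is the figure mentioned above; the attempt count, the number of servers, and the number of lookups per request are illustrative assumptions, not measurements from any real system.

```python
def worst_case_dns_wait(timeout_s: float, attempts: int, nameservers: int) -> float:
    """Upper bound on how long one lookup can block when no server answers.

    Assumes the resolver waits `timeout_s` for each server on each attempt.
    """
    return timeout_s * attempts * nameservers


def worst_case_request_wait(lookups_per_request: int, per_lookup_s: float) -> float:
    """Upper bound when one user request triggers several sequential lookups."""
    return lookups_per_request * per_lookup_s


if __name__ == "__main__":
    # Hypothetical setup: 5 s timeout, 2 attempts, 2 DNS servers.
    per_lookup = worst_case_dns_wait(timeout_s=5.0, attempts=2, nameservers=2)
    print(per_lookup)  # 20.0
    # Hypothetical request that resolves 3 back-end hostnames one after another.
    print(worst_case_request_wait(3, per_lookup))  # 60.0
```

Under these assumptions a single request could stall for a full minute and still complete without an error, which is why the tax stays hidden.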

What causes the Hidden DNS Tax?

The good news is that the Hidden DNS Tax is largely caused by configuration errors – human mistakes that can be quickly corrected.  Here’s how they happen:

1. DNS Errors: non-existent domain lookups.

In a web-scale architecture with rapid release cycles, human error will happen. The code pushed to production may still be looking up the pre-production database and getting response code 3 (NXDOMAIN, non-existent domain) back. Or the database location wasn't updated in the DNS records during the release.

The effect of these DNS errors depends on how the application handles them. We've seen applications whose requests block until DNS resolution succeeds. Others handle 10,000 DNS errors every few hours (fairly) gracefully. In either case, the errors need to be fixed.
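As a minimal sketch of what "handling them" can look like on the application side, the Python function below separates a name that does not exist from a temporary resolver failure. It uses the standard library's `socket.getaddrinfo`; the function name `classify_lookup` and the status strings are our own inventions for illustration, and the mapping from DNS response codes to `EAI_*` error codes varies by platform, so treat the categories as approximate.

```python
import socket


def classify_lookup(hostname, resolver=socket.getaddrinfo):
    """Return (status, addresses); status is 'ok', 'not_found', 'temporary' or 'error'."""
    try:
        results = resolver(hostname, None)
    except socket.gaierror as exc:
        if exc.errno == socket.EAI_NONAME:
            # Typically a non-existent domain, e.g. a stale pre-production hostname.
            return "not_found", []
        if exc.errno == socket.EAI_AGAIN:
            # The resolver could not get an answer right now; retrying may help.
            return "temporary", []
        return "error", []
    # Each result is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    addresses = sorted({info[4][0] for info in results})
    return "ok", addresses
```

A "not_found" result for a hostname that production code depends on points at a configuration mistake, and is worth alerting on rather than silently retrying.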

2. DNS Timeouts: busy or unavailable DNS servers.

The most pernicious kind of DNS Tax is even harder to spot. If a DNS server is overwhelmed or unavailable, the application doing a lookup waits for a timeout window before giving up. This window is often five seconds or more, and the user is left waiting the whole time.
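One way an application can avoid waiting out the full window is to put its own deadline around the lookup. The Python sketch below is one possible approach, not a recommendation for every stack: it runs the standard `socket.getaddrinfo` call in a worker thread and stops waiting after `deadline_s`. The function name, the 1-second default, and the pool size are assumptions chosen for illustration.

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as LookupTimeout

_pool = ThreadPoolExecutor(max_workers=4)


def resolve_with_deadline(hostname, deadline_s=1.0, resolver=socket.getaddrinfo):
    """Return (addresses_or_None, elapsed_seconds); None means the deadline passed.

    Resolver errors such as socket.gaierror are re-raised to the caller.
    """
    start = time.monotonic()
    future = _pool.submit(resolver, hostname, None)
    try:
        results = future.result(timeout=deadline_s)
    except LookupTimeout:
        # The worker thread keeps waiting on the resolver; only the caller gives up.
        return None, time.monotonic() - start
    return [info[4][0] for info in results], time.monotonic() - start
```

Note the trade-off: the caller gets control back quickly, but the abandoned lookup still occupies a worker thread until the resolver's own timeout expires, so a small pool can fill up during a DNS outage.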

How Do I Detect the Hidden DNS Tax?

Option 1: Logs.

Many DNS systems will log error codes. The challenge is that timeouts won't be logged, since the server wasn't available to handle the request. It's also hard to correlate errors in the logs of a busy DNS server with the application component that was harmed, and by how much.
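Because the DNS server can't log a request it never handled, one complementary approach is to record lookups on the client side, where both the calling component and the elapsed time are known. The Python sketch below wraps the standard `socket.getaddrinfo` call; the logger name, the record fields, and the 0.5-second "slow" threshold are hypothetical choices for illustration.

```python
import logging
import socket
import time

log = logging.getLogger("dns_tax")


def timed_lookup(component, hostname, slow_after_s=0.5, resolver=socket.getaddrinfo):
    """Resolve `hostname` and return a record of who asked, how long it took, and the outcome."""
    start = time.monotonic()
    outcome = "ok"
    try:
        resolver(hostname, None)
    except socket.gaierror as exc:
        outcome = f"error:{exc.errno}"
    elapsed = time.monotonic() - start
    record = {
        "component": component,
        "hostname": hostname,
        "seconds": round(elapsed, 3),
        "outcome": outcome,
    }
    # Log only failures and slow lookups so healthy traffic stays quiet.
    if outcome != "ok" or elapsed > slow_after_s:
        log.warning("dns lookup problem: %s", record)
    return record
```

Aggregating these records per component shows which part of the application is paying the tax, which is exactly what the DNS server's own logs can't tell you.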

Option 2: Transaction monitoring.

There are tools available that can trace the execution of a user request across your architecture. The better ones show when a component made DNS requests, how long they took, and whether there were errors. Shameless plug: we've got one you can try out at http://bluestripe.com/express.

Summary

Building a scalable website is always a challenge. In many cases there are DNS-related performance gains available, if you can detect the errors. Keep in mind that:

  • DNS impacts performance
  • Many teams don't think to watch for DNS problems, or don't have a way to monitor them
  • DNS errors are caused by mistakes; timeouts come from DNS availability issues
  • You need to be monitoring how DNS impacts the application and its users

Avoid making your users pay the Hidden DNS Tax, and improve your overall user experience.