Building Globally Distributed, Mission Critical Applications: Lessons From the Trenches Part 2

This is Part 2 of a guest post by Kris Beevers, founder and CEO, NSONE, a purveyor of a next-gen intelligent DNS and traffic management platform. Here's Part 1.

Integration and functional testing is crucial

Unit testing is hammered home in every modern software development class. It’s good practice. Whether you’re doing test-driven development or just banging out code, you can’t be sure a piece of code does what it’s supposed to unless you test it carefully and keep those tests passing as your code evolves.

In a distributed application, your systems will break even if you have the world’s best unit testing coverage. Unit testing is not enough.

You need to test the interactions between your subsystems. What if a particular piece of configuration data changes – how does that impact Subsystem A’s communication with Subsystem B? What if you changed a message format – do all the subsystems generating and handling those messages continue to talk with each other? Does a particular kind of request that depends on results from four different backend subsystems still result in a correct response after your latest code changes?

Unit tests don’t answer these questions, but integration tests do. Invest time and energy in your integration testing suite, and put a process in place for integration testing at all stages of your development and deployment process. Ideally, run integration tests on your production systems, all the time.
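To make that concrete, here’s a rough sketch of what such a test might look like: it exercises a request path end to end against a running deployment rather than mocking each subsystem. The endpoints and field names are hypothetical, not our actual API.

```python
import time
import requests

BASE = "https://api.staging.example.com"   # hypothetical staging endpoint

def test_record_propagates_to_edge():
    # 1. Write a config change through the public API.
    r = requests.put(f"{BASE}/zones/test.example/records/www",
                     json={"answers": ["203.0.113.10"]}, timeout=5)
    assert r.status_code == 200

    # 2. Poll a different subsystem's view of the same data (here, a
    #    hypothetical edge-facing endpoint). Propagation is asynchronous,
    #    so allow a bounded window before declaring failure.
    deadline = time.time() + 30
    while time.time() < deadline:
        r = requests.get(f"{BASE}/edge/answers/www.test.example", timeout=5)
        if r.ok and "203.0.113.10" in r.json().get("answers", []):
            return
        time.sleep(1)
    raise AssertionError("change never propagated to the edge view")
```

A unit test would have passed with the message bus mocked out; only a test like this catches the case where the API accepts the change but the edge never sees it.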

There is no such thing as service-interrupting maintenance

If you’re building a truly mission critical application, one your customers depend upon to operate their businesses, then there is no off switch. It can never stop working. You never get a service-interrupting maintenance window. Even the most complex backend architectural changes must happen without a blip.

This is one of the reasons you should think hard about architecture up front. A few hours of whiteboarding could save months of effort down the line.

One example: at NSONE, we’ve been lucky enough to get most of our architecture right from the beginning. One thing we didn’t get right at first: we accept high-frequency data feeds into our platform that affect how we answer queries for DNS records with complex traffic management configs. Data feeds can apply across multiple DNS records, so, say, a feed of server load telemetry can inform traffic routing decisions for multiple websites hosted on that server. We assumed you’d only ever connect a single data feed to a few DNS records, so we saved some time and effort early on by expanding each data feed coming into our systems into multiple messages, one per connected DNS record, to be pushed to our edge locations. We assumed wrongly: some of our favorite customers connect data feeds to thousands of DNS records! Our early laziness was causing us to DoS ourselves internally, and we knew it was only going to get worse as we continued to grow.

The problem: we couldn’t just backtrack and fix things by sending fewer control messages. We needed to change the data model and our messaging model, along with 4-5 interacting, always-on systems. What would have taken 2-3 hours of extra thought and effort early on turned into a six-week marathon: brainstorming sessions, deeply complex refactoring, massive correctness testing efforts, and a series of carefully coordinated deployments and migrations, all to address the issue without interrupting service.
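To make the trade-off concrete, here’s a simplified sketch of the two messaging models – the names and structures are made up, not our actual data model. Fanning out at ingest multiplies message volume by the number of connected records; publishing once and resolving the feed-to-record mapping at the edge does not.

```python
# Simplified sketch of two messaging models (hypothetical structures).
# `publish` stands in for the internal message bus.

# Naive model: expand each feed update into one message per connected
# record at ingest. With thousands of records per feed, a single update
# floods the internal message bus.
def publish_fanout(publish, feed_id, value, records_for_feed):
    for record in records_for_feed[feed_id]:
        publish({"record": record, "feed": feed_id, "value": value})

# Revised model: publish each feed update exactly once; every edge keeps
# its own copy of the feed -> records mapping and applies it locally.
def publish_once(publish, feed_id, value):
    publish({"feed": feed_id, "value": value})

def apply_at_edge(feed_to_records, record_state, message):
    for record in feed_to_records.get(message["feed"], []):
        record_state[record] = message["value"]
```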

While that’s an extreme case, every deployment of new infrastructure or code needs to be seamless: careful planning, rolling restarts, constant integration testing. Once you’re serving customers, there’s no shutting down.

Automate deployments and config management – with extreme care

The modern devops ecosystem is awash with tools for deployment automation and configuration management: Chef, Puppet, Ansible, SaltStack, Terraform, and what seems like zillions more. Which tools you’re using doesn’t matter as much as you think – do your reading and decide which model makes sense for you. But what does matter is that you are using these tools. Do not manage your configurations or your deployments by hand, even from the earliest stages of your company, even when it seems quicker or easier: you will make mistakes, you’ll restrict your ability to scale, and it’ll be dramatically harder to retrofit automation to a moving target at even moderate scale.

But! Be careful: with great power comes great responsibility. Deployment automation tools enable you to shoot your platform in the head like nothing else.

Managing iptables rules for all your hosts with Chef? One tired devops engineer and a single button push can disable your platform globally. Pushing out a new feature which you swear you’ve tested up, down, left and right in your staging environment? An end-to-end automated deployment will kill your product dead when you run into that one subtle difference between real-world and simulated traffic. Use automation wisely.

We manage NSONE’s configs and deployments with Ansible. It’s a great tool, with plenty of quirks. We could automate everything and push new DNS delivery code to all our edge locations in a single button press, but we will absolutely never do that. We roll out deployments facility by facility, from lowest traffic to highest. Within a facility, we roll out server by server, and even core by core, running a comprehensive functional testing suite at every step of the way. Potentially performance-impacting changes burn in while we study our metrics, sometimes for hours or days, before we move on to more critical facilities. And before we even get started with a deployment, we sign off on comprehensive code reviews, not just for our application code, but for our Ansible playbooks and config as well.
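Here’s a rough sketch of that rollout loop; the helper functions are hypothetical stand-ins for the real Ansible plays, functional test suite, and metrics checks.

```python
# Rough sketch of a facility-by-facility, server-by-server rollout.
# deploy_to, run_functional_tests, rollback, and burn_in_ok are stand-ins
# for real Ansible plays, a functional test suite, and metrics checks.

def deploy_to(server, version):
    print(f"deploying {version} to {server}")   # stand-in

def run_functional_tests(server):
    return True                                 # stand-in: run the real suite

def rollback(server):
    print(f"rolling back {server}")             # stand-in

def burn_in_ok(facility, hours):
    return True                                 # stand-in: watch real metrics

FACILITIES = {                  # ordered lowest traffic to highest
    "low-traffic-pop":  ["edge1", "edge2"],
    "mid-traffic-pop":  ["edge3", "edge4"],
    "high-traffic-pop": ["edge5", "edge6"],
}

def rollout(version):
    for facility, servers in FACILITIES.items():
        for server in servers:                  # one server at a time
            deploy_to(server, version)
            if not run_functional_tests(server):
                rollback(server)
                raise RuntimeError(f"functional tests failed on {server}")
        # Burn in before touching a more critical facility.
        if not burn_in_ok(facility, hours=4):
            raise RuntimeError(f"metrics regressed in {facility}")

rollout("v2.3.1")
```

The point isn’t this particular loop; it’s that the rollout never touches the next, more critical tier until the previous one has proven itself.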

Put in place a process that makes sense for your team and your application, but don’t forget that while automation enables you to grow fast, it can kill you fast as well.

Implement fire drills

Bad things happen. Every tech company will have servers fail. Since we started NSONE, we’ve had every kind of server failure imaginable: disk failures, NIC failures, RAM corruption, kernel panics, noisy neighbor side effects in virtualized environments, and everything in between. Server failures are the easy ones.

Power will go out. Remember Hurricane Sandy? Across the street from NSONE’s current offices in Lower Manhattan, technicians were lugging diesel fuel up flights of stairs in buckets to keep infrastructure online.

Fiber will be cut. BGP will be hijacked. You will be DoSed, with varying levels of sophistication: from script kiddies sending 64k ICMP packets out of their parents’ basement, to full-on DNS and NTP reflection amplification attacks DDoSing your infrastructure at millions or hundreds of millions of packets per second.

What will you do?

The only way to get it right is to practice. Simulate the bad things before they happen. Netflix’s Chaos Monkey is one well-known example. You can’t always simulate every kind of incident directly, but do your best to simulate your response. It’s the only way to ensure that when the time comes, you will stay calm, use the tools you’ve put in place, and react efficiently in a difficult situation.
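A drill doesn’t have to be fancy. Here’s a hypothetical sketch in the spirit of Chaos Monkey: pick a host you’re allowed to break, kill a service on it, and time how long detection and recovery take. The host list and service name below are made up.

```python
# Minimal fire-drill sketch in the spirit of Chaos Monkey. The host list
# and service name are made up; run this only against systems you are
# explicitly allowed to break.
import random
import subprocess
import time

DRILL_HOSTS = ["edge1.staging.example.com", "edge2.staging.example.com"]

def run_drill():
    victim = random.choice(DRILL_HOSTS)
    started = time.time()
    # Simulate a crash by stopping the service on the chosen host over ssh.
    subprocess.run(["ssh", victim, "sudo", "systemctl", "stop", "myapp"],
                   check=True)
    print(f"Drill started: stopped myapp on {victim}.")
    # Now the team (or your automation) has to notice via monitoring, fail
    # traffic away, and restore service. Record how long each step takes
    # and review it afterwards.
    return victim, started

if __name__ == "__main__":
    run_drill()
```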

Minimize surface area

As your application becomes more widely distributed, the surface area exposed to malicious attack increases dramatically unless you take careful precautions.  Lock your systems down and minimize the attack surface exposed to the internet.

Each role in your architecture should expose the services it provides only to the set of systems that need to access them. Your Internet-facing systems should expose the services you provide to your users, and nothing else. On the backend, your systems should interact with each other over private IP space when possible, and when that’s not feasible – say, for communication across far-flung facilities – data should flow through encrypted channels. Use firewalling aggressively, whether through provider tools like AWS’s security groups, automated management of iptables rules, router ACLs, hardware firewalls, or some other mechanism. Deny first, and allow specific traffic on a per-role basis.
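As an illustration of the deny-first, per-role idea, here’s a sketch that generates iptables rules from a small allow-list; the roles, CIDR blocks, and ports are made-up examples.

```python
# Sketch: generate deny-first iptables rules from a per-role allow-list.
# The roles, CIDR blocks, and ports are made-up examples.

ALLOW = {
    "api":  [("tcp", "10.0.1.0/24", 443)],   # only the load-balancer tier
    "db":   [("tcp", "10.0.2.0/24", 5432)],  # only the api tier
    "edge": [("udp", "0.0.0.0/0", 53),       # DNS is the one public service
             ("tcp", "0.0.0.0/0", 53)],
}

def iptables_rules(role):
    rules = [
        "iptables -P INPUT DROP",                                  # deny first
        "iptables -A INPUT -i lo -j ACCEPT",
        "iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT",
    ]
    for proto, cidr, port in ALLOW[role]:                          # per-role allows
        rules.append(f"iptables -A INPUT -p {proto} -s {cidr} "
                     f"--dport {port} -j ACCEPT")
    return rules

print("\n".join(iptables_rules("api")))
```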

Never allow direct operator access, such as via ssh, to your production systems.  Humans should enter your infrastructure via heavily locked down bastion hosts with multi-factor authentication, port knocking, IP whitelisting, and other pretty darn draconian security policies in place. Make sure your bastion hosts are distributed across diverse networks and geographies. Entry to the rest of your production infrastructure should be restricted to sessions initiated from your bastion hosts.

There are a million strategies for locking down your systems. I’ve described some practices we’ve found effective, but they’re by no means exhaustive. Do your reading, and think about security from the very beginning: it’s part of your architecture. Don’t let it lapse to save time as your systems scale.

Understand the provider landscape

Almost every modern tech company builds the first versions of its product on AWS, DigitalOcean, or some other reasonably low-cost, low-barrier cloud infrastructure provider. But the Internet existed before AWS, and the range of infrastructure providers has only widened in the last decade.

Most companies needn’t rush to abandon AWS as they scale. But every company should preserve optionality when possible.

You may find that particular workloads in your application are best served by bare metal, or that you need a CDN with great throughput in Australia, or that you’ve got a key set of users in markets without an AWS presence. Take some time and familiarize yourself with various aspects of the infrastructure ecosystem: IaaS providers, colocation, DNS, CDN, monitoring, and more. Have an idea, as you think about your architecture early on, where you might need to look as your traffic grows and your platform needs to scale. Be ready to move fast.

At NSONE, we operate a sophisticated global anycasted DNS delivery network. AWS made sense while we were prototyping, but well ahead of launching our platform, we had to move elsewhere because of our networking needs. Building a competitive anycasted DNS network, tuned for world-class reliability and performance, depended largely on our ability to navigate the complex hosting and carrier landscape, pushing infrastructure and network providers to support the depth of control we needed. In many cases, we educated providers and helped them put in place the capabilities we needed.

Don’t forget: infrastructure operators are geeks too, and they’re often open to new ideas and challenges. Learn about the ecosystem and engage.

Wrapping up

I’ve walked through a wide variety of lessons we’ve learned building and scaling distributed, mission critical systems. You’ll never get everything right from the beginning, but you can put yourself in a good position to react gracefully and efficiently as your company grows.

Audiences are global. Application delivery has followed suit, and any new company building an online property should think about how to architect and scale in a distributed fashion to provide the highest quality of service to its users, maximizing performance, reliability, and security.

If you missed the first part of this post then here's Part 1.