Monday, May 1, 2017 at 9:03AM

This is a guest repost by G Gordon Worley III, Head of Site Reliability Engineering at AdStage.

When I joined AdStage in the Fall of 2013 we were already running on Heroku. It was the obvious choice: super easy to get started with, less expensive than full-sized virtual servers, and flexible enough to grow with our business. And grow we did. Heroku let us focus exclusively on building a compelling product without the distraction of managing infrastructure, so by late 2015 we were running thousands of dynos (containers) simultaneously to keep up with our customers.

We needed all those dynos because, on the backend, we look a lot like Segment, and, like them, many of our costs scale linearly with the number of users. At $25/dyno/month, our growth projections, combined with our other technical costs, put us on track to break $1 million in annual infrastructure expenses by mid-2016, and that made up such a large proportion of COGS that it would take years to reach profitability. The situation was, to be frank, unsustainable. The engineering team met to discuss our options, and some quick calculations showed us we were paying more than $10,000 a month for the convenience of Heroku over what similar resources would cost directly on AWS. That was enough to justify an engineer working full-time on infrastructure if we migrated off Heroku, so I was tasked with becoming our first Head of Operations and spearheading our migration to AWS.
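To make the arithmetic behind that decision concrete, here is a rough back-of-the-envelope sketch. The $25 dyno price and the $10,000 monthly premium come from the figures above; the dyno count of 2,000 is purely an illustrative assumption, since the exact number is not stated here.

```python
# Back-of-the-envelope cost arithmetic (a sketch, not our actual model).
DYNO_PRICE_PER_MONTH = 25          # dollars per dyno per month, from the text
HEROKU_PREMIUM_PER_MONTH = 10_000  # dollars, the "more than" lower bound from the text


def annual_cost(monthly_cost):
    """Convert a monthly dollar figure into a yearly one."""
    return monthly_cost * 12


def monthly_dyno_bill(dyno_count):
    """Monthly dyno spend for a given number of dynos."""
    return dyno_count * DYNO_PRICE_PER_MONTH


# Yearly premium paid for Heroku's convenience, at the lower bound.
annual_premium = annual_cost(HEROKU_PREMIUM_PER_MONTH)

# Hypothetical dyno count, chosen only to illustrate the linear scaling.
example_monthly_bill = monthly_dyno_bill(2_000)

print(annual_premium)        # 120000
print(example_monthly_bill)  # 50000
```

At the lower bound, the premium alone comes to $120,000 a year, which is the figure we weighed against the cost of a full-time infrastructure engineer.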

It was good timing, too, because Heroku had become our biggest constraint. Our engineering team had adopted a Kanban approach, so ideally we would have a constant flow of stories moving from conception to completion. At the time, though, we were generating lots of work-in-progress that routinely clogged our release pipeline. Work was slow to move through QA and often got sent back for bug fixes. Too often things “worked on my machine” but would fail when exposed to our staging environment. Because AdStage is a complex mix of interdependent services written on different tech stacks, it was hard for each developer to keep their workstation up-to-date with production, and this also made deploying to staging and production a slow process requiring lots of manual intervention. We had little choice in the matter, though, because we had to deploy each service as its own Heroku application, limiting our opportunities for automation. We desperately needed to find an alternative that would permit us to automate deployments and give developers earlier access to reliable test environments.

So in addition to cutting costs by moving off Heroku, we also needed to clear the QA constraint. I otherwise had free rein in designing our AWS deployment so long as it ran all our existing services with minimal code changes, but I added several desiderata:

Simple system administration: I’d worked with tools like Chef before and wanted to avoid the error-prone process of frequently rebuilding systems from scratch. I wanted to update machines by logging into them and running commands.

Boring: I wanted to use “boring” technology known to work rather than try something new and deal with its issues. I wanted to concentrate our risk in our business logic, not in our infrastructure.

Zero downtime: Deploying on Heroku tended to cause our users to experience “blips” due to some user requests taking longer to run than Heroku allowed for connection draining. I wanted to be able to eliminate those blips.

Rollbacks: If something went wrong with a deploy I wanted to be able to back out of it and restore service with the last known working version.

Limited complexity: I was going to be the only person building and maintaining our infrastructure full-time, so I needed to scope the project to fit.

Knowing that Netflix managed to run its billion-dollar business on AWS with nothing fancier than Amazon machine images and autoscaling groups, I decided to follow their reliable but by no means “sexy” approach: build a machine image, use it to create instances in autoscaling groups, put those behind elastic load balancers, and connect the load balancers to DNS records that would make them accessible to our customers and each other.
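The chain described above (image, autoscaling group, load balancer, DNS record) can be sketched as a small in-memory model. This is only an illustration of how the pieces relate, not our deployment code and not the AWS API: every class, field, and hostname below is a hypothetical name invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class MachineImage:
    """A baked machine image (hypothetical stand-in for an Amazon machine image)."""
    image_id: str
    app_version: str


@dataclass
class AutoscalingGroup:
    """A group of identical instances, all launched from one image."""
    name: str
    image: MachineImage
    desired_capacity: int

    def instances(self) -> List[str]:
        # Every instance in the group is booted from the same image,
        # so the instances differ only by their index.
        return [f"{self.name}-{i}" for i in range(self.desired_capacity)]


@dataclass
class LoadBalancer:
    """Spreads traffic across the instances of its attached groups."""
    hostname: str
    groups: List[AutoscalingGroup] = field(default_factory=list)

    def backends(self) -> List[str]:
        return [inst for group in self.groups for inst in group.instances()]


def resolve(dns_records: Dict[str, LoadBalancer], name: str) -> List[str]:
    """Follow a DNS name to its load balancer and on to the instances behind it."""
    return dns_records[name].backends()


# Wire the pieces together in the order described in the text.
image = MachineImage(image_id="image-0001", app_version="v1")
group = AutoscalingGroup(name="web", image=image, desired_capacity=3)
balancer = LoadBalancer(hostname="web-lb.internal.example", groups=[group])
dns_records = {"app.example.com": balancer}

print(resolve(dns_records, "app.example.com"))  # ['web-0', 'web-1', 'web-2']
```

The appeal of this layout is that each layer only knows about the one beneath it, so a new version can in principle be released by building a new image and pointing a group at it without touching DNS.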

Thus I set out to build our AWS deployment strategy.

Becoming an AWS Sumo

When I’m engineering a system, I like to spend a lot of time up front thinking things through and testing assumptions before committing to a design. Rich Hickey calls this hammock driven development.

Our office doesn’t have a hammock, so I used our Sumo lounger instead.

Over the course of a couple of months in the Spring of 2016 I thought and thought and put together the foundations of our AWS deployment system. Its architecture looks something like this: