
ButterCMS Architecture: a Mission-Critical API Serving Millions of Requests per Month

High Scalability

Oct 16, 2017 — 5 min read

This is a guest post by Jake Lumetta, co-founder and CEO of ButterCMS.

ButterCMS lets developers add a content management system to any website in minutes. Our business requires us to deliver near-100% uptime for our API, but after multiple outages that nearly crippled our business, we became obsessed with eliminating single points of failure. In this post, I’ll discuss how we use Fastly’s edge cloud platform and other strategies to make sure we keep our customers’ websites up and running.

At its core, ButterCMS offers:

A dashboard for content editors

A JSON API for fetching content

SDKs for integrating ButterCMS into native code

ButterCMS Tech Stack

ButterCMS is a single monolithic Django application that is responsible for the marketing website, editing tools, API, and backoffice tools for customer support. The Django application runs on Heroku along with a Postgres database.

We also use the following third-party services:

Filestack for providing our customers with image editing

Fastly for external API caching and delivery

Cloudfront as a CDN for customers’ assets

EasyDNS for DNS

Downtime is Fatal

Our customers typically build websites that make an API request to ButterCMS for page content during their request/response lifecycle. This means that if their API request to ButterCMS fails, their page likely won’t render. If our API goes down, our customers’ websites go down with us.

This is a lesson we learned the hard way in our early days. Unreliable server hosting led to frequent intermittent outages and performance degradations that frustrated customers. A botched DNS migration led to hours of API downtime that took down dozens of customers’ websites for nearly half a day and left a large number of customers questioning whether they could continue relying on us (a handful of them left us).

After this incident, we recognized that ensuring near-100% uptime was an existential issue. A significant outage in the future could lead to us losing hard-earned customers and put our business in crisis.

Delivering a Global, Fast, Resilient API

Avoiding failure completely is not possible; you can only do your best to reduce the chances of it happening.

For example, “controlling your own fate” by running your own physical servers protects you against your hosting provider going down, but puts you in the position of having to handle security and scalability, both of which can easily take you down and be difficult to recover from.

For our team, keeping our API up at all times and making sure it delivered high performance across the globe was crucial. But as a smaller company, we knew we didn’t have the resources to deliver global, highly scalable performance with near-100% uptime. So we turned to someone that did: Fastly.

Fastly describes itself as an “edge cloud platform” that “powers fast, secure, and scalable digital experiences for the world’s most popular businesses”. They work with large customers including the New York Times, BuzzFeed, Pinterest, and New Relic. We put Fastly in front of our API as a cache layer so that all API requests are served via their CDN.

When one of our customers updates their website content in ButterCMS, we invalidate the cached API responses for the specific pieces of content that were edited. Non-cached requests hit our servers, but we have a 94% cache hit rate because content on our customers’ websites changes infrequently relative to the number of visitors they get. This means that even if our database or servers experience intermittent outages, our API remains up for cached content. We wouldn’t want this, but theoretically, our servers could go down completely for several hours and our customers’ websites would stay up so long as Fastly was.
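The caching behavior described above can be illustrated with a small in-memory model. This is a minimal sketch, not our production setup or Fastly's API: the `EdgeCache` class, its `purge` method, and the idea of tagging each response with content keys are assumptions made for illustration. It shows how responses keep being served from cache until the specific content behind them is edited and purged.

```python
from typing import Callable, Dict, Iterable, Set, Tuple

# An origin takes a request path and returns (response body, content keys).
Origin = Callable[[str], Tuple[str, Iterable[str]]]


class EdgeCache:
    """Toy stand-in for a CDN cache sitting in front of an origin API."""

    def __init__(self, origin: Origin) -> None:
        self._origin = origin
        self._responses: Dict[str, str] = {}
        self._paths_by_key: Dict[str, Set[str]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, path: str) -> str:
        """Serve from cache when possible; otherwise ask the origin."""
        if path in self._responses:
            self.hits += 1
            return self._responses[path]
        self.misses += 1
        body, keys = self._origin(path)
        self._responses[path] = body
        # Remember which content keys this response depends on.
        for key in keys:
            self._paths_by_key.setdefault(key, set()).add(path)
        return body

    def purge(self, key: str) -> None:
        """Drop every cached response tagged with the given content key."""
        for path in self._paths_by_key.pop(key, set()):
            self._responses.pop(path, None)

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

In this model, a request for an already cached path never touches the origin, so the origin can be unavailable without affecting those requests. Only the first request after a purge needs the origin to be up.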

Fastly’s global CDN offers another benefit to us. Many of our customers have static JavaScript websites where API requests are made from their visitors’ browsers rather than their servers. Serving API responses via Fastly’s CDN means that our customers’ website visitors get fast load times wherever they’re located.

Eliminating Single Points of Failure

During the early days of ButterCMS, we dealt with two separate DNS incidents that left us scarred. In the first incident, our DNS provider at the time accidentally “cancelled” our account from their system, leading to an outage that took nearly 6 hours for us to fully recover from. Our second incident occurred when routine DNS editing led to a malfunction by our [different] DNS provider, and took nearly half a day to resolve. DNS incidents are particularly damaging because even after an issue is identified and fixed, you have to wait for various DNS servers and ISPs to clear their caches before customers see the fix on their end (DNS servers also tend to ignore your TTL setting and impose their own policy).

Our experiences made us extremely focused on eliminating any single point of failure across our architecture.

For DNS, we switched to using multiple nameservers from different DNS providers. DNS providers often allow and encourage you to use 4-6 redundant nameservers (e.g., ns1.example.com, ns2.example.com). This is great: if one fails, requests will still be resolved by the others. But when all of your nameservers come from a single company, you’re putting a lot of faith in that company having 100% uptime, which is why we spread ours across providers.
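One way to keep an eye on this is to check that a domain's nameserver list actually spans more than one provider. The sketch below is an illustration only: the function names and the hostnames in the usage note are hypothetical, and grouping by the last two DNS labels is a naive heuristic that will misjudge providers under suffixes such as co.uk.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_provider(nameservers: Iterable[str]) -> Dict[str, List[str]]:
    """Group nameserver hostnames by their last two DNS labels."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for host in nameservers:
        # Normalize case and drop the optional trailing root dot.
        labels = host.rstrip(".").lower().split(".")
        groups[".".join(labels[-2:])].append(host)
    return dict(groups)


def has_provider_redundancy(nameservers: Iterable[str], minimum: int = 2) -> bool:
    """Return True when the nameservers span at least `minimum` providers."""
    return len(group_by_provider(nameservers)) >= minimum
```

For example, `has_provider_redundancy(["ns1.dns-a.example", "ns2.dns-a.example"])` is false because both nameservers belong to the same provider, while adding `"ns1.dns-b.example"` to the list makes it true.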

For our application servers, we use Heroku’s monitoring and auto-scaling tools to make sure our performance doesn’t degrade from spikes in traffic (or if Fastly goes down and we suddenly need to route all requests directly to our servers). In addition to caching our API with Fastly, we also cache our API at the application level using Memcached. This provides an additional layer of buffer against intermittent database or server failure.

To protect against the rare possibility of a total outage across Heroku or AWS (which Heroku runs on), we maintain a separate server and database instance running on Google Cloud that we can fail over to quickly.

Failure is Inevitable

No matter how reliable our API is, we have to accept that networks are unreliable and failures are bound to occur. We’ve all experienced trouble connecting to Wi-Fi, or had a phone call drop on us abruptly. Outages, routing problems, and other intermittent failures may be statistically unusual on the whole, but they are still bound to be happening all the time at some ambient background rate.

To overcome this sort of inherently unreliable environment, we help our customers build applications that will be robust in the event of failure. Our SDKs offer features such as automatically retrying when API requests fail, or support for easily using a failover cache such as Redis on the client.
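The retry and failover-cache pattern can be sketched roughly as follows. This is not the code from our SDKs: `ResilientClient`, its parameters, and the exponential backoff schedule are assumptions made for illustration, and a plain dictionary stands in for a client-side store such as Redis.

```python
import time
from typing import Callable, MutableMapping, Optional


class ResilientClient:
    """Retries failed requests, then falls back to the last good response."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        fallback_cache: Optional[MutableMapping[str, str]] = None,
        max_attempts: int = 3,
        base_delay: float = 0.1,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if max_attempts < 1:
            raise ValueError("max_attempts must be at least 1")
        self._fetch = fetch
        # A dict stands in for a failover cache such as Redis.
        self._cache = fallback_cache if fallback_cache is not None else {}
        self._max_attempts = max_attempts
        self._base_delay = base_delay
        self._sleep = sleep

    def get(self, path: str) -> str:
        last_error: Optional[ConnectionError] = None
        for attempt in range(self._max_attempts):
            try:
                body = self._fetch(path)
            except ConnectionError as exc:
                last_error = exc
                # Back off exponentially between attempts, not after the last.
                if attempt < self._max_attempts - 1:
                    self._sleep(self._base_delay * 2 ** attempt)
            else:
                # Keep the latest good response for future failures.
                self._cache[path] = body
                return body
        if path in self._cache:
            return self._cache[path]
        assert last_error is not None
        raise last_error
```

With this shape, a page that has rendered successfully at least once can keep rendering from the failover cache during an outage, at the cost of possibly showing stale content. Only requests for content that was never fetched surface the error to the caller.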

Conclusion

Without realizing it, many of us are building single points of failure into our stack. At ButterCMS, our success depends on ensuring that our customers’ applications don’t ever go down because of us. We do this by eliminating as many single points of failure as possible from our back-end infrastructure, and providing SDKs that make it easy for our customers to achieve resiliency and fault-tolerance within their applications.

About ButterCMS

When you hear “CMS” or “blogging”, you probably think of WordPress. ButterCMS is a newer alternative that allows development teams to add CMS and blogging functionality into their own native codebases using the ButterCMS API.

ButterCMS was started by Jake Lumetta and Abi Noda because both of them had encountered the challenge of finding alternatives to WordPress that were fully featured, flexible, and didn’t bind you to a specific programming language like PHP.

Today, ButterCMS powers hundreds of websites across the world, helping serve millions of requests per month.
