This is a guest post by Michael Schenck, SRE Staff Engineer at Tumblr.
At Tumblr, blogs (or Tumblelog) are one of our most highly trafficked faces on the internet. One of the most convenient aspects of tumblelogs is their highly cacheable nature, which is fantastic because of the high views/post ratio the Tumblr network offers our users. That said, it's not entirely trivial to scale out the perimeter proxy tier, let alone the caching tier, necessary for serving all of those requests.
This article describes the architecture of the portion of our perimeter responsible for blogs serving, one of our more highly trafficked perimeter end-points.
Here's how we do it.
278 employees total, 6 on the team responsible for all of Tumblr's perimeter (Perimeter-SRE), including one manager.
Over 2800 servers, less than 20% are used for blog serving functionality
Over 23,000 blog requests per second (at peak)
Over 6,500 blog cache purge events per second (at peak)
Over 196 million blogs
Over 93 billion posts
Where we were - map-based partitioning
In the early days, we only needed one active and one standby proxy server as well as varnish node. Both were very easy to manage, monitor, and make highly available.
However, being inline to all user requests, capacity limits will be reached and next steps (even if not the ideal deployment) will need to be ready before you wind up with downtime due to popularity.
Outgrowing a single proxy node
Outgrowing a single active proxy server is pretty common and often involves DNS. Something as basic as a round-robin A record might meet your needs, but often times it's worth the extra money to go for a health-checking GSLB configuration (even if you only have one geographic location).
The downside to DNS is that, while nameservers will respond with each IP at a fairly even rate, there is no guarantee that each lookup will be used for the same number of requests. User A might make 10 requests to a resolved IP in a single minute, where bot B might make 100 requests in the same minute. If you have two IPs, user A gets one and bot B gets the other and they're the only two clients making requests, then one of your proxies will have 10X the request rate of the other.
This effect can be mitigated with lower TTLs, such that a 30 second TTL can balance those 110 requests between the two proxies. For the first 30 seconds user A will go to proxy P1, and bot B will go to proxy P2. Then, their next resolution might swap the IPs, so that user A will send its requests to proxy P2 and bot B will send its requests to proxy P1. At the end of that minute window, each proxy would have handled roughly 60 requests. The downside to lower TTLs is more lookups, thus higher DNS costs (but DNS is typically one of your less expensive third-party services).
Outgrowing a single varnish node
While DNS can buy you a lot of time with growing your proxy tier, scaling varnish is a little more complex. Even if your capacity limitation with a single varnish node revolves around request concurrency, simply adding two varnish nodes in round-robin is not what you want to do; this reduces cache-hit ratio, makes purging more resource/time intensive, and doesn't actually increase the working size of your cache (only duplicates it).
The simplest iteration to outgrowing a single varnish node is static partitioning. This involves determining your unique identifier, and to split this space between two varnish nodes.
For Tumblelogs, this is the hostname of the blog. Since DNS is case-insensitive, you only have to worry about 40 characters; alphanumerics (a-z and 0-9) and four allowed characters (- _ . ~). So for two varnish nodes, blogs hostnames are split (on the first character) between these two cache nodes.
Evenly distributed partitioning - through consistent hashing
Both of the previous examples (DNS round-robin and static partitioning) are steps in the right direction, but provide very coarse granularity in partitioning. At a small enough scale, this granularity isn't necessarily problematic, but as your traffic starts to grow the variance becomes more significant. As a result, reducing the variance in traffic handled by your least-hot and most-hot nodes becomes more-and-more important. This is where consistent hashing can really shine.
Partitioning proxy traffic
If your servers are in an environment where you can influence the routing tables of the router(s) in-front of your servers, the routers between your users and your proxy servers, then you can take advantage of equal-cost multi-path routing (ECMP). ECMP affords you the ability to treat your proxies as slices in a consistent hash ring, then map requesters across these slices.
This is accomplished by informing the routing infrastructure of multiple paths (proxy servers) to a particular destination IP (a highly available IP). ECMP will then hash the request source in order to determine which proxy should receive the packets for this request session. Typical ECMP implementations offer Layer 3 (IP-only) and Layer 3+4 (IP:port) hashing options. Layer 3 means that all requests from a particular IP will go to a particular proxy, which can be helpful for debugging but is imbalanced with large networks using a single NAT IP. Layer 3+4 typically provides the best distribution, but debugging particular clients becomes more challenging.
There are a number of ways to inform the router(s) of multiple paths, but I suggest using OSPF or iBGP for dynamic route advertisements. One need only listen to the highly-available IP on a loopback interface, enable internal routing, and advertise one's own IP as a next-hop to the highly available IP. We found that BIRD provides a light-weight and reliable means for performing route advertisements from our proxies.
Partitioning varnish traffic
Tumblelogs are identified by their fully qualified domain name (FQDN), such that all URI paths of a blog will always be found under that blogs FQDN. The majority of Tumblelogs are sub-domains of tumblr.com, such as engineering.tumblr.com, however we also support users bringing their own custom domain names.
When considering this variety of FQDNs, it's clear that the TLD will have the least number of variations, then domain names (particularly due to the vast majority being tumblr.com), then sub-domains. So our most-significant bits appear at the leftmost positions of a variable-length string.
Understanding the problem domain
perfect - demonstrates if the hashing function was perfect when applied to our test dataset
consistent_hdr - consistent hashing on Host header (best real-world results)
consistent_hdr_use_domain_only - consistent hashing on base domain name (i.e. tumblr.com or foo.net), only two camps tumblr.com and all-others
mapbased_firstchar - mapping Host header's first character to varnish node (our original static partitioning implementation)
mapbased_hdr - map based on Host header
While consistent hashing was the clear front-runner for most-even distribution of tumblelog FQDNs, we then went on to determine if the hash function was appropriate. HAProxy uses the SDBM hashing function by default. However, after further investigation, comparing SDBM, CRC, MD5, DJB2, and so on, we determined that DJB2 offered even better distribution. This resulting in our submitting a patch to HAProxy to make the hash function configurable (see "Thanks" section for more information).
Comparing static partitioning to consistent hashing
This shows the change in variance between requests per second of each varnish node, before and after moving to the best-fit hash function.
In either model, node growth will mean keyspace shift, thus cache invalidation. In the consistent hashing model, it's much easier to predict the percent of keys that will be invalidated; essentially 1/N (N is the number of cache nodes prior to the new node being added). With the static partitioning model, unless you do analysis on the requests going to the node(s) you will be taking keyspace from, you're left with the worst case of less-than or equal-to the total percent of keys on the nodes you're taking keyspace from.
With static partitioning, a single node failure will result in 1/N keys being inaccessible unless you provide a fail-over option. HAProxy does allow you to have a standby node, but now you have a decision to make; do you have 2N cache nodes (one active, one standby) for each key space, or shared standby node(s). One extreme is a waste of 50% of your hardware, where the other end of the spectrum (1 standby node shared between all active nodes) means that two failed nodes results in the standby support 2X the keyspace of the other active nodes.
With consistent hashing, node failures are handled automatically. When one node is removed, then 1/N keys are shifted (resulting in 1/N keys to be invalidated) and an even rise in keyspace per remaining active node.
Sending purge requests to a single varnish node is easy, but purging from multiple varnish nodes should be just as easy. Instead of having to keep the proxies and purgers in sync, it's easiest to just send all purge requests through the same proxies.
It's important to reject purge attempts from non-local IP space, as to prevent any malicious bulk purging.
You know answers to questions you haven't asked yet. When facing a scaling challenge, don't over look patterns you're already using elsewhere.
Scale through simplicity. Adding complexity to overcome scalability challenges may work in the short run, but will eventually catch up with you.
Know your hash function. The hash function you use is just as important as deciding what to do with the hashes.
Degrade, not fail. It's advisable to have your proxies monitor their own ability to reach their backends. If they cannot, don't stop advertising routes, just advertise non-preferential routes (such as a higher path cost, with OSPF). This way, if all backends become unhealthy, you can still serve error pages, instead of becoming unreachable.