Making the Case for Building Scalable Stateful Services in the Modern Era
Monday, October 12, 2015 at 8:56AM
Todd Hoff in Example

For a long time now stateless services have been the royal road to scalability. Nearly every treatise on scalability declares statelessness as the best practices approved method for building scalable systems. A stateless architecture is easy to scale horizontally and only requires simple round-robin load balancing.

What’s not to love? Perhaps the increased latency from the roundtrips to the database. Or maybe the complexity of the caching layer required to hide database latency problems. Or even the troublesome consistency issues.

But what of stateful services? Isn’t preserving identity by shipping functions to data instead of shipping data to functions a better approach? It often is, but we don’t hear much about how to build stateful services. In fact, do a search and there’s very little in the way of a systematic approach to building stateful services. Wikipedia doesn’t even have an entry for stateful service.

Caitie McCaffrey, Tech Lead for Observability at Twitter, is fixing all that with a refreshing talk she gave at the Strange Loop conference on Building Scalable Stateful Services (slides).

Refreshing because I’ve never quite heard of building stateful services in the way Caitie talks about building them. You’ll recognize most of the ideas--Sticky Sessions, Data Shipping Paradigm, Function Shipping Paradigm, Data Locality, CAP, Cluster Membership, Gossip Protocols, Consistent Hashing, DHT---but she weaves them around the theme of building stateful services in a most compelling way.

The highlight of the talk for me is when Caitie ties the whole talk together around the discussion of her experiences developing Halo 4 using Microsoft’s Orleans on top of Azure. Orleans doesn’t get enough coverage. It’s based on an inherently stateful distributed virtual Actor model; a highly available Gossip Protocol is used for cluster membership; and a two tier system of Consistent Hashing plus a Distributed Hash Table is used for work distribution. With this approach Orleans can rebalance a cluster when a node fails, or capacity is added/contracted, or a node becomes hot. The result is Halo was able to run a stateful Orleans cluster in production at 90-95% CPU utilization across the cluster.

Orleans isn't the only example system covered. Facebook's Scuba and Uber's Ringpop are also analyzed using Caitie's stateful architecture framework. There's also a very interesting section on how Facebook cleverly implements fast database restarts for large in-memory databases by decoupling the memory lifetime from the process lifetime.

So let’s jump in and learn how to build stateful services...

Stateless Services are Wasteful

Stateful Services are Easier to Program

Building Sticky Connections

Cluster Membership

Static Cluster Membership - The Simplest Approach

Dynamic Cluster Membership

Gossip Protocols - Emphasise Availability

Consensus Systems - Emphasise Consistency

Work Distribution

Random Placement

Consistent Hashing - Deterministic Placement

Distributed Hash Table (DHT) - Non Deterministic placement

Three Example Stateful Services in the Real World

Facebook’s Scuba - Random Fanouts on Writes

Uber’s Ringpop - Gossip Protocol + Consistent Hashing

Microsoft’s Orleans - Gossip Protocol + Consistent Hashing + Distributed Hash Table

What Can Go Wrong

Unbounded Data Structures

Memory Management

Reloading State

First Connection

Recovering From Crashes

Deploying New Code

Fast Database Restarts at Facebook

Wrapping Up

Related Articles

Article originally appeared on (
See website for complete article licensing information.