Netflix: Run Consistency Checkers All the time to Fixup Transactions

You might have consistency problems if you have: multiple datastores in multiple datacenters, without distributed transactions, and with the ability to alternately execute out of each datacenter;  syncing protocols that can fail or sync stale data; distributed clients that cache data and then write old back to the central store; a NoSQL database that doesn't have transactions between updates of multiple related key-value records; application level integrity checks; client driven optimistic locking.

Sounds a lot like many evolving, loosely coupled, autonomous, distributed systems these days. How do you solve these consistency problems? Siddharth "Sid" Anand of Netflix talks about how they solved theirs in his excellent presentation, NoSQL @ Netflix : Part 1, given to a packed crowd at a Cloud Computing Meetup.

You might be inclined to say how silly it is to have these problems in the first place, but just hold on. See if you might share some of their problems, before getting all judgy:

  • Netflix is in the process of moving an existing Oracle database from their own datacenter into the Amazon cloud. As part of that process the webscale data, the data that is proportional to user traffic and thus needs to scale, has been put in NoSQL databases like SimpleDB and Cassandra, in the cloud. Complex relational data still resides in the Oracle database. So they have a bidirectional sync problem. Data must sync between the two systems that reside in different datacenters. So there are all the usual latency, failure and timing problems that can lead to inconsistencies. Eventual consistency is the rule here. Since they are dealing with movie data and not financial data, the world won't end.
  • Users using multiple devices also leads to consistency issues. This is a surprising problem and one we may see more of as people flow seamlessly between various devices and expect their state to flow with them. If you are watching a video on your Xbox, pause it, then start watching the video on your desktop, should the movie start from where you last paused it? Ideally yes, and that's what Netflix tries to do. But think for a moment all the problems that can occur. All these distributed systems are not in a transactional context. The movie save points are all stored centrally, but since each device operates independently, they can be inconsistent. Independent systems usually cache data which means they can write stale data back to the central database which then propagates to the other devices. Or changes can happen from multiple devices at the same time. Or changes can be lost and a device will have old data. Or a device can contact a different datacenter than another device and get a different value. For a user this appears like Netflix can't pause and resume correctly, which is truish, but it's more tricky than that. The world never stands still long enough to get the right answer.
  • One of the features NoSQL databases have dropped is integrity checking. So all constraints must be implemented at the application layer. Applications can mess up and that can leave your data inconsistent.
  • Anther feature of NoSQL databases is they generally don't have transactions on multiple records. So if you update one record that has a key pointing to another record that will be updated in a different transaction, those can get out of sync.
  • Syncing protocols are subject to the same failures of any other programs. Machines can go down, packets can be dropped or reordered. When that happens your data might be out of sync.

How does Netlix handle these consistency issues?

  • Optimistic locking. One approach Netflix uses to address the consistency issues is optimistic locking. Every write has a timestamp. The newest time stamp wins, which may or may not always be correct, but given the types of data involved, it's usually good enough. NTP is used to keep all the nodes timed synced.
  • Consistency checkers. The heavy hitter strategy they use to bring their system back into a consistent state are consistency checkers that run continuously in the background. I've used this approach very effectively in situations where events that are used to synchronize state can be dropped. So what you do is build applications that are in charge of reaching out to the various parts of the system and making them agree. They make the world stand still long enough to come to an agreement on some periodic basis. Notice they are not trying to be accurate, the ability to be accurate has been lost, but what's important is that all the different systems agree. If, for example, a user moves a movie from the second position on their queue to the third position on one device and that change hasn't propagated correctly, what matters is that all the systems eventually agree on where the item is in the queue, it doesn't matter so much what the real position is as long as they agree and the user sees a consistent experience. Dangling references can be checked an repaired. Data transforms from failed upgrades can be fixed. Any problems from devices that use different protocol can be fixed. Any order that was approved that may not now have the inventory must be addressed. Any possible incosistency must be coded for, checked, and compensated for. Another cool feature that can be tucked into consistency checkers are aggregation operations, like calculating counts, leader boards, totals, that sort of thing.

Loosely coupled, autonomous. distributed systems are complex beasts that are eventually consistent by nature. Netflix is on the vanguard of innovation here. They have extreme scale, they are transitioning into a cloud system, and they have multiple independent devices that must cooperate. It's great for them to have shared their experiences and how they are tacking their problems with us.

The Problem of Time in Autonomous Systems

One thing this article has brought up for me again is how we have punted on the problem of time. It's too hard to keep clocks in sync so we don't even bother. Vector clocks are the standard technique of deciding which version of data to keep, but in an open, distributed, autonomous system, not all nodes can or will want to participate in this vector clock paradigm.

We actually do have an independent measure that can be used to put an order on events. It's called time. What any device can do is put a very high precision timestamp on data. Maybe it's time to tackle the problem of time again?