Netflix: Run Consistency Checkers All the time to Fixup Transactions
Wednesday, April 6, 2011 at 9:09AM
Todd Hoff in Strategy, netflix

You might have consistency problems if you have: multiple datastores in multiple datacenters, without distributed transactions, and with the ability to alternately execute out of each datacenter;  syncing protocols that can fail or sync stale data; distributed clients that cache data and then write old back to the central store; a NoSQL database that doesn't have transactions between updates of multiple related key-value records; application level integrity checks; client driven optimistic locking.

Sounds a lot like many evolving, loosely coupled, autonomous, distributed systems these days. How do you solve these consistency problems? Siddharth "Sid" Anand of Netflix talks about how they solved theirs in his excellent presentation, NoSQL @ Netflix : Part 1, given to a packed crowd at a Cloud Computing Meetup

You might be inclined to say how silly it is to have these problems in the first place, but just hold on. See if you might share some of their problems, before getting all judgy:

How does Netlix handle these consistency issues?

Loosely coupled, autonomous. distributed systems are complex beasts that are eventually consistent by nature. Netflix is on the vanguard of innovation here. They have extreme scale, they are transitioning into a cloud system, and they have multiple independent devices that must cooperate. It's great for them to have shared their experiences and how they are tacking their problems with us.

The Problem of Time in Autonomous Systems

One thing this article has brought up for me again is how we have punted on the problem of time. It's too hard to keep clocks in sync so we don't even bother. Vector clocks are the standard technique of deciding which version of data to keep, but in an open, distributed, autonomous system, not all nodes can or will want to participate in this vector clock paradigm.

We actually do have an independent measure that can be used to put an order on events. It's called time. What any device can do is put a very high precision timestamp on data. Maybe it's time to tackle the problem of time again?

Article originally appeared on (http://highscalability.com/).
See website for complete article licensing information.