Paper: Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web

Consistent hashing is one of those ideas that really puts the science in computer science and reminds us why all those really smart people spend years slaving over algorithms. Consistent hashing is "a scheme that provides hash table functionality in a way that the addition or removal of one slot does not significantly change the mapping of keys to slots," and it was originally a way of distributing requests among a changing population of web servers. My first reaction to the idea was "wow, that's really smart," and I sadly realized I would never come up with something so elegant. I then immediately saw applications for it everywhere, and consistent hashing really is used everywhere: distributed hash tables, overlay networks, P2P, IM, caching, and CDNs. Here's the abstract from the original paper; after it are links to a few very good articles with accessible explanations of consistent hashing and its applications in the real world.
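
The idea is easy to picture as a ring: both servers and keys are hashed onto the same circular space, and each key belongs to the first server clockwise from it. Here is a minimal, self-contained sketch of that construction (the class name, the MD5-derived hash, and the single point per server are my illustrative choices, not the paper's):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.SortedMap;
import java.util.TreeMap;

/** A minimal consistent hash ring: servers and keys are hashed onto the same
 *  circular space, and each key is owned by the first server clockwise. */
public class ConsistentHashRing {
    private final SortedMap<Long, String> ring = new TreeMap<>();

    public void addNode(String node) {
        ring.put(hash(node), node);
    }

    public void removeNode(String node) {
        ring.remove(hash(node));
    }

    /** The first server at or after the key's position, wrapping to the start. */
    public String getNode(String key) {
        if (ring.isEmpty()) return null;
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    // Map a string to a point on the ring using the first 8 bytes of its MD5 digest.
    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xFF);
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e); // every JVM ships MD5
        }
    }
}
```

Removing a server only orphans the keys on its own arc, which fall through to the next server clockwise; everything else stays put.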


Abstract:
We describe a family of caching protocols for distributed networks that can be used to decrease or eliminate the occurrence of hot spots in the network. Our protocols are particularly designed for use with very large networks such as the Internet, where delays caused by hot spots can be severe, and where it is not feasible for every server to have complete information about the current state of the entire network. The protocols are easy to implement using existing network protocols such as TCP/IP, and require very little overhead. The protocols work with local control, make efficient use of existing resources, and scale gracefully as the network grows. Our caching protocols are based on a special kind of hashing that we call consistent hashing. Roughly speaking, a consistent hash function is one which changes minimally as the range of the function changes. Through the development of good consistent hash functions, we are able to develop caching protocols which do not require users to have a current or even consistent view of the network. We believe that consistent hash functions may eventually prove to be useful in other applications such as distributed name servers and/or quorum systems.
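
To watch the abstract's "changes minimally as the range of the function changes" claim in action, here is a small demo against the hypothetical ring class sketched above; the percentages in the comments assume a reasonably uniform hash:

```java
/** Hash 100,000 keys across four caches, add a fifth, and count how many
 *  keys change owners. Uses the hypothetical ConsistentHashRing above. */
public class MinimalRemappingDemo {
    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing();
        for (int i = 0; i < 4; i++) ring.addNode("cache" + i);

        int keys = 100_000;
        String[] before = new String[keys];
        for (int k = 0; k < keys; k++) before[k] = ring.getNode("key" + k);

        ring.addNode("cache4"); // the "range of the function" just changed

        int moved = 0;
        for (int k = 0; k < keys; k++)
            if (!ring.getNode("key" + k).equals(before[k])) moved++;

        // With a uniform hash, the new cache's arc covers about 1/5 of the ring
        // in expectation, so roughly 20% of keys move -- all of them *to* the
        // new cache, none between the old ones. Naive hash(key) % n would remap
        // about 80% of the keys. (With one point per server the arcs are quite
        // uneven, so the actual fraction can land well away from 20%; virtual
        // nodes, sketched at the end of this post, smooth this out.)
        System.out.printf("%d of %d keys moved (%.1f%%)%n", moved, keys, 100.0 * moved / keys);
    }
}
```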


Other excellent resources for learning more about consistent hashing:

  • Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web
  • Consistent Hashing by Tom White. A good explanation along with an actual Java implementation.
  • Programmer’s Toolbox Part 3: Consistent Hashing by Tom Kleinpeter. Another good explanation, with an emphasis on useful applications: load distribution on failure, load tuning by capacity, a method for bringing servers online, and redundant caching to protect the database in case of failure (a weighted-ring sketch of the capacity-tuning idea follows this list).
  • Distributed Hash Tables: an infrastructure that can be used to build more complex services, such as distributed file systems, peer-to-peer file sharing and content distribution systems, cooperative web caching, multicast, anycast, domain name services, and instant messaging. Notable distributed networks that use DHTs include BitTorrent (with extensions), the eDonkey network, YaCy, and the Coral Content Distribution Network.
  • Chord - a peer-to-peer lookup algorithm. It allows a distributed set of participants to agree on a single node as a rendezvous point for a given key, without any central coordination.
  • Dynamo, Amazon's highly available key-value store, uses consistent hashing.
  • Replication Under Scalable Hashing: A Family of Algorithms for Scalable Decentralized Data Distribution
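
Kleinpeter's "load tuning by capacity" point is worth making concrete (the sketch promised above). One common approach, and my illustrative code rather than anything from his article, is to give each server multiple virtual points on the ring in proportion to its capacity, so a box with twice the capacity owns roughly twice the arc:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.SortedMap;
import java.util.TreeMap;

/** A consistent hash ring with virtual nodes: each server is hashed onto the
 *  ring at many points, in proportion to a caller-supplied weight, so a
 *  higher-capacity server receives proportionally more keys. */
public class WeightedHashRing {
    private final SortedMap<Long, String> ring = new TreeMap<>();

    /** Add a server with `weight` virtual points (e.g. 100 per unit of capacity). */
    public void addNode(String node, int weight) {
        for (int i = 0; i < weight; i++)
            ring.put(hash(node + "#" + i), node);
    }

    /** Remove all of a server's virtual points; `weight` must match addNode's. */
    public void removeNode(String node, int weight) {
        for (int i = 0; i < weight; i++)
            ring.remove(hash(node + "#" + i));
    }

    /** The first virtual point at or after the key's hash, wrapping around. */
    public String getNode(String key) {
        if (ring.isEmpty()) return null;
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    // Same 8-byte MD5 prefix hash as the earlier sketch.
    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xFF);
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e); // every JVM ships MD5
        }
    }
}
```

Virtual nodes pay off even with equal weights: they average away the uneven arc sizes a single-point-per-server ring suffers from. Dynamo's ring takes the same approach, assigning each physical host many "tokens" on the circle.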