Thursday, August 12, 2010

Think of Latency as a Pseudo-permanent Network Partition

The title of this post is a quote from Ilya Grigorik's post Weak Consistency and CAP Implications. Besides the article being excellent, I thought this idea had something to add to the great NoSQL versus RDBMS debate, in which Mike Stonebraker argues that network partitions are rare, so designing eventually consistent systems for such rare occurrences is not worth losing ACID semantics over. Even if network partitions are rare, latency between datacenters is not rare, so the game is still on.

The rare-partition argument seems to flow from a centralized-distributed view of systems. Such systems are scale-out in that they grow by adding distributed nodes, but the nodes generally do not cross datacenter boundaries. The assumption is that the network is fast enough that distributed operations are roughly homogeneous between nodes.

In a fully distributed system the nodes can be dispersed across datacenters, which gives operations a widely variable performance profile. Because everything talks over TCP/IP, communication between servers is exactly the same; what changes are the characteristics of the channel, like latency, bandwidth, and reliability. What also changes are the locations of resources like disk storage. A datacenter in Europe, for example, wants to store data locally rather than ship it synchronously across the pond to a US datacenter. Fully distributed systems are very complex because they must manage replication, availability, transactions, fail-over, etc. in this very dynamic environment.

The notion is that latency is a kind of partition that requires eventual consistency to mask. The roundtrip time for a packet within a datacenter, for example, might be 0.5 ms, while the roundtrip time from California to Europe and back to California might be 150 ms. So roughly 300 times more messages can be exchanged within a datacenter than between datacenters. This implies that reads from different datacenters are unlikely to be consistent.
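The arithmetic behind that 300x figure is simple enough to sketch; the round-trip times below are the post's example numbers, not measurements:

```python
# Message-budget arithmetic using the example round-trip times above.
rtt_intra_dc = 0.0005   # 0.5 ms within a datacenter
rtt_inter_dc = 0.150    # 150 ms California -> Europe -> California

# Sequential round trips possible per second over each path
trips_intra = 1 / rtt_intra_dc   # 2000 round trips per second
trips_inter = 1 / rtt_inter_dc   # roughly 6.7 round trips per second

ratio = round(rtt_inter_dc / rtt_intra_dc)
print(ratio)  # 300
```

Any protocol that needs several sequential round trips per operation (coordination, locking, commit) pays this ratio on every one of them.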

Imagine a write coursing down two network paths. The first path takes 0.5 ms and the second takes 150 ms. Without a strong consistency guarantee, as with a potentially performance-, scaling-, and availability-killing two-phase commit, a read will be inconsistent for those latency windows. Ilya writes:
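A toy sketch of that inconsistency window, with asynchronous replication and a `sleep` standing in for the slow path (the dictionaries and delays are illustrative, not any particular system's design):

```python
import threading
import time

# Two replicas of the same key space: one near, one across the slow path.
local = {}
remote = {}

def replicate(key, value, delay):
    """Apply the write to the far replica after `delay` seconds."""
    time.sleep(delay)
    remote[key] = value

def write(key, value, delay=0.150):
    """Write locally right away; ship to the far replica asynchronously."""
    local[key] = value
    threading.Thread(target=replicate, args=(key, value, delay)).start()

write("x", 1)
print(local.get("x"), remote.get("x"))  # 1 None -> stale read in the window
time.sleep(0.2)
print(local.get("x"), remote.get("x"))  # 1 1 -> eventually consistent
```

During the 150 ms window a reader hitting the far replica sees the old value; closing that window synchronously is exactly what two-phase commit buys, at the cost of making every write wait on the slowest path.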

Interestingly enough, dealing with network partitions is not the only case for adopting “weak consistency”. The PNUTS system deployed at Yahoo must deal with WAN replication of data between different continents, and unfortunately, the speed of light imposes some strict latency limits on the performance of such a system. In Yahoo’s case, the communications latency is enough of a performance barrier such that their system is configured, by default, to operate under the “choose availability, under weak consistency” model - think of latency as a pseudo-permanent network partition.

Does this mean geographically dispersed systems by their very nature must be eventually consistent? It's at least interesting to think about as we slowly make our way towards truly distributed architectures.


Reader Comments (1)

"The rare-partition argument seems to flow from a centralized-distributed view of systems. Such systems are scale-out in that they grow by adding distributed nodes, but the nodes generally do not cross datacenter boundaries. The assumption is the network is fast enough that distributed operations are roughly homogenous between nodes."

The Fallacies of Distributed Computing (http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing)

(1) The network is reliable.
(2) Latency is zero.
(3) Bandwidth is infinite.
(4) The network is secure.
(5) Topology doesn't change.
(6) There is one administrator.
(7) Transport cost is zero.
(8) The network is homogeneous.

Those with "a centralized-distributed view of systems" consistently forget these factors which does indeed lead to assumptions like:

"the network is fast enough that distributed operations are roughly homogenous between nodes."

Even in a single datacentre there is often a substantial difference between inter- and intra-rack performance (latency and bandwidth) for any decent-sized system, which renders this assumption false. This is a side-effect of the dominant approach to datacentre architecture, which has a specific cost/performance tradeoff. The VL2 paper (http://research.microsoft.com/pubs/80693/vl2-sigcomm09-final.pdf) covers this quite well:

"Limited server-to-server capacity: As we go up the hierarchy, we are confronted with steep technical and financial barriers in sustaining high bandwidth. Thus, as traffic moves up through the layers of switches and routers, the over-subscription ratio increases rapidly. For example, servers typically have 1:1 over-subscription to other servers in the same rack — that is, they can communicate at the full rate of their interfaces (e.g., 1 Gbps). We found that up-links from ToRs are typically 1:5 to 1:20 oversubscribed (i.e., 1 to 4 Gbps of up-link for 20 servers), and paths through the highest layer of the tree can be 1:240 oversubscribed. This large over-subscription factor fragments the server pool by preventing idle servers from being assigned to overloaded services, and it severely limits the entire data-center’s performance."

August 14, 2010 | Unregistered CommenterDan Creswell
