Henry Robinson has created an excellent series of articles on consensus protocols. Henry starts with a very useful discussion of what all this talk about consensus really means: The consensus problem is the problem of getting a set of nodes in a distributed system to agree on something - it might be a value, a course of action or a decision. Achieving consensus allows a distributed system to act as a single entity, with every individual node aware of and in agreement with the actions of the whole of the network. In this article Henry tackles Two-Phase Commit, the protocol most databases use to arrive at a consensus for database writes. The article is very well written with lots of pretty and informative pictures. He did a really good job. In conclusion we learn 2PC is very efficient, a minimal number of messages are exchanged and latency is low. The problem is when a co-ordinator fails availability is dramatically reduced. This is why 2PC isn't generally used on highly distributed systems. To solve that problem we have to move on to different algorithms and that is the subject of other articles.
Update: Load Balancing in Amazon EC2 with HAProxy. Grig Gheorghiu writes a nice post on HAProxy functionality and configuration: Emulating virtual servers, Logging, SSL, Load balancing algorithms, Session persistence with cookies, Server health checks, etc. Adapted From the website: HAProxy is a free, very fast and reliable solution offering high availability, load balancing, and proxying for TCP and HTTP-based applications. It is particularly suited for web sites crawling under very high loads while needing persistence or Layer7 processing. Supporting tens of thousands of connections is clearly realistic with todays hardware. Its mode of operation makes its integration into existing architectures very easy and riskless, while still offering the possibility not to expose fragile web servers to the Net. Currently, two major versions are supported : * version 1.1 - maintains critical sites online since 200 The most stable and reliable, has reached years of uptime. Receives no new feature, dedicated to mission-critical usages only. * version 1.2 - opening the way to very high traffic sites The same as 1.1 with some new features such as poll/epoll support for very large number of sessions, IPv6 on the client side, application cookies, hot-reconfiguration, advanced dynamic load regulation, TCP keepalive, source hash, weighted load balancing, rbtree-based scheduler, and a nice Web status page. This code is still evolving but has significantly stabilized since 1.2.8. Unlike other free "cheap" load-balancing solutions, this product is only used by a few hundreds of people around the world, but those people run very big sites serving several millions hits and between several tens of gigabytes to several terabytes per day to hundreds of thousands of clients. They need 24x7 availability and have internal skills to risk to maintain a free software solution. Often, the solution is deployed for internal uses and I only know about it when they send me some positive feedback or when they ask for a missing feature ;-) According to many users HAProxy competes quite well with the likes of Pound and Ultramonkey.
Beta testers wanted for ultra high-scalability/performance clustered object storage system designed for web content delivery
DataDirect Networks (www.ddn.com) is searching for beta testers for our exciting new object-based clustered storage system. Does this sound like you? * Need to store millions to hundreds of billions of files * Want to use one big file system but can't because no single file system scales big enough * Running out of inodes * Have to constantly tweak file systems to perform better * Need to replicate content to more than one data center across geographies * Have thumbnail images or other small files that wreak havoc on your file and storage systems * Constantly tweaking and engineering around performance and scalability limits * No storage system delivers enough IOPS to serve your content * Spend time load balancing the storage environment * Want a single, simple way to manage all this data If this sounds like you, please contact me at email@example.com. DataDirect Networks is a 10-year old, well-established storage systems company specializing in Extreme Storage environments. We've deployed both the largest and the fastest storage/file systems on the planet - currently running at over 250GB/s. Our upcoming product is going to change the way storage is deployed for scalable web content and we're seeking testers who can throw their most challenging problems at our new system. It's time for something better and we're going to deliver it.
Update:How-To Minimize Load Time for Fast User Experiences. Shows how to analyze the bottlenecks preventing websites and blogs from loading quickly and how to resolve them. 80-90% of the end-user response time is spent on the frontend, so it makes sense to concentrate efforts there before heroically rewriting the backend. Take a shower before buying a Porsche, if you know what I mean. Steve Souders, author of High Performance Websites and Yslow, has ten more best practices to speed up your website:
To scale in the large you have to partition. Data has to be spread around, replicated, and kept consistent (keeping replicas sufficiently similar to one another despite operations being submitted independently at different sites). The result is a highly available, well performing, and scalable system. Partitioning is required, but it's a pain to do efficiently and correctly. Until Quantum teleportation becomes a reality how data is kept consistent across a bewildering number of failure scenarios is a key design decision. This excellent paper by Yasushi Saito and Marc Shapiro takes us on a wild ride (OK, maybe not so wild) of different approaches to achieving consistency. What's cool about this paper is they go over some real systems that we are familiar with and cover how they work: DNS (single-master, state-transfer), Usenet (multi-master), PDAs (multi-master, state-transfer, manual or application-specific conflict resolution), Bayou (multi-master, operation-transfer, epidemic propagation, application conflict resolution), CVS (multi-master operation-transfer, centralized, manual conflict resolution). The paper then goes on to explain in detail the different approaches to achieving consistency. Most of us will never have to write the central nervous system of an application like this, but knowing about the different approaches and tradesoffs is priceless. The abstract:
Data replication is a key technology in distributed data sharing systems, enabling higher availability and performance. This paper surveys optimistic replication algorithms that allow replica contents to diverge in the short term, in order to support concurrent work practices and to tolerate failures in low-quality communication links. The importance of such techniques is increasing as collaboration through wide-area and mobile networks becomes popular. Optimistic replication techniques are different from traditional “pessimistic” ones. Instead of synchronous replica coordination, an optimistic algorithm propagates changes in the background, discovers conflicts after they happen and reaches agreement on the final contents incrementally. We explore the solution space for optimistic replication algorithms. This paper identifies key challenges facing optimistic replication systems — ordering operations, detecting and resolving conflicts, propagating changes efficiently, and bounding replica divergence—and provides a comprehensive survey of techniques developed for addressing these challenges.If you can't wait to know the ending, here's the summary of the paper:
We summarize some of the lessons learned from our own experience and in reviewing the literature. Optimistic, asynchronous data replication is an appealing technique; it indeed improves networking flexibility and scalability. Some environments or application areas could simply not function without optimistic replication. However, optimistic replication also comes with a cost. The algorithmic complexity of ensuring eventual consistency can be high. Conflicts usually require application-specific resolution, and the lost update problem is ultimately unavoidable. Hence our recommendations: (1) Keep it simple. Traditional, pessimistic replication, with many off-the-shelf solutions, is perfectly adequate in small-scale, fully connected, reliable networking environments. Where pessimistic techniques are the cause of poor performance or lack of availability, or do not scale well, try single-master replication: it is simple, conflictfree, and scales well in practice. State transfer using Thomas’s write rule works well for many applications. Advanced techniques such as version vectors and operation transfer should be used only when you need flexibility and semantically rich conflict resolution. (2) Propagate operations quickly to avoid conflicts. While connected, propagate often and keep replicas in close synchronization. This will minimize divergence when disconnection does occur. (3) Exploit commutativity. Commutativity should be the default; design your system so that non-commutative operations are the uncommon case. For instance, whenever possible, partition data into small, independent objects. Within an object, use monotonic data structures such as an append-only log, a monotonically increasing counter, or a union-only set. When operations are dependent upon each other, represent the invariants explicitly.
Yes, I just got through watching the Superbowl so chips and salsa are on my mind and in my stomach. In recreational eating more chips requires downing more salsa. With mulitcore chips it turns out as cores go up salsa goes down, salsa obviously being a metaphor for speed. Sandia National Laboratories found in their simulations: a significant increase in speed going from two to four multicores, but an insignificant increase from four to eight multicores. Exceeding eight multicores causes a decrease in speed. Sixteen multicores perform barely as well as two, and after that, a steep decline is registered as more cores are added. The problem is the lack of memory bandwidth as well as contention between processors over the memory bus available to each processor. The implication for those following a diagonal scaling strategy is to work like heck to make your system fit within eight multicores. After that you'll need to consider some sort of partitioning strategy. What's interesting is the research on where the cutoff point will be.
The 5th annual MySQL Conference & Expo, co-presented by Sun Microsystems, MySQL and O'Reilly Media. Happening April 20-23, 2009 in Santa Clara, CA, at the Santa Clara Convention Center and Hyatt Regency Santa Clara, brings over 2,000 open source and database enthusiasts together to harness the power of MySQL and celebrate the huge MySQL ecosystem. All around the world, people just like you are innovating with MySQL—and MySQL is fueling the innovation engine by releasing new mission critical solutions to help you work smarter. This deeply technical conference brings all of that creativity, energy, and knowledge together in one place for four very full days. Early registration ends February 16, 2009. The largest gathering of MySQL developers, users, and DBAs worldwide, the event reflects MySQL's wide-ranging appeal and capabilities. The open atmosphere of the MySQL Conference & Expo helps IT professionals and community members launch and develop the best database applications, tools, and software. As companies of all sizes look for ways to remain competitive and manage costs, open source software and tools provide valuable and efficient solutions for the enterprise. The 2009 edition of the MySQL Conference & Expo will present strategies for businesses to not just survive, but thrive in a challenging economy. Through expert instruction, hands-on tutorials, and readily available MySQL developers, users at all levels gain the knowledge they need to rapidly build solid applications with MySQL that scale with the enterprise. New to the 2009 program will be MySQL Camp, a space where any and all participants can create an "unconference" within the larger event.
The multi-cores are coming and software designed for fewer cores usually doesn't work on more cores without substantial redesign. For a taste of the issues take a look at No new global mutexes! (and how to make the thread/connection pool work), which shows some of the difficulties of making MySQL perform on SMP servers. In this paper, Richard Smith, a –Staff Engineer at Sun, goes into some nice detail on multi-core issues. His take home lessons are:
Hello, I'm developing a product I thought of. As a part of it, I'm trying to figure out the best architecture for the product. It's a server application, what is supposed to serve A LOT of users at a time. Think of an IM application, but not web style, more like ICQ\MSN applications... With web orientation. I've read all about Meebo, Facebook Chat & etc architectures. But I'm still not sure where to start on ICQ style (this is the first phase before I go totally web). Can you direct me to some information on load balancing 10 million users? ;-) I just don't know here to begin... Thanks, Amit