Moshe Kaplan of RockeTier shows the life cycle of an affiliate marketing system that starts off as a cub handling one million events per day and ends up a lion handling 200 million to even one billion events per day. The resulting system uses ten commodity servers at a cost of $35,000. Mr. Kaplan's paper is especially interesting because it documents a system architecture evolution we may see a lot more of in the future: database centric --> cache centric --> memory grid. As scaling and performance requirements for complicated operations increase, leaving the entire system in memory starts to make a great deal of sense. Why use cache at all? Why shouldn't your system be all in memory from the start?
Joe Stump, Lead Architect at Digg, gave this presentation at the Web 2.0 Expo. I couldn't find the actual presentation, but fortunately Kris Jordan took some great notes. That's how key moments in history are accidentally captured forever. Joe was also kind enough to respond to my email questions with a phone call. In this first part of the post Joe shares some timeless wisdom that you may or may not have read before. I of course take some pains to extract all the wit from the original presentation in favor of simple rules. What really struck me however was how Joe thought MemcacheDB Will be the biggest new kid on the block in scaling. MemcacheDB has been around for a little while and I've never thought of it in that way. Well learn why Joe is so excited by MemcacheDB at the end of the post.
MemcacheDB: Evolutionary Step for Code, Revolutionary Step for PerformanceImagine Kevin Rose, the founder of Digg, who at the time of this presentation had 40,000 followers. If Kevin diggs just once a day that's 40,000 writes. As the most active diggers are the most followed it becomes a huge performance bottleneck. Two problems appear. You can't update 40,000 follower accounts at once. Fortunately the queuing system we talked about earlier takes care of that. The second problem is the huge number of writes that happen. Digg has a write problem. If the average user has 100 followers that’s 300 million diggs day. That's 3,000 writes per second, 7GB of storage per day, and 5TB of data spread across 50 to 60 servers. With such a heavy write load MySQL wasn’t going to work for Digg. That’s where MemcacheDB comes in. In Initial tests on a laptop MemcacheDB was able to handle 15,000 writes a second. MemcacheDB's own benchmark shows it capable of 23,000 writes/second and 64,000 reads/second. At those write rates it's easy to see why Joe was so excited about MemcacheDB's ability to handle their digg deluge. What is MemcacheDB? It's a distributed key-value storage system designed for persistent. It is NOT a cache solution, but a persistent storage engine for fast and reliable key-value based object storage and retrieval. It conforms to memcache protocol(not completed, see below), so any memcached client can have connectivity with it. MemcacheDB uses Berkeley DB as a storing backend, so lots of features including transaction and replication are supported. Before you get too excited keep in mind this is a key-value store. You read and write records by a single key. There aren't multiple indexes and there's no SQL. That's why it can be so fast. Digg uses MemcacheDB to scale out the huge number of writes that happen when data is denormalized. Remember it's a key-value store. The value is usually a complete application level object merged together from a possibly large number of normalized tables. Denormalizing introduces redundancies because you are keeping copies of data in multiple records instead of just one copy in a nicely normalized table. So denormalization means a lot more writes as data must be copied to all the records that contain a copy. To keep up they needed a database capable of handling their write load. MemcacheDB has the performance, especially when you layer memcached's normal partitioning scheme on top. I asked Joe why he didn't turn to one of the in-memory data grid solutions? Some of the reasons were:
Update:Presentation: Behind the Scenes at MySpace.com. Dan Farino, Chief Systems Architect at MySpace shares details of some of MySpace's cool internal operations tools. MySpace.com is one of the fastest growing site on the Internet with 65 million subscribers and 260,000 new users registering each day. Often criticized for poor performance, MySpace has had to tackle scalability issues few other sites have faced. How did they do it? Site: http://myspace.com
Henry Robinson has created an excellent series of articles on consensus protocols. Henry starts with a very useful discussion of what all this talk about consensus really means: The consensus problem is the problem of getting a set of nodes in a distributed system to agree on something - it might be a value, a course of action or a decision. Achieving consensus allows a distributed system to act as a single entity, with every individual node aware of and in agreement with the actions of the whole of the network. In this article Henry tackles Two-Phase Commit, the protocol most databases use to arrive at a consensus for database writes. The article is very well written with lots of pretty and informative pictures. He did a really good job. In conclusion we learn 2PC is very efficient, a minimal number of messages are exchanged and latency is low. The problem is when a co-ordinator fails availability is dramatically reduced. This is why 2PC isn't generally used on highly distributed systems. To solve that problem we have to move on to different algorithms and that is the subject of other articles.
Update: Load Balancing in Amazon EC2 with HAProxy. Grig Gheorghiu writes a nice post on HAProxy functionality and configuration: Emulating virtual servers, Logging, SSL, Load balancing algorithms, Session persistence with cookies, Server health checks, etc. Adapted From the website: HAProxy is a free, very fast and reliable solution offering high availability, load balancing, and proxying for TCP and HTTP-based applications. It is particularly suited for web sites crawling under very high loads while needing persistence or Layer7 processing. Supporting tens of thousands of connections is clearly realistic with todays hardware. Its mode of operation makes its integration into existing architectures very easy and riskless, while still offering the possibility not to expose fragile web servers to the Net. Currently, two major versions are supported : * version 1.1 - maintains critical sites online since 200 The most stable and reliable, has reached years of uptime. Receives no new feature, dedicated to mission-critical usages only. * version 1.2 - opening the way to very high traffic sites The same as 1.1 with some new features such as poll/epoll support for very large number of sessions, IPv6 on the client side, application cookies, hot-reconfiguration, advanced dynamic load regulation, TCP keepalive, source hash, weighted load balancing, rbtree-based scheduler, and a nice Web status page. This code is still evolving but has significantly stabilized since 1.2.8. Unlike other free "cheap" load-balancing solutions, this product is only used by a few hundreds of people around the world, but those people run very big sites serving several millions hits and between several tens of gigabytes to several terabytes per day to hundreds of thousands of clients. They need 24x7 availability and have internal skills to risk to maintain a free software solution. Often, the solution is deployed for internal uses and I only know about it when they send me some positive feedback or when they ask for a missing feature ;-) According to many users HAProxy competes quite well with the likes of Pound and Ultramonkey.
Beta testers wanted for ultra high-scalability/performance clustered object storage system designed for web content delivery
DataDirect Networks (www.ddn.com) is searching for beta testers for our exciting new object-based clustered storage system. Does this sound like you? * Need to store millions to hundreds of billions of files * Want to use one big file system but can't because no single file system scales big enough * Running out of inodes * Have to constantly tweak file systems to perform better * Need to replicate content to more than one data center across geographies * Have thumbnail images or other small files that wreak havoc on your file and storage systems * Constantly tweaking and engineering around performance and scalability limits * No storage system delivers enough IOPS to serve your content * Spend time load balancing the storage environment * Want a single, simple way to manage all this data If this sounds like you, please contact me at firstname.lastname@example.org. DataDirect Networks is a 10-year old, well-established storage systems company specializing in Extreme Storage environments. We've deployed both the largest and the fastest storage/file systems on the planet - currently running at over 250GB/s. Our upcoming product is going to change the way storage is deployed for scalable web content and we're seeking testers who can throw their most challenging problems at our new system. It's time for something better and we're going to deliver it.
Update:How-To Minimize Load Time for Fast User Experiences. Shows how to analyze the bottlenecks preventing websites and blogs from loading quickly and how to resolve them. 80-90% of the end-user response time is spent on the frontend, so it makes sense to concentrate efforts there before heroically rewriting the backend. Take a shower before buying a Porsche, if you know what I mean. Steve Souders, author of High Performance Websites and Yslow, has ten more best practices to speed up your website:
To scale in the large you have to partition. Data has to be spread around, replicated, and kept consistent (keeping replicas sufficiently similar to one another despite operations being submitted independently at different sites). The result is a highly available, well performing, and scalable system. Partitioning is required, but it's a pain to do efficiently and correctly. Until Quantum teleportation becomes a reality how data is kept consistent across a bewildering number of failure scenarios is a key design decision. This excellent paper by Yasushi Saito and Marc Shapiro takes us on a wild ride (OK, maybe not so wild) of different approaches to achieving consistency. What's cool about this paper is they go over some real systems that we are familiar with and cover how they work: DNS (single-master, state-transfer), Usenet (multi-master), PDAs (multi-master, state-transfer, manual or application-specific conflict resolution), Bayou (multi-master, operation-transfer, epidemic propagation, application conflict resolution), CVS (multi-master operation-transfer, centralized, manual conflict resolution). The paper then goes on to explain in detail the different approaches to achieving consistency. Most of us will never have to write the central nervous system of an application like this, but knowing about the different approaches and tradesoffs is priceless. The abstract:
Data replication is a key technology in distributed data sharing systems, enabling higher availability and performance. This paper surveys optimistic replication algorithms that allow replica contents to diverge in the short term, in order to support concurrent work practices and to tolerate failures in low-quality communication links. The importance of such techniques is increasing as collaboration through wide-area and mobile networks becomes popular. Optimistic replication techniques are different from traditional “pessimistic” ones. Instead of synchronous replica coordination, an optimistic algorithm propagates changes in the background, discovers conflicts after they happen and reaches agreement on the final contents incrementally. We explore the solution space for optimistic replication algorithms. This paper identifies key challenges facing optimistic replication systems — ordering operations, detecting and resolving conflicts, propagating changes efficiently, and bounding replica divergence—and provides a comprehensive survey of techniques developed for addressing these challenges.If you can't wait to know the ending, here's the summary of the paper:
We summarize some of the lessons learned from our own experience and in reviewing the literature. Optimistic, asynchronous data replication is an appealing technique; it indeed improves networking flexibility and scalability. Some environments or application areas could simply not function without optimistic replication. However, optimistic replication also comes with a cost. The algorithmic complexity of ensuring eventual consistency can be high. Conflicts usually require application-specific resolution, and the lost update problem is ultimately unavoidable. Hence our recommendations: (1) Keep it simple. Traditional, pessimistic replication, with many off-the-shelf solutions, is perfectly adequate in small-scale, fully connected, reliable networking environments. Where pessimistic techniques are the cause of poor performance or lack of availability, or do not scale well, try single-master replication: it is simple, conflictfree, and scales well in practice. State transfer using Thomas’s write rule works well for many applications. Advanced techniques such as version vectors and operation transfer should be used only when you need flexibility and semantically rich conflict resolution. (2) Propagate operations quickly to avoid conflicts. While connected, propagate often and keep replicas in close synchronization. This will minimize divergence when disconnection does occur. (3) Exploit commutativity. Commutativity should be the default; design your system so that non-commutative operations are the uncommon case. For instance, whenever possible, partition data into small, independent objects. Within an object, use monotonic data structures such as an append-only log, a monotonically increasing counter, or a union-only set. When operations are dependent upon each other, represent the invariants explicitly.
Yes, I just got through watching the Superbowl so chips and salsa are on my mind and in my stomach. In recreational eating more chips requires downing more salsa. With mulitcore chips it turns out as cores go up salsa goes down, salsa obviously being a metaphor for speed. Sandia National Laboratories found in their simulations: a significant increase in speed going from two to four multicores, but an insignificant increase from four to eight multicores. Exceeding eight multicores causes a decrease in speed. Sixteen multicores perform barely as well as two, and after that, a steep decline is registered as more cores are added. The problem is the lack of memory bandwidth as well as contention between processors over the memory bus available to each processor. The implication for those following a diagonal scaling strategy is to work like heck to make your system fit within eight multicores. After that you'll need to consider some sort of partitioning strategy. What's interesting is the research on where the cutoff point will be.
The 5th annual MySQL Conference & Expo, co-presented by Sun Microsystems, MySQL and O'Reilly Media. Happening April 20-23, 2009 in Santa Clara, CA, at the Santa Clara Convention Center and Hyatt Regency Santa Clara, brings over 2,000 open source and database enthusiasts together to harness the power of MySQL and celebrate the huge MySQL ecosystem. All around the world, people just like you are innovating with MySQL—and MySQL is fueling the innovation engine by releasing new mission critical solutions to help you work smarter. This deeply technical conference brings all of that creativity, energy, and knowledge together in one place for four very full days. Early registration ends February 16, 2009. The largest gathering of MySQL developers, users, and DBAs worldwide, the event reflects MySQL's wide-ranging appeal and capabilities. The open atmosphere of the MySQL Conference & Expo helps IT professionals and community members launch and develop the best database applications, tools, and software. As companies of all sizes look for ways to remain competitive and manage costs, open source software and tools provide valuable and efficient solutions for the enterprise. The 2009 edition of the MySQL Conference & Expo will present strategies for businesses to not just survive, but thrive in a challenging economy. Through expert instruction, hands-on tutorials, and readily available MySQL developers, users at all levels gain the knowledge they need to rapidly build solid applications with MySQL that scale with the enterprise. New to the 2009 program will be MySQL Camp, a space where any and all participants can create an "unconference" within the larger event.