Hi, We are running a backup solution that uploads every night the files our clients worked on during the day (Cabonite-like). We have currently about 10GB of data per night, via http PUT requests (1 per file), and the files are written as-is on a NAS. Our architecture is basically compound of a load balancer (hardware, sticky sessions), 5 servers (Tomcat under RHEL4/5, ) and a NAS (nfs 3). Since our number of clients is rising, (as is our system load) how would you recommend we could scale our infrastructure? hardware and software? Should we go towards NAS sharding, more servers, NIO on tomcat...? Thanks for your inputs!
Google AppEngine Numbers
This group of numbers is from Brett Slatkin in Building Scalable Web Apps with Google App Engine.
Writes are expensive!
* The size and shape of your data
* Doing work in batches (batch puts and gets)
Reads are cheap!
* For a 1MB entity, that's 4000 fetches/sec
Numbers MiscellaneousThis group of numbers is from a presentation Jeff Dean gave at a Engineering All-Hands Meeting at Google.
The TechniquesKeep in mind these are from a Google AppEngine perspective, but the ideas are generally applicable.
Sharded CountersWe always seem to want to keep count of things. But BigTable doesn't keep a count of entities because it's a key-value store. It's very good at getting data by keys, it's not interested in how many you have. So the job of keeping counts is shifted to you.
The naive counter implementation is to lock-read-increment-write. This is fine if there a low number of writes. But if there are frequent updates there's high contention. Given the the number of writes that can be made per second is so limited, a high write load serializes and slows down the whole process.
The solution is to shard counters. This means:
This approach seems counter-intuitive because we are used to a counter being a single incrementable variable. Reads are cheap so we replace having a single easily read counter with having to make multiple reads to recover the actual count. Frequently updated shared variables are expensive so we shard and parallelize those writes.
With a centralized database letting the database be the source of sequence numbers is doable. But to scale writes you need to partition and once you partition it becomes difficult to keep any shared state like counters. You might argue that so common a feature should be provided by GAE and I would agree 100 percent, but it's the ideas that count (pun intended).
Paging Through Comments
How can comments be stored such that they can be paged through
in roughly the order they were entered?
Under a high write load situation this is a surprisingly hard question to answer. Obviously what you want is just a counter. As a comment is made you get a sequence number and that's the order comments are displayed. But as we saw in the last section shared state like a single counter won't scale in high write environments.
A sharded counter won't work in this situation either because summing the shared counters isn't transactional. There's no way to guarantee each comment will get back the sequence number it allocated so we could have duplicates.
Searches in BigTable return data in alphabetical order. So what is needed for a key is something unique and alphabetical so when searching through comments you can go forward and backward using only keys.
A lot of paging algorithms use counts. Give me records 1-20, 21-30, etc. SQL makes this easy, but it doesn't work for BigTable. BigTable knows how to get things by keys so you must make keys that return data in the proper order.
In the grand old tradition of making unique keys we just keep appending stuff until it becomes unique. The suggested key for GAE is: time stamp + user ID + user comment ID.
Ordering by date is obvious. The good thing is getting a time stamp is a local decision, it doesn't rely on writes and is scalable. The problem is timestamps are not unique, especially with a lot of users.
So we can add the user name to the key to distinguish it from all other comments made at the same time. We already have the user name so this too is a cheap call.
Theoretically even time stamps for a single user aren't sufficient. What we need then is a sequence number for each user's comments.
And this is where the GAE solution turns into something totally unexpected. Our goal is to remove write contention so we want to parallelize writes. And we have a lot available storage so we don't have to worry about that.
With these forces in mind, the idea is to create a counter per user. When a user adds a comment it's added to a user's comment list and a sequence number is allocated. Comments are added in a transactional context on a per user basis using Entity Groups. So each comment add is guaranteed to be unique because updates in an Entity Group are serialized.
The resulting key is guaranteed unique and sorts properly in alphabetical order. When paging a query is made across entity groups using the ID index. The results will be in the correct order. Paging is a matter of getting the previous and next keys in the query for the current page. These keys can then be used to move through index.
I certainly would have never thought of this approach. The idea of keeping per user comment indexes is out there. But it cleverly follows the rules of scaling in a distributed system. Writes and reads are done in parallel and that's the goal. Write contention is removed.
Moshe Kaplan of RockeTier shows the life cycle of an affiliate marketing system that starts off as a cub handling one million events per day and ends up a lion handling 200 million to even one billion events per day. The resulting system uses ten commodity servers at a cost of $35,000. Mr. Kaplan's paper is especially interesting because it documents a system architecture evolution we may see a lot more of in the future: database centric --> cache centric --> memory grid. As scaling and performance requirements for complicated operations increase, leaving the entire system in memory starts to make a great deal of sense. Why use cache at all? Why shouldn't your system be all in memory from the start?
General Approach to Evolving the System to Scale
One Million Event Per Day System
2.5 Million Event Per Day System
20 Million Event Per Day System
200 Million Event Per Day System
Joe Stump, Lead Architect at Digg, gave this presentation at the Web 2.0 Expo. I couldn't find the actual presentation, but fortunately Kris Jordan took some great notes. That's how key moments in history are accidentally captured forever. Joe was also kind enough to respond to my email questions with a phone call. In this first part of the post Joe shares some timeless wisdom that you may or may not have read before. I of course take some pains to extract all the wit from the original presentation in favor of simple rules. What really struck me however was how Joe thought MemcacheDB Will be the biggest new kid on the block in scaling. MemcacheDB has been around for a little while and I've never thought of it in that way. Well learn why Joe is so excited by MemcacheDB at the end of the post.
MemcacheDB: Evolutionary Step for Code, Revolutionary Step for PerformanceImagine Kevin Rose, the founder of Digg, who at the time of this presentation had 40,000 followers. If Kevin diggs just once a day that's 40,000 writes. As the most active diggers are the most followed it becomes a huge performance bottleneck. Two problems appear. You can't update 40,000 follower accounts at once. Fortunately the queuing system we talked about earlier takes care of that. The second problem is the huge number of writes that happen. Digg has a write problem. If the average user has 100 followers that’s 300 million diggs day. That's 3,000 writes per second, 7GB of storage per day, and 5TB of data spread across 50 to 60 servers. With such a heavy write load MySQL wasn’t going to work for Digg. That’s where MemcacheDB comes in. In Initial tests on a laptop MemcacheDB was able to handle 15,000 writes a second. MemcacheDB's own benchmark shows it capable of 23,000 writes/second and 64,000 reads/second. At those write rates it's easy to see why Joe was so excited about MemcacheDB's ability to handle their digg deluge. What is MemcacheDB? It's a distributed key-value storage system designed for persistent. It is NOT a cache solution, but a persistent storage engine for fast and reliable key-value based object storage and retrieval. It conforms to memcache protocol(not completed, see below), so any memcached client can have connectivity with it. MemcacheDB uses Berkeley DB as a storing backend, so lots of features including transaction and replication are supported. Before you get too excited keep in mind this is a key-value store. You read and write records by a single key. There aren't multiple indexes and there's no SQL. That's why it can be so fast. Digg uses MemcacheDB to scale out the huge number of writes that happen when data is denormalized. Remember it's a key-value store. The value is usually a complete application level object merged together from a possibly large number of normalized tables. Denormalizing introduces redundancies because you are keeping copies of data in multiple records instead of just one copy in a nicely normalized table. So denormalization means a lot more writes as data must be copied to all the records that contain a copy. To keep up they needed a database capable of handling their write load. MemcacheDB has the performance, especially when you layer memcached's normal partitioning scheme on top. I asked Joe why he didn't turn to one of the in-memory data grid solutions? Some of the reasons were:
Update:Presentation: Behind the Scenes at MySpace.com. Dan Farino, Chief Systems Architect at MySpace shares details of some of MySpace's cool internal operations tools. MySpace.com is one of the fastest growing site on the Internet with 65 million subscribers and 260,000 new users registering each day. Often criticized for poor performance, MySpace has had to tackle scalability issues few other sites have faced. How did they do it? Site: http://myspace.com
Henry Robinson has created an excellent series of articles on consensus protocols. Henry starts with a very useful discussion of what all this talk about consensus really means: The consensus problem is the problem of getting a set of nodes in a distributed system to agree on something - it might be a value, a course of action or a decision. Achieving consensus allows a distributed system to act as a single entity, with every individual node aware of and in agreement with the actions of the whole of the network. In this article Henry tackles Two-Phase Commit, the protocol most databases use to arrive at a consensus for database writes. The article is very well written with lots of pretty and informative pictures. He did a really good job. In conclusion we learn 2PC is very efficient, a minimal number of messages are exchanged and latency is low. The problem is when a co-ordinator fails availability is dramatically reduced. This is why 2PC isn't generally used on highly distributed systems. To solve that problem we have to move on to different algorithms and that is the subject of other articles.
Update: Load Balancing in Amazon EC2 with HAProxy. Grig Gheorghiu writes a nice post on HAProxy functionality and configuration: Emulating virtual servers, Logging, SSL, Load balancing algorithms, Session persistence with cookies, Server health checks, etc. Adapted From the website: HAProxy is a free, very fast and reliable solution offering high availability, load balancing, and proxying for TCP and HTTP-based applications. It is particularly suited for web sites crawling under very high loads while needing persistence or Layer7 processing. Supporting tens of thousands of connections is clearly realistic with todays hardware. Its mode of operation makes its integration into existing architectures very easy and riskless, while still offering the possibility not to expose fragile web servers to the Net. Currently, two major versions are supported : * version 1.1 - maintains critical sites online since 200 The most stable and reliable, has reached years of uptime. Receives no new feature, dedicated to mission-critical usages only. * version 1.2 - opening the way to very high traffic sites The same as 1.1 with some new features such as poll/epoll support for very large number of sessions, IPv6 on the client side, application cookies, hot-reconfiguration, advanced dynamic load regulation, TCP keepalive, source hash, weighted load balancing, rbtree-based scheduler, and a nice Web status page. This code is still evolving but has significantly stabilized since 1.2.8. Unlike other free "cheap" load-balancing solutions, this product is only used by a few hundreds of people around the world, but those people run very big sites serving several millions hits and between several tens of gigabytes to several terabytes per day to hundreds of thousands of clients. They need 24x7 availability and have internal skills to risk to maintain a free software solution. Often, the solution is deployed for internal uses and I only know about it when they send me some positive feedback or when they ask for a missing feature ;-) According to many users HAProxy competes quite well with the likes of Pound and Ultramonkey.
Beta testers wanted for ultra high-scalability/performance clustered object storage system designed for web content delivery
DataDirect Networks (www.ddn.com) is searching for beta testers for our exciting new object-based clustered storage system. Does this sound like you? * Need to store millions to hundreds of billions of files * Want to use one big file system but can't because no single file system scales big enough * Running out of inodes * Have to constantly tweak file systems to perform better * Need to replicate content to more than one data center across geographies * Have thumbnail images or other small files that wreak havoc on your file and storage systems * Constantly tweaking and engineering around performance and scalability limits * No storage system delivers enough IOPS to serve your content * Spend time load balancing the storage environment * Want a single, simple way to manage all this data If this sounds like you, please contact me at firstname.lastname@example.org. DataDirect Networks is a 10-year old, well-established storage systems company specializing in Extreme Storage environments. We've deployed both the largest and the fastest storage/file systems on the planet - currently running at over 250GB/s. Our upcoming product is going to change the way storage is deployed for scalable web content and we're seeking testers who can throw their most challenging problems at our new system. It's time for something better and we're going to deliver it.
Update:How-To Minimize Load Time for Fast User Experiences. Shows how to analyze the bottlenecks preventing websites and blogs from loading quickly and how to resolve them. 80-90% of the end-user response time is spent on the frontend, so it makes sense to concentrate efforts there before heroically rewriting the backend. Take a shower before buying a Porsche, if you know what I mean. Steve Souders, author of High Performance Websites and Yslow, has ten more best practices to speed up your website:
To scale in the large you have to partition. Data has to be spread around, replicated, and kept consistent (keeping replicas sufficiently similar to one another despite operations being submitted independently at different sites). The result is a highly available, well performing, and scalable system. Partitioning is required, but it's a pain to do efficiently and correctly. Until Quantum teleportation becomes a reality how data is kept consistent across a bewildering number of failure scenarios is a key design decision. This excellent paper by Yasushi Saito and Marc Shapiro takes us on a wild ride (OK, maybe not so wild) of different approaches to achieving consistency. What's cool about this paper is they go over some real systems that we are familiar with and cover how they work: DNS (single-master, state-transfer), Usenet (multi-master), PDAs (multi-master, state-transfer, manual or application-specific conflict resolution), Bayou (multi-master, operation-transfer, epidemic propagation, application conflict resolution), CVS (multi-master operation-transfer, centralized, manual conflict resolution). The paper then goes on to explain in detail the different approaches to achieving consistency. Most of us will never have to write the central nervous system of an application like this, but knowing about the different approaches and tradesoffs is priceless. The abstract:
Data replication is a key technology in distributed data sharing systems, enabling higher availability and performance. This paper surveys optimistic replication algorithms that allow replica contents to diverge in the short term, in order to support concurrent work practices and to tolerate failures in low-quality communication links. The importance of such techniques is increasing as collaboration through wide-area and mobile networks becomes popular. Optimistic replication techniques are different from traditional “pessimistic” ones. Instead of synchronous replica coordination, an optimistic algorithm propagates changes in the background, discovers conflicts after they happen and reaches agreement on the final contents incrementally. We explore the solution space for optimistic replication algorithms. This paper identifies key challenges facing optimistic replication systems — ordering operations, detecting and resolving conflicts, propagating changes efficiently, and bounding replica divergence—and provides a comprehensive survey of techniques developed for addressing these challenges.If you can't wait to know the ending, here's the summary of the paper:
We summarize some of the lessons learned from our own experience and in reviewing the literature. Optimistic, asynchronous data replication is an appealing technique; it indeed improves networking flexibility and scalability. Some environments or application areas could simply not function without optimistic replication. However, optimistic replication also comes with a cost. The algorithmic complexity of ensuring eventual consistency can be high. Conflicts usually require application-specific resolution, and the lost update problem is ultimately unavoidable. Hence our recommendations: (1) Keep it simple. Traditional, pessimistic replication, with many off-the-shelf solutions, is perfectly adequate in small-scale, fully connected, reliable networking environments. Where pessimistic techniques are the cause of poor performance or lack of availability, or do not scale well, try single-master replication: it is simple, conflictfree, and scales well in practice. State transfer using Thomas’s write rule works well for many applications. Advanced techniques such as version vectors and operation transfer should be used only when you need flexibility and semantically rich conflict resolution. (2) Propagate operations quickly to avoid conflicts. While connected, propagate often and keep replicas in close synchronization. This will minimize divergence when disconnection does occur. (3) Exploit commutativity. Commutativity should be the default; design your system so that non-commutative operations are the uncommon case. For instance, whenever possible, partition data into small, independent objects. Within an object, use monotonic data structures such as an append-only log, a monotonically increasing counter, or a union-only set. When operations are dependent upon each other, represent the invariants explicitly.