The multi-cores are coming and software designed for fewer cores usually doesn't work on more cores without substantial redesign. For a taste of the issues take a look at No new global mutexes! (and how to make the thread/connection pool work), which shows some of the difficulties of making MySQL perform on SMP servers. In this paper, Richard Smith, a –Staff Engineer at Sun, goes into some nice detail on multi-core issues. His take home lessons are:
Hello, I'm developing a product I thought of. As a part of it, I'm trying to figure out the best architecture for the product. It's a server application, what is supposed to serve A LOT of users at a time. Think of an IM application, but not web style, more like ICQ\MSN applications... With web orientation. I've read all about Meebo, Facebook Chat & etc architectures. But I'm still not sure where to start on ICQ style (this is the first phase before I go totally web). Can you direct me to some information on load balancing 10 million users? ;-) I just don't know here to begin... Thanks, Amit
I follow a certain philosophy when developing system architectures. I assume that very few systems will ever exist in a consistent form for more than a short period of time. What constitutes a “short period of time” differs depending on the specifics of each system, but in an effort to quantify it, I generally find that it falls somewhere between a week and a month. The driving forces behind the need for an ever changing architecture are largely business requirement based. This is a side effect of the reality that software development, in most cases, is used as a supporting role within the business unit it serves. As business requirements (i.e. additional features, new products, etc.) pour forth, it is the developer’s job to evolve their software system to accommodate these requirements and provide a software based solution to whatever problems lay ahead. Given that many businesses can be identified as having the above characteristics, I can now begin to explain why I believe that Heterogeneous System Architectures hold a significant advantage over Homogeneous System Architectures, in many distributed system cases.
At the Oracle Coherence Special Interest Group meeting today in London, Tomas Nilsson, the product manager for JRockit RT and JRockit Mission Control spoke about the future plans for JRockit and especially plans for improved Coherence JRockit integration.
Update 35: How and Why Glue is Using Amazon SimpleDB instead of a Relational Database. Discusses a key design decision that required duplicating data in order to mimic RDBMS joins: Given the trade off between potential inconsistencies and scalability, social services have to choose the latter. Update 34: Apparently Amazon pulled this article. I'm not sure what that means. Maybe time went backwards or something? Amazon dramatically drops SimpleDB pricing to $0.25 per GB per month from $1.50 per GB. This puts SimpleDB on par with Google App Engine. They also announced a few new features: a SQL-like SELECT API as well as a Batch Put operation to streamline uploading of multiple items or attributes. One of the complaints against SimpleDB is that programmers end up writing too much code to do simple things. These features and a much cheaper price should help considerably. And you can store lots of data now. GAE is still capped. Update 33: Amazon announces Elastic Block Store (EBS), which provides lots of normal looking disk along with value added features like snapshots and snapshot copying. But database's may find EBS too slow. RightScale tells us Why Amazon’s Elastic Block Store Matters. Update 32: You can now get all attributes for a property when querying. Previously only the ID was returned and the attributes had to be returned in separate calls. This makes the programmer's job a lot simpler. Artificial levels of parallelization code can now be dumped. Update 31: Amazon fixes a major hole in SimpleDB by adding the ability to sort query results. Previously developers had to sort results by hand which was a non-starter for many. Now you can do basic top 10 type queries with ease. Update 30: Amazon SimpleDB - A distributed, highly-scalable, light-weight, query-able, attribute store by Sebastian Stadil. It introduces the CAP theorem and the basics of SimpleDB. Sebastian does a lot of great work in the AWS world and in what must be his limited free time, runs the AWS Meetup group. Update 29: A stroll down the history of a previous RDBMS killer, object databases. Lots of fond memories of the new kid on the block showing us how objects and code were one, the endless OO vs. relational wars, writing a OODBMS training course, dealing with object migration and querying etc, and the slow decline followed by groveling in front of the old master. It would be a terrible irony if a hash table succeeded where OODBMSs failed. Update 28: I didn't make the beta program :-( Update 27: IBM has hired CouchDB creator Damien Katz as their player in the game. Teams Microsoft, IBM, and Amazon have all entered the race. Amazon is 10 furlongs ahead, but watch for team Google, a fast finisher on the outside. Update 26: Red Monk says Microsoft's Astoria project is SDBish, but developers are afraid of lock-in. Update 25: Nati Shalom thinks SDB isn't even a database. Update 24: Igvita asks why do you need SDB when Thrudb is faster and cheaper? It provides a memcached layer in front of a database storing data in S3. And even better, all its service names start with "thru" instead of "S". Update 23: For all you Perl haters, the Perl interface to SDB is clean and beautiful. Update 22: On an Erlang email list Jim Larson says the proper model is to store bulk data in S3 and indexable metadata in SimpleDB. The cost of SimpleDB is 10x for storing data versus S3. We are supposed to build our own inverted index for text searching, which is one of those decisions that sounds good in the meeting room (yay, we don't have to do all that work), but is not a good decision in the real world. Update 21: Sensepost is already creating attack models to drain your bank account through repeated queries. Update 20: Grow some stones, smoothspan says Eventual Consistency Is Not That Scary. Update 19: Jacob Harris in A First Look at Amazon SimpleDB offers up some beta Ruby libraries for accessing SDB. Update 18: Erlang folks hope to get some run, but Erlang the language is too different to go mainstream, though Erlang's concurrency model rocks. A while back I talked about how The Solution to C++ Threading is Erlang and how Java's concurrency approach is fundamentally broken. Update 17: Subbu tirelessly provides a A RESTful version of Amazon's SimpleDB. Update 16: Snarfed sees it as a sort of tuplespace implementation. Compare it to Facebook's API. Ning also has a data API. Update 15: Uncom thinks Winer & Scoble Fail In Tandem. SDB's XML response has 1,755% transmission overhead, which is genius for a per byte pricing model. And I love this one: if you are starting a business whose success hinges on scalability of a data store, you had best figure out how to shard across N machines before you launch. Using a single instance of MySQL for the whole thing is a strong indicator that you have failed at life. Update 14: Styled Bits sees SDB as more of a way to add metadata to S3 objects. Update 13: Bex Huff makes the point you'll still need a caching layer in front of SDB. Update 12: Shahzad Bhatti has been coding for SimpleDB for a few months and gives us a cool Java and Erlang API for basic CRUD operations. Update 11: DBA4Life says Amazon has just flux capacited us back to 1980s style database management. Update 10: Bob Warfield of SmoothSpan explains Why the Amazon SimpleDB is a Huge Next Step. It helps achieve the necessary "16:1 operations cost advantages over conventional software." Update 9: SimpleDB is berkleyDB and 90% of all computing will live in cloud city. Will the Troglyte's revolt? Update 8: Dave Winer says Amazon removes the database scaling wall by adding a storage ramp that scales up when needed and scales down when unneeded. You no longer need to buy expensive VC funded database talent to take your product to the next level. Update 7: Kevin Burton in Google vs Amazon in Open Infrastructure has doubts about the entire hosted model. Bandwidth costs too much, it might hurt your acquisition chances, and you can't trust 'em. He just wants to lease managed raw machine power. Update 6: Amazon SimpleDB and CouchDB compared. Some key differences: SimpleDB is hosted. CouchDB is REST/JSON and SimpleDB is REST/SOAP/XML. In SimpleDB attribute updates are atomic in CouchDB record updates are atomic. CouchDB supports JSON data types and SimpleDB thinks everything is a string. CouchDB has much more flexible indexing and queries. Update 5: Sriram Krishnan gives a more technical overview of SimpleDB. He likes the big hash table approach and brings up how the query language allows for parallelization. Update 4: Mark from areyouwatchingthis.com makes a really insightful point: I run a startup that gets 75% of our traffic from our API. The ability to move that processing and storage into a cloud _might_ save me a lot on hosting. Update 3: Marcelo Calbucci thinks SimpleDB is more of a directory service than a database because records can contain different attributes (no schema) and attributes can have multiple values. Update 2: Smug Mugs' Don MacAskill likes the service, but is concerned that field sizes are limited to 1024 characters and latency from far away datacenters. He thinks most queries will be easy to convert as they are predominantly hash like lookups anyway. Update: Scoble asks if SimpleDB kills MySQL, Oracle, et al. The answer is no. Google has a similar service internally and they are still major users of and contributors to MySQL. Sometimes you just need structured data. So RDBMSs aren't dead. They just may not be the starting point as the barrier to entry for doing the simplest thing to start a website has plummeted. No more setup or admin. Just code and go. The cherry missing from Amazon's AWS hot fudge sundae was a database service. They had a CPU scoop with EC2, they had storage scoop with S3, they had a work distribution scoop with their queue, but the database cherry was missing. Now they've added it and it's dessert time. News of SimpleDB is everywhere. Apparently it's been in development for a while. You can read about it inside looking out, GIGAOM, Innowave, SimpleDB Developer's Guide, and the SimpleDB Home Page. It seems to be a simple properties like store implemented on Erlang (as is CouchDB). It has simple query capabilities on attributes. It's fast and scalable. And At $0.14 per hour it's quite competitive with other options. What it doesn't have is a text search or complex RDBMS style queries for structured data. It's not clear if the data are geographically distributed, in case you are interested in fast response times from different parts of the world. I would be very curious on the relationship between SimpleDB and Dynamo. Even with these limitations it's a disruptive service. Most high speed websites use a property store for unstructured data and that's been hard for smaller groups to implement at scale. But if you're losing your mind trying to figure out how to store your data at scale, maybe you can now turn your attention to more productive problems.
Marton Trencseni has collected a wonderful list of different papers on distributed systems. He's organized them into the following sections: The Google Papers, Distributed Filesystems, Non-relational Distributed Databases, The Lamport Papers, and Implementation Issues. Many old favorites on the list and some that are likely new to you. My new favorite is "Frangipani: A Scalable Distributed File System." How can you not love "Frangipani" as a word?
"Online games and virtual worlds have familiar scaling requirements, but don’t be fooled: everything you know is wrong." Jim Waldo, Sun Microsystems Laboratories * The computational environment for online games or virtual worlds is close to the exact inverse of that found in most markets serviced by the high-tech industry. * The need for a heavyweight client is, in part, an outcome of the evolution of these games. * Latency is the enemy of fun—and therefore the enemy of online games and virtual worlds. * The game server is used both to discourage cheating (by making it much more difficult) and to detect cheating (by seeing patterns of divergence between the game state reported by the client and the game state held by the server). Peer-to-peer technologies might seem a natural fit for the first role of the game server, but this second role means that few if any games or worlds trust their peers enough to avoid the server component. * Using multiple servers is a basic mechanism for scaling the server component of a game to the levels that are being seen in the online world today. * Having multiple servers means that part of building the game is deciding how to partition the load over these servers. The first technique is to exploit the geography of the game or world. The second technique is known as sharding. * While shards allow scale, they do so at the price of player interaction. * The problem is that the culture that has grown up around games and virtual worlds is not one that understands or is overly familiar with the programming techniques that are required to exploit the parallelism inherent in these systems. * It is for these reasons that we started Project Darkstar (http://www.projectdarkstar.com), a research effort attempting to build a server-side infrastructure that will exploit the multithreaded, multicore chips being produced and scaled over a large group of machines, while presenting the programmer with the illusion that he or she is developing in a single-threaded, single-machine environment. *The model is a simple event-based one in which input from the client is received by the server, which then sets off a task in response to that event. * This mechanism for concurrency control does require that all tasks access all of their data through the Darkstar data service. Our current implementation uses the Berkeley Database. we believe that we can keep the penalty for accessing through a data service small by caching data in intelligent ways. We also believe that by using the inherent parallelism in these games, we can increase the overall performance of the game as the number of players increases, even if there is a small penalty for individual data access. * We found that additional machines lowered the capacity of the overall system. We are working on removing the choke points so that adding equipment actually adds capacity.
Informative and well organized post on caching. Talks about: Why do we need cache?, What is Cache?, Cache Hit, Cache Miss, Storage Cost, Retrieval Cost, Invalidation, Replacement Policy, Optimal Replacement Policy, Caching Algorithms, Least Frequently Used (LFU), Least Recently Used (LRU), Least Recently Used 2(LRU2), Two Queues, Adaptive Replacement Cache (ACR), Most Recently Used (MRU), First in First out (FIFO), Distributed caching, Measuring Cache.
We started with a small site, a mess of open source, and a small team that didn't know much about scaling.
We ended with a large site, a medium sized team, and an architecture that has scaled.
We never stopped. We used a roadmap and a compass, made weekly changes in direction, regularly shipped code on Wednesday to handle the next weekend's capacity constraints, and shipped new features the whole time.
These are excerpts from the IMVU PDF presentation of their architecture which can be viewed or downloaded here.
IMVU is an online destination where adults and teens meet new people in 3D. IMVU won the 2008 Virtual Worlds Innovation Award and was also named a Rising Star in the 2008 Silicon Valley Technology Fast 50 program.
The most important aspect of a scalable web architecture is data partitioning. Most components in a modern data center are completely stateless, meaning they just do batches of work that is handed to them, but don't store any data long-term. This is true of most web application servers, caches like memcached, and all of the network infrastructure that connects them. Data storage is becoming a specialized function, delegated most often to relational databases. This makes sense, because stateless servers are easiest to scale - you just keep adding more. Since they don't store anything, failures are easy to handle too - just take it out of rotation.
Stateful servers require more careful attention. If you are storing all of your data in a relational database, and the load on that database exceeds its capacity, there is no automatic solution that allows you to simply add more hardware and scale up. (One day, there will be, but that's for another post). In the meantime, most websites are building their own scalable clusters using sharding.
Read more on LessonLearned blog.