In the "Architectures You've Always Wondered About" track at the QCon conference, Second Life, eBay, Yahoo, LinkedIn and Orbitz presented how they dealt with different aspects of their applications, such as scalability. I learned quite a few lessons that day that I thought were worth sharing. The details are provided below: Lessons from Yahoo, eBay, Orbitz, LinkedIn architecture
Dryad is Microsoft's answer to Google's map-reduce. What's the question? How do you process really large amounts of data? My initial impression of Dryad is that it's like a giant Unix command line filter on steroids. There are lots of inputs, outputs, tees, queues, and merge sorts all connected together by a master exec program. What else does Dryad have to offer the scalable infrastructure wars? Dryad models programs as the execution of a directed acyclic graph. Each vertex is a program and edges are typed communication channels (files, TCP pipes, and shared memory channels within a process). Map-reduce uses a different model. It's more like a large distributed sort where the programmer defines functions for mapping, partitioning, and reducing. Each approach seems to borrow from the spirit of its creating organization. The graph approach seems a bit too complicated and map-reduce seems a bit too simple. How ironic, in the Alanis Morissette sense. Dryad is a middleware layer that executes graphs for you, automatically taking care of scheduling, distribution, and fault tolerance. It's written in C++, but apparently few write directly to this layer; most people use higher-level interfaces. A Job Manager runs the program. It's a library you link in and it loads and executes the graph. A daemon runs on each machine to run jobs. A name server provides access to cluster resources. The DAG is a multigraph, so you can have multiple edges between vertices. A DAG was chosen because it's not too cold, or too hot, the porridge is just right. Cycles are too hard. Simpler isn't as useful. DAGs support relational algebra and can split multiple inputs and outputs nicely. One interesting aspect is that a channel is a sequence of structured items that are C++ objects. This means pointers can be passed directly over shared-memory channels, so you don't have to worry about serialization overhead there. No restrictions are put on the data model. Graphs are dynamically changeable at runtime, which allows for a lot of optimizations.
Several case studies were provided. It's probably just me, but I didn't really understand what was going on. Google's example is much better. Everyone can relate to counting words in a document. My thought while watching was that the graph stuff sounds cool and general, but it's hard to map it efficiently to solutions when the problems have large numbers of inputs. You have to manually optimize for available RAM and CPUs. The system should do all this work for you. But the graph approach is powerful. The programmer provides the bits of atomic behaviour and the system can then try various optimizations. The code doesn't have to change because the graph can be manipulated abstractly on its own. So you can write something like a SQL query. Then something like a query planner figures out how to execute the query on Dryad.
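The graph model described above (vertices as programs, edges as channels, a job manager that runs the graph) can be sketched in a few lines of Java. This is only a toy under stated assumptions, not Dryad's API: the names DagSketch, addVertex, addEdge and run are hypothetical, everything runs in one process on one thread, and channels are modeled as in-memory lists. Vertices are executed in topological order using Kahn's algorithm.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical toy model of a dataflow graph; not Dryad's actual API.
final class DagSketch {
    // Each vertex is a small "program" mapping its input items to output items.
    private final Map<String, Function<List<String>, List<String>>> programs = new LinkedHashMap<>();
    // Edges stand in for channels; a list allows several edges between the same vertices (multigraph).
    private final Map<String, List<String>> edges = new LinkedHashMap<>();
    private final Map<String, Integer> inDegree = new LinkedHashMap<>();

    void addVertex(String name, Function<List<String>, List<String>> program) {
        programs.put(name, program);
        edges.put(name, new ArrayList<>());
        inDegree.put(name, 0);
    }

    // Both vertices must have been added before they are connected.
    void addEdge(String from, String to) {
        edges.get(from).add(to);
        inDegree.merge(to, 1, Integer::sum);
    }

    // Runs vertices in topological order (Kahn's algorithm) and returns each vertex's output.
    Map<String, List<String>> run(Map<String, List<String>> sourceInputs) {
        Map<String, List<String>> inbox = new LinkedHashMap<>();
        Map<String, Integer> remaining = new LinkedHashMap<>(inDegree);
        Deque<String> ready = new ArrayDeque<>();
        for (String v : programs.keySet()) {
            inbox.put(v, new ArrayList<>(sourceInputs.getOrDefault(v, List.of())));
            if (remaining.get(v) == 0) {
                ready.add(v);
            }
        }
        Map<String, List<String>> outputs = new LinkedHashMap<>();
        while (!ready.isEmpty()) {
            String v = ready.remove();
            List<String> out = programs.get(v).apply(inbox.get(v));
            outputs.put(v, out);
            for (String next : edges.get(v)) {
                inbox.get(next).addAll(out); // "send" the items down the channel
                if (remaining.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (outputs.size() != programs.size()) {
            throw new IllegalStateException("graph contains a cycle");
        }
        return outputs;
    }
}
```

Because the graph is data rather than code, a planner could reorder or merge vertices before calling run without touching the vertex programs, which is the point made above about manipulating the graph abstractly.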
Update: Flickr hits 2 billion photos served. That's a lot of hamburgers. Flickr is both my favorite bird and the web's leading photo sharing site. Flickr has an amazing challenge: they must handle a vast sea of ever expanding new content, ever increasing legions of users, and a constant stream of new features, all while providing excellent performance. How do they do it? Site: http://www.flickr.com/
Hey, this scaling stuff might just be important. Jim Scheinman, former Bebo and Friendster exec, blames Friendster's inability to scale for its loss in the social networking race: VB: Can you tell me a bit about what you learned in your time at Friendster? JS: For me, it basically came down to failed execution on the technology side: we had millions of Friendster members begging us to get the site working faster so they could log in and spend hours social networking with their friends. I remember coming in to the office for months reading thousands of customer service emails telling us that if we didn’t get our site working better soon, they’d be ‘forced to join’ a new social networking site that had just launched called MySpace…the rest is history. To be fair to Friendster’s technology team at the time, they were on the forefront of many new scaling and database issues that web sites simply hadn’t had to deal with prior to Friendster. As is often the case, the early pioneer made critical mistakes that enabled later entrants to the market, MySpace, Facebook & Bebo to learn and excel. As a postscript to the story, it’s interesting to note that Kent Lindstrom (CEO of Friendster) and the rest of the team have done an outstanding job righting that ship. Hopefully, with all the quality information out now on the intertubes, visionaries can concentrate on making good stuff instead of always fighting the plumbing. When you think about it, is there any industry or group that gives so much value away for free as the software community? I don't think so. We are an amazingly giving group and the world has benefited greatly from that impulse. A thought for Thanksgiving.
Practically no software project nowadays can survive without a database (DBMS) backend storing all the business data that is vital to you and/or your customers. When projects grow larger, the amount of data often grows much faster. So you start moving the DBMS to a separate server to gain more speed and capacity. That is all good and healthy, but you do not gain any extra safety for this business data. You might be backing up your database once a day so that if the database server crashes you don't lose EVERYTHING, but how much can you really afford to lose? Clearly this depends on what kind of data you are storing. In our case the users of our solutions use our software products to do their everyday (all day) work. They have "everything" they need for their business stored in the database we are providing. So is 24 hours of data loss acceptable? No, not really. One hour? Maybe. But what we really want is a second database running with the EXACT same data. We mostly use PostgreSQL, which does not have built-in database replication. There are solutions based on triggers that replicate the data from one database to another. We have learned that setting all this up on an existing database with plenty of tables is rather complicated, and changing the database structure afterwards cannot be done with simple create/alter statements anymore. And since we ARE running solutions that constantly change and improve, we need to be able to deploy updates including database structure changes quickly and easily. So what we really wanted was a transparent JDBC layer that does the replication for us. We tested a great solution called "Sequoia", but it is a rather heavyweight product with a lot of features that did not really help in the performance department and that we didn't need anyway. What we needed was:
- a JDBC driver so the application does not know anything about the replication
- of course: transactional safety for write operations
- load-balanced reads (we are running 2 database servers, so why waste the ability to do parallel reads from 2 servers and almost double the read performance?)
- for backups: the ability to detach one server, do the backup on that machine and then reattach the server
- automatic and transparent failover / failsafe
- Fast In-VM-Replication - no serialisation
- Easy integration
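Several of the requirements in the list above (writes applied to every server, load-balanced reads, detaching a server for backup) can be illustrated with a small routing class. This is a minimal sketch under stated assumptions, not the layer the author built and not a JDBC driver: ReplicatingRouter and its methods are hypothetical names, the type parameter B stands in for a connection to one database server, and transactional safety across servers as well as catching up a reattached server are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical routing layer; B stands in for a connection to one database server.
final class ReplicatingRouter<B> {
    private final List<B> attached = new CopyOnWriteArrayList<>();
    private final AtomicInteger nextRead = new AtomicInteger();

    ReplicatingRouter(List<B> backends) {
        attached.addAll(backends);
    }

    // Writes go to every attached server so that all copies stay identical.
    void write(Consumer<B> statement) {
        for (B backend : attached) {
            statement.accept(backend);
        }
    }

    // Reads rotate across the attached servers (round-robin load balancing).
    <R> R read(Function<B, R> query) {
        List<B> snapshot = new ArrayList<>(attached);
        if (snapshot.isEmpty()) {
            throw new IllegalStateException("no database server attached");
        }
        int index = Math.floorMod(nextRead.getAndIncrement(), snapshot.size());
        return query.apply(snapshot.get(index));
    }

    // Take a server out of rotation, for example to run a backup on it.
    void detach(B backend) {
        attached.remove(backend);
    }

    // Put a server back; a real layer must first replay the writes it missed.
    void attach(B backend) {
        attached.add(backend);
    }
}
```

A real implementation would sit behind the JDBC interfaces so the application stays unaware of it, and would need a commit protocol so that a write either succeeds on all attached servers or on none.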
Michael Nygard talks about Two Ways To Boost Your Flagging Web Site. The idea behind cache farms is to move memory devoted to the various caching layers into one large farm of caches, as with memcached. The idea behind read pools is to allocate your database read requests to a pool of dedicated read servers, thus offloading the write server. Using a combination of the strategies you aren't forced to scale up the database tier to scale your website.
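How the two strategies combine can be shown with a cache-aside wrapper: reads check the cache first and only fall through to a read server on a miss, while writes go to the write server and invalidate the cached entry. This is a sketch under stated assumptions rather than anything from the article: CachedReads is a hypothetical name, a ConcurrentHashMap stands in for the cache farm, and the two function parameters stand in for the read pool and the write server.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiConsumer;
import java.util.function.Function;

// Hypothetical cache-aside wrapper; the map stands in for a cache farm such as memcached.
final class CachedReads {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> readPool;      // query sent to a read server
    private final BiConsumer<String, String> writeServer; // update sent to the write server

    CachedReads(Function<String, String> readPool, BiConsumer<String, String> writeServer) {
        this.readPool = readPool;
        this.writeServer = writeServer;
    }

    String get(String key) {
        // computeIfAbsent only calls the read pool on a cache miss.
        return cache.computeIfAbsent(key, readPool);
    }

    void put(String key, String value) {
        writeServer.accept(key, value);
        cache.remove(key); // invalidate so the next read fetches the new value
    }
}
```

One caveat with this design: if the read servers lag behind the write server, the read that follows an invalidation can repopulate the cache with a stale value, so a real deployment needs an expiry time or some other way to bound staleness.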
Slashdot effect: overwhelming unprepared sites with an avalanche of readers' clicks after being mentioned on Slashdot. Sure, we now have the "Digg effect" and other hot new stars, but Slashdot was the original. And like many stars from generations past, Slashdot plays the elder statesman's role with class, dignity, and restraint. Yet with millions and millions of users Slashdot is still box office gold and more than keeps up with the young'uns. And with age comes the wisdom of learning how to handle all those users. Just how does Slashdot scale and what can you learn by going old school? Site: http://slashdot.org
The Hardware Architecture
The Software Architecture
Paper: Container-based Operating System Virtualization: A Scalable, High-performance Alternative to Hypervisors
One stumbling block of the great march towards virtualization is the relatively poor performance of resource-hungry applications like databases. We are told to develop and test using VMs, but deploy without them. Which kind of sucks IMHO. Maybe better virtualization technology can remove this split. This paper talks about a different approach to virtualization called "container-based" virtualization that can reportedly double the performance of traditional hypervisor systems like Xen. It does this by trading isolation for efficiency. Rather than maintaining complete isolation between VMs, the container approach shares resources between VMs and thus gives higher performance while still guaranteeing strong fault, resource, and security isolation. It's yet another battle in computing's endless war of creating and destroying abstraction layers. I learned a lot from this paper because of how it compared and contrasted traditional hypervisor and container-based virtualization strategies. Good job.
Hi, I would like feedback on an ID generator I just made. What positive and negative effects do you see with this? It's programmed in Java, but could just as easily be programmed in any other typical language. It's thread safe and does not use any synchronization. When testing it on my laptop, I was able to generate 10 million IDs within about 15 seconds, so it should be more than fast enough. Take a look at the attachment (had to rename it from IdGen.java to IdGen.txt to attach it): IdGen.java
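The attachment is not reproduced here, so the following is not the poster's IdGen.java. It is one common way to get a thread-safe generator without the synchronized keyword, using an atomic counter, offered as a point of comparison. The class name AtomicIdGenerator, the nodeId parameter and the 10-bit node field are all assumptions made for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical lock-free ID generator; not the IdGen.java attached to the post.
final class AtomicIdGenerator {
    private static final int NODE_BITS = 10; // assumed: up to 1024 generator instances
    private final long nodeId;
    private final AtomicLong sequence = new AtomicLong();

    AtomicIdGenerator(long nodeId) {
        if (nodeId < 0 || nodeId >= (1L << NODE_BITS)) {
            throw new IllegalArgumentException("nodeId out of range");
        }
        this.nodeId = nodeId;
    }

    // Thread safe without synchronized blocks: the increment is a single atomic operation.
    long nextId() {
        return (sequence.incrementAndGet() << NODE_BITS) | nodeId;
    }
}
```

The low bits carry the node number, so generators on different machines never collide as long as each gets a distinct nodeId. The trade-off is that the sequence restarts from zero when the process restarts, so a real generator would need to persist the counter or mix in a timestamp.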