Jesse Robbins at O'Reily Radar has a nice post on how spending a little up front time on figuring out how to scale your operations process saves money on ops people and allows you to save time adding and upgrading servers. Adding, monitoring, and upgrading servers can get so incredibly screwed up that a herd of squirrels has to work overtime just to put out a release. Or it can be one button simple from your automated build system out to your servers. This is one area where "do the simplest thing that could possibly work" is a dumb idea and Jesse does a good job capturing the advantages of doing it right.
One of the premier scaling strategies is always: get someone else to do the work for you. But unlike Huckleberry Finn in Tom Sawyer, you won't have to trick anyone into whitewashing a fence for you. Times have changed. Companies like Ning, Facebook, and Salesforce are more than happy to help. Their price: lock-in. Previously you had few options when building a "real" website. You needed to do everything yourself. Infrastructure and application were all yours. Then companies stepped in by commoditizing parts of the infrastructure, but the application was still yours. The next step is full on Borg take no prisoners assimilation where the infrastructure and application are built as one collective. What you have to decide as someone faced with building a scalable website is if these new options are worth the price. Feeding this explosion of choice is one of the new strategy games on the intertubes: the Internet Platform Game. Ning's Marc Andreessen defines a platform as: a system that can be programmed and therefore customized by outside developers -- users -- and in that way, adapted to countless needs and niches that the platform's original developers could not have possibly contemplated, much less had time to accommodate. The idea is you'll win great rewards in exchange for coding to someone else's internet platform. From Ning you'll win a featureful and customizable social networking platform that they are completely responsible for scaling. The cost ranges from free to very reasonable. From Facebook you'll win prime space on the profile page of over 40 million virally infected customers. It's free, but you must make your application scalable enough to handle all those millions. By coding to the Salesforce platform you'll win the same infrastructure that executes 100 million Salesforce transactions a day. The cost of their service is unknown at this time.
The Three Levels of Internet PlatformsMr. Andreessen then went a step further and defined a three level platform categorization scheme:
Why Use an Access API?Using open APIs to access services is what has made the internet great. APIs provide the most flexibility at the greatest cost. You get access to a huge number of wonderful services for virtually nothing. The linkage between website is a relatively simple API and a data definition. You can do anything you want, but you have to build the infrastructure to do it. Yet that's a lot better than building your own map service, your own SMS service, or your own photo sharing service. Yet there's still so much work to do. Grid services make the job easier, but the level of expertise it takes to create a scalable site is still very high.
Why Use a Plug-In?Since Facebook is the only internet company in this category the answer is clear why you want to be a Facebook plug-in: to get access to a lot of users, connected by an exploitable social graph, for the purpose of exponentionally propagating your application along the graph. Most would be ecstatic to get to hundreds of thousands of regular users on their own standalone site. With Facebook that's very possible. The reward is great, but the costs are great too. Your application must be something that can be deconstructed onto Facebook. I don't see gmail making it as a Facebook app. You must subject yourself to a lot of restrictions to use the Facebook infrastructure. You must trust yourself to a poorly documented system in which it is hard to get anything done. And to top it off:Facebook does not host your application. This really blew me away when I first heard about it. When someone says they are offering a platform my immediate assumption is they are hosting your application. That's what a platform is, isn't it? But your application must run on your own hardware. Imagine going from 0 to millions of users in the space of a few days. How would you handle that? Well that's exactly the problem ILike (a popular music sharing site) had when they released their Facebook app. Mr. Andreessen gives a wonderful if somewhat self-serving account of ILike's troubles with viral growth. After launching they posted this on their blog: In our first 20 hours of opening doors we had 50,000 users sign up, and it is only accelerating. (10,000 users joined in the first 12 hrs. 10,000 more users in the next 3 hrs. 30,000 more users in the next 5 hrs!!) We started the system not knowing what to expect, with only 2 servers, but ready with backup. Facebook's rabid userbase chewed up our 2 servers almost instantly. We doubled our capacity to catch up. And then we doubled it again. And again. And again. Oh crap - we ran out of servers!! Although iLike.com has a very healthy level of Web traffic, and even though about half of all the servers in our datacenter were sitting unused, idle, as backup capacity, we are now completely maxed out. We just emailed everybody we know across over a dozen Bay Area startups, corporations, and venture firms in a desperate plea to find spare servers so we can triple our capacity for the continued onslaught. Tomorrow we are picking up over 100 servers from different companies to have them installed just to handle the weekend's traffic. (For those who responded to our late night pleas, thank you!) ILike says they now have over 3 million Facebook users and are growing at an astonishing rate of 300,000 users per day. That number of users and growth rate will make almost anyone salivate. Yet how many can afford the hundreds and hundreds of servers it would take to handle all those users, especially if you have an unclear monetization strategy? Which brings us to Deep Hosting and Mr. Andreessen's end game for the internet's evolution.
Why Use Deep Hosting?The trouble with handling application growth under Facebook's large user base has an obvious solution: host your application on their infrastructure. This is exactly what Mr. Andreessen has done with Ning. Out of the Ning box you get an exceptionally functional social networking package. So functional in fact it makes almost anyone think "do I really need to reinvent all this stuff when they've already done it? Can't I just tweak a few things and make it my own?" And that's exactly what Ning wants to hear. They've made it so you can completely rebrand their software, add your own features using normal programming tools, yet still host your application on their platform, on their servers, in their datacenter. So you don't have to worry about scaling. Its Ning's job to scale the database, back it up, manage the infrastructure, add servers, and do all the other nasty bits that keep so many people away from deploying successful websites. So the temptation is clear. Go with Ning and you immediately get a cool system that will scale and that you can still program if you feel the need. But with all that power comes a price, as usual. You are locked inside a gilded cage. If your application slows down there's not much you can do about it. I found their documentation better than Facebook's, but not very useful for someone looking to get going quickly and that makes me very nervous when adopting a platform. Yet when they add features, as they frequently do, your app gets them for free. You see some of the same effects here that all Google apps get when the Google stack is improved. And not having to worry about scalability is very attractive, especially at such a reasonable cost.
Problems with Deep HostingMr. Andreessen thinks that "in the long run, all credible large-scale Internet companies will provide Level 3 platforms." There are three problems with this argument.
What does this mean for you?I've found it difficult to reconcile all the different pros and cons of each approach. There is a definite value in all these alternatives. If you have a vision for an application then building it yourself is the only way you'll achieve that vision. So do it yourself. But what good is a vision without users? So go Facebook. But I could get something going very quickly in Ning and the expand overtime with much less hassle, even if it's not exactly what I want. So go Ning. What to do? The point of this post isn't to come to a conclusion. The point has been to cover some new and different approaches to scalability so you can spend a few sleepless nights pondering your options too :-)
pNFS (parallel NFS) is the next generation of NFS and its main claim to fame is that it's clustered, which "enables clients to directly access file data spread over multiple storage servers in parallel. As a result, each client can leverage the full aggregate bandwidth of a clustered storage service at the granularity of an individual file." About pNFS StorageMojo says: pNFS is going to commoditize parallel data access. In 5 years we won’t know how we got along without it. Something to watch.
Update 2: 3tera has added Dynamic Appliances, which are "packaged data center operations like backup, migration or SLAs that users can add to their applications to provide functionality." Update: in an effort to help cross the chasm of how start building a website using their grid OS, 3tera is offering their Assured Success Plan. The idea is to provide training, consulting, and support so you can get started with some confidence you'll end up succeeding. If you are starting or extending a website you have a problem: what technologies should you use? Now there are more answers to that question than ever. One new and refreshingly innovative answer is 3tera's grid OS. In this podcast interview with Bert Armijo from 3tera, we'll learn how 3tera wants to change how you build websites. How? By transforming the physical into the virtual and then allowing the virtual to be manipulated as if it were real. Could I possibly be more abstract? Not really. But when I think of what they are doing that's the mental model I see whirling around in my mind. Don't worry, I promise we'll drill down to how it can help you in the real world. Let's see how. I think of 3tera's product as like staying at a nice hotel. At home you are in charge. If something needs doing you must do it. If something breaks you must fix it. But at a nice hotel everything just happens for you. Your room is cleaned, beds are made, outrageously expensive candy bars are replaced in the mini-bar, food arrives when you order it and plates disappear when you are done, and the courtesy mint is placed just so on your pillow. You are free to simply enjoy your stay. All the other details of living just happen. That's the same sort of experience 3tera is trying to provide for your website. You can concentrate on your application and 3tera, through their GUI on the front-end and their AppLogic grid operating system on the back-end, worries about all the housekeeping. I think Bert summed up their goal wonderfully when said their aim is to:
Get peoples hands off physical boxes and to give them a way to define complex infrastructures in a reusable way that they can then instantiate, trade, sell, or replicate, backup up and manage as individual units. This is what AppLogic that does incredibly way.What they are doing is taking hard physical resources like CPU and storage and decoupling them from their physical sources so you can just order and use them on demand without worrying how its done under the covers. This is trend that has been happening for a while, but their grid OS takes that process to the next level. Your physical co-lo cage is now a private virtual data center. Physical boxes, once lovingly spec'ed, bought, and installed are now allocated on demand from a phalanx of preconfigured and separately maintained servers. Physical storage, once lovingly pieced together from disks, controllers, and networks is now allocated from a vast unending sea of virtual storage. Physical load balancers are now programs you can create. What this means for you is you can take a website architecture you've draw up on your white board and simply and quickly create it in a data center. Its all configurable from a GUI. You can bring on 10 new web servers with a simple drag and drop operation. It's basically your white board diagram come to life, only you get to skip all the nasty implementation bits. In the virtual world the nasty non application related implementation bits are someone else's problem. 3tera's value proposition pretty easy to understand:
Podcast Notes I know what you are probably saying. You are saying: "But Todd, the podcast is over an hour long, couldn't you have please made it longer? I have nothing else to do today and I need to waste more time!" What can I say, Bert was very knowledgeable and helpful, and this is a new model for building scalable websites so I was trying to figure out how I could physically make a website using their product. That takes a lot of questions. I am happy with the result though. I think I have a good picture of how their system works and I think its well worth investigating if you are in the market for creating or expanding a website. Here are some notes taken from the podcast.
Some Observations and Conclusions
Related Sites and Articles
Robert Stewart shared this useful Ajax related scalability strategy: We avoided XMLHttpRequests for individual keystrokes, choosing to go back to the server only when a field lost focus. Google can afford all the servers to handle the load for that, but we didn't want to. Do you have a scalability strategy to share? Then share it!.
I've been using GWT for an application and I get the same feeling using it that I first got using html. I've always sucked at building UIs. Starting with programming HP terminals, moving on to the Apple Lisa, then X Windows, and Microsoft Windows, I just never had IT, whatever IT is. On the Beauty and the Geek scale my interfaces are definitely horned-rimmed and pocket protector friendly. Html helped free me from all that to just build stuff that worked, but didn't have to look all that great. Expectations were pretty low and I eagerly fulfilled them. With Ajax expectations have risen again and I find myself once more easily identifiable as a styless geek. Using GWT I have some hopes I can suck a little less. In working with GWT I was so focussed on its tasty easily digestible Ajaxy goodness, I didn't stop to think about the topic of this site: scalability. When I finally brought my distracted mind around to consider the scalability of the single page webs site I was building, I became a bit concerned. Many of the strategies that are typically used to achieve scalability don't seem to apply in single page land. Here are the issues I see. Maybe you can tell me where I am off in my analysis?
Hello,everybody,I'm plant to building a new website like 2008.sina.com.cn 2008.sohu.com .site contents have pic news,text news,and video news.user blog ....now I have a question to ask everybody,I hope can get usefully information to here. status: 100,000,000 people /per day 50,000 people /peak time more than 200 servers OpenBSD/Opensuse Apache Fast CGI modules lighttpd for picture Mysql varnish LVS lucene search do you have a good idea to it?thans for everybody!
I have 3 exp in building website using java.I work on only single server.Website is not very scalable.I always wonder how ebay,youtube,monster handle traffic, giving responses within seconds.From the google i find this site and i hope i can also able to build very scalable website .I need guidelines from where to start ,what are the things we needed.I know that scalability comes through the use of distributed applications but don't how to implement it. I see many website build in languages other than java so java is good choice for building high scalable website. Thanks
Complex applications coordinating work across a lot of machines often need a highly performing fault tolerant message layer. Though a blast to write, it's probably a better use of your time to use an off the shelf solution. And that's where Spread comes in. Flickr, for example, uses Spread to create real-time event feeds from their web server logs. What exactly is Spread? From the Spread website:
Spread is an open source toolkit that provides a high performance messaging service that is resilient to faults across local and wide area networks. Spread functions as a unified message bus for distributed applications, and provides highly tuned application-level multicast, group communication, and point to point support. Spread services range from reliable messaging to fully ordered messages with delivery guarantees. Spread can be used in many distributed applications that require high reliability, high performance, and robust communication among various subsets of members. The toolkit is designed to encapsulate the challenging aspects of asynchronous networks and enable the construction of reliable and scalable distributed applications. Some of the services and benefits provided by Spread:In Building Scalable Web Sites Cal Henderson describes how Flickr uses Spread to create a log of real-time events, like photos uploaded and discussions started, as they happen. Spread is connected to their web servers. As photos are uploaded these web server events are messaged in real-time to agents consuming the feed. The advantage of this architecture is it sheds load away from the database. Otherwise the database would have to be continuously polled for new events by each agent.
Reliable and scalable messaging and group communication. A very powerful but simple API simplifies the construction of distributed architectures. Easy to use, deploy and maintain. Highly scalable from one local area network to complex wide area networks. Supports thousands of groups with different sets of members. Enables message reliability in the presence of machine failures, process crashes and recoveries, and network partitions and merges. Provides a range of reliability, ordering and stability guarantees for messages. Emphasis on robustness and high performance. Completely distributed algorithms with no central point of failure.