Skype's 220 millions users lost service for a stunning two days. The primary cause for Skype's nightmare (can you imagine the beeper storm that went off?) was a massive global roll-out of a Window's patch triggering the simultaneous reboot of millions of machines across the globe. The secondary cause was a bug in Skype's software that prevented "self-healing" in the face of such attacks. The flood of log-in requests and a lack of "peer-to-peer resources" melted their system. Who's fault is it? Is Skype to blame? Is Microsoft to blame? Or is the peer-to-peer model itself fundamentally flawed in some way? Let's be real, how could Skype possibly test booting 220 million servers over a random configuration of resources? Answer: they can't. Yes, it's Skype's responsibility, but they are in a bit of a pickle on this one. The boot scenario is one of the most basic and one of the most difficult scalability scenarios to plan for and test. You can't simulate the viciousness of real-life conditions in a lab because only real-life has the variety of configurations and the massive resources needed to simulate itself. It's like simulating the universe. How do you simulate the universe if the computational matrix you need is the universe itself? You can't. You end up building smaller models and those models sometimes fail. I worked at set-top company for a while and our big boot scenario was the restart of entire neighbor hoods after a power failure. To make an easy upgrade path, each set-top downloaded their image from the head-end on boot, only a boot image was in EEPROM. This is a very stressful scenario for the system. How do you test it? How do you test thousands of booting set-tops when they don't even exist yet? How do you test the network characteristics of a cable system in the lab? How do you design a system not to croak under the load? Cleverness. One part of the solution was really cool. The boot images were continually broadcast over the network so each set-top would pick up blocks of the boot image. The image would be stitched together from blocks rather than having thousands of boxes individually download images, which would never work. This massively reduced the traffic over the network. Clever tricks like this can get you a long ways. Work. Great pools of workstations were used simulate set-tops and software was made to insert drops and simulate asymmetric network communications. But how could we ever simulate 220 million different users? Then, no way. Maybe now you could use grid services like Amazon's EC2. Help from your friends. Microsoft is not being a good neighbor. They should roll out updates at a much more gradual rate so these problems don't happen. Booting loads networks, taxes CPUs, fills queues, drops connections, stresses services, increases process switching, drops packets, encourages dead lock, steals RAM and file descriptors and other resources. So it would be nice if MS was smarter about their updates. But since you can't rely on such consideration, you always have to handle the load. I assume they used exponential backoff algorithms to limit login attempts, but with so many people this probably didn't matter. Perhaps they could insert a random wait to smooth out login traffic. But again, with so many people it probably won't matter. Perhaps they could stop automatic logins on boot? That would solve the problem at the expense of user convenience. No go. Perhaps their servers could be tuned to accept connections at a fast rate yet condition how quickly they respond to the rest of the login process? Not good enough I suppose. So how did Skype fix their problem? They explain it here : The parameters of the P2P network have been tuned to be smarter about how similar situations should be handled. Once we found the algorithmic fix to ensure continued operation in the face of high numbers of client reboots, the efforts focused squarely on stabilizing the P2P core. The fix means that we’ve tuned Skype’s P2P core so that it can cope with simultaneous P2P network load and core size changes similar to those that occurred on August 16. Whenever I see the word "tune" I get the premonition shivers. Tuning means you are just one unexpected problem away from being out of tune and your perfectly functioning symphony sounding like a band of percussion happy monkeys. Tuned things break under change. Tweak the cosmological constant just a little and wham, there's no human life. It needs to work by design. Or it needs to be self-adaptive and not finessed by human hands for each new disaster scenario. And this is where we get into the nature of P2P. Would the same problem have happened in a centralized architecture with resources spread strategically throughout the globe and automatic load balancing between different data centers? In a centralized model would it have been easier to bring more resources on line to handle the load? Would the outage have been easier to diagnose and last a much shorter amount of time? There are of course no definitive answers to these questions. But many of the web's most successful systems like YouTube, Amazon, Ebay, Google, GoogleTalk, and Flickr use a centralized model. They handle millions of users and massive amounts of content and have pretty good reliability records. Does P2P bring enough to the architecture that you should build a system around it? That to me is the interesting question that arises out of this incident.
An online google guide,tools and Utilities to use with the Google search engine.Free google tools and utilities.
Varnish is a state-of-the-art, high-performance HTTP accelerator. Varnish is targeted primarily at the FreeBSD 6 and Linux 2.6 platforms, and will take full advantage of the virtual memory system and advanced I/O features offered by these operating systems. Varnish was written from the ground up to be a high performance caching reverse proxy. Squid is a forward proxy that can be configured as a reverse proxy. Besides - Squid is rather old and designed like computer programs where supposed to be designed in 1980. Varnish is reported to be 10x-20x faster than Squid on the same hardware.
I was looking at the pingdom infrastructure matrix (http://royal.pingdom.com/royalfiles/0702_infrastructure_matrix.pdf) and I saw that no sites are using Postgresql, and then I searched through highscalability.com and saw very few mentions of postgresql. Are there any examples of high-traffic sites that use postgresql? Does anyone have any experience with it? I'm having trouble finding good, recent studies of postgres (and postgres compared w/ mysql) online.
Hi, Some of the articles of the site claims profiling is essential. Is there any established approach to profiling WEB apps? Or it too much depends on technologies used?
Wikimedia is the platform on which Wikipedia, Wiktionary, and the other seven wiki dwarfs are built on. This document is just excellent for the student trying to scale the heights of giant websites. It is full of details and innovative ideas that have been proven on some of the most used websites on the internet. Site: http://wikimedia.org/
Amazingly TechCrunch runs their website on one web server and one database server, according to the fascinating survey What the Web’s most popular sites are running on by Pingdom, a provider of uptime and response time monitoring. Early we learned PlentyOfFish catches and releases many millions of hits a day on just 1 web server and three database servers. Google runs a Dalek army full of servers. YouSendIt, a company making it easy to send and receive large files, has 24 web servers, 3 database servers, 170 storage servers, and a few miscellaneous servers. Vimeo, a video sharing company, has 100 servers for streaming video, 4 web servers, and 2 database servers. Meebo, an AJAX based instant messaging company, uses 40 servers to handle messaging, over 40 web servers, and 10 servers for forums, jabber, testing, and so on. FeedBurner, a news feed management company, has 70 web servers, 15 database servers, and 10 miscellaneous servers. Now multiply FeedBurner's server count by two because they maintain two geographically separate sites, in an active-passive configuration, for high availability purposes. How many servers will you need and how can you trick yourself into using fewer?
Find Someone Like You and Base Your Resource Estimates Off ThemWe see quite a disparity in the number of servers needed for popular web sites. It ranges from just a few servers to many hundreds. Where do you fit? The easiest approach to figuring out how many servers you'll need is to find a company similar to yours and look how many they need. You won't need that many right away, but as you grow it's something to think about. Can your data center handle your growth? Do they have enough affordable bandwidth and rack space? How will you install and manage all the machines? Who will do the work? And a million other similar questions that might be better handled if you had some idea where you are going.
Get Someone Else to Do itClearly content sites end up needing a lot of servers. Videos, music, pictures, blogs, and attachments all eat up space and since that's your business you have no alternative but to find a way to store all that data. This is unstructured data that can be stored outside the database in a SAN or NAS. Or, rather that building your own storage infrastructure, you can follow the golden rule of laziness: get someone else to do it. That's what SmugMug, an image sharing company did. They use S3 to store many hundreds of terabytes of data. This drops the expense of creating a large highly available storage infrastructure so much that it creates a whole new level of competition for content rich sites. At one time expertise in creating massive storage farms would have been enough to keep competition away, but no more. These sorts of abilities are becoming commoditized, affordable, and open. PlentyOfFish and YouTube make use of CDNs to reduce the amount of infrastructure they need to create for themselves. If you need to stream video why not let a CDN do it instead of building out your own expensive infrastructure? You can take a "let other people do it approach" for services like email, DNS, backup, forums, and blogs too. These are all now outsourcable. Does it make sense to put these services in your data center if you don't need to? If you have compute intensive tasks you can use Amazon services without needing to perform your own build out. And an approach I am really excited to investigate in the future is a new breed of grid based virtual private data centers like 3tera and mediatemple. Their claim to fame is that you can componetize your infrastructure in such a way that you can scale automatically and transparently using their grid as demand fluctuates. I don't have any experience with this approach yet, but it's interesting and probably where the world is heading. If your web site is relatively simple blog then with mostly static content then you can get away with far fewer servers. Even a popular site like Digg has only 30GB of data to store.
How do your resources scale with the number of users?A question you have to ask also is do your resources scale linearly, exponentially, or not much at all with the number of users. A blog site may not scale much with the number of users. Some sites scale linearly as users are added. And others sites that rely on social interaction, like Google Talk, may scale exponentially as users are added. Getting a feel for the type of site you have can help more realistic numbers pop up on your magic server eight-ball.
What's your caching strategy?A lot of sites use Memcached and Squid for caching. You can fill up a few racks with caching servers. How many servers will you need for caching? Or can you get away with just beefing up the database server cache?
Do you need servers for application specific tasks?Servers aren't just for storage, database, and the web servers. You may have a bit of computation going on. YouTube offloads tag calculations to a server farm. GoogleTalk has to have servers for handling presence calculations. PlentyOfFish has servers to handle geographical searches because they are so resource intensive. GigaVox needs servers to transcode podcasts into different formats and include fresh commercial content. If you are a calendar service you may need servers to calculate more complicated schedule availability schemes and to sync address books. So depending on your site, you may have to budget for many application related servers like these. The Pingdom folks also created a sweet table on what technologies the companies profiled on this site are using. You can find it at What nine of the world’s largest websites are running on. I'm very jealous of their masterful colorful graphics-fu style. Someday I hope rise to that level of presentation skill.
That's what people at the NGDC Conference get together and talk about. A lot of interesting subjects: data center virtualization HPC & grid; advanced facilitates management and planning; advanced network and services; applications; data center optimization and security; managing and protecting information. The Grid – Distributed Computing at Scale presentation is an interesting one.
TypePad is considered the largest paid blogging service in the world. After experience problems because of their meteoric growth, they eventually transitioned to an architecture patterned after their sister company, LiveJournal. Site: http://www.typepad.com/
The questions was extracted from: http://highscalability.com/plentyoffish-architecture#comment-126 For startup like Markus, what is the best hosting option (and grow more later)? host your own server or use ISP co-location option? He still has to pay huge money on the bandwidth with that payload, right?