Ever feel like the blogosphere is 500 million channels with nothing on? Tailrank finds the internet's hottest channels by indexing over 24M weblogs and feeds per hour. That's 52TB of raw blog content (no, not sewage) a month and requires continuously processing 160Mbits of IO. How do they do that? This is an email interview with Kevin Burton, founder and CEO of Tailrank.com. Kevin was kind enough to take the time to explain how they scale to index the entire blogosphere.
Hi, I saw an year ago that Netapp sold netcache to blu-coat, my site is a heavy NetCache user and we cached 83% of our site. We tested with Blue-coat and F5 WA and we are not getting same performce as NetCache. Any of you guys have the same issue? or somebody knows another product can handle much traffic? Thanks Rodrigo
Bees have a similar problem to website servers: how to do a lot of work with limited resources in an ever changing environment. Usually lessons from biology are hard to apply to computer problems. Nature throws hardware at problems. Billions and billions of cells cooperate at different levels of organizations to find food, fight lions, and make sure your DNA is passed on. Nature's software is "simple," but her hardware rocks. We do the opposite. For us hardware is in short supply so we use limited hardware and leverage "smart" software to work around our inability to throw hardware at problems. But we might be able to borrow some load balancing techniques from bees. What do bees do that we can learn from? Bees do a dance to indicate the quality and location of a nectar source. When a bee finds a better source they do a better dance and resources shift to the new location. This approach may seem inefficient, but it turns out to be "optimal for the unpredictable nectar world." Craig Tovey and Sunil Nakrani are trying to apply these lessons to more efficiently allocate work to servers: Tovey and Nakrani set to work translating the bee strategy for these idle Internet servers. They developed a virtual “dance floor” for a network of servers. When one server receives a user request for a certain Web site, an internal advertisement (standing in a little less colorfully for the dance) is placed on the dance floor to attract any available servers. The ad’s duration depends on the demand on the site and how much revenue its users may generate. The longer an ad remains on the dance floor, the more power available servers devote to serving the Web site requests advertised. Sounds like an open source project that could get a lot of good buzz. You can imagine lots of cool logos and sweet project names. Maybe it could be sponsored by the Honey council?
From the website: The lbpool project provides a load balancing JDBC driver for use with DB connection pools. It wraps a normal JDBC driver providing reconnect semantics in the event of additional hardware availability, partial system failure, or uneven load distribution. It also evenly distributes all new connections among slave DB servers in a given pool. Each time connect() is called it will attempt to use the best server with the least system load. The biggest scalability issue with large applications that are mostly READ bound is the number of transactions per second that the disks in your cluster can handle. You can generally solve this in two ways. 1. Buy bigger and faster disks with expensive RAID controllers. 2. Buy CHEAP hardware on CHEAP disks but lots of machines. We prefer the cheap hardware approach and lbpool allows you to do this. Even if you *did* manage to use cheap hardware most load balancing hardware is expensive, requires a redundant balancer (if it were to fail), and seldom has native support for MySQL. The lbpool driver addresses all these needs. The original solution was designed for use within MySQL replication clusters. This generally involves a master server handling all writes with a series of slaves which handle all reads. In this situation we could have hundreds of slaves and lbpool would load balance queries among the boxes. If you need more read performance just buy more boxes. If any of them fail it won't hurt your application because lbpool will simply block for a few seconds and move your queries over to a new production server. In this post Kevin Burton of Spinn3r mentions they've been using this product to good effect for handling MySQL replication faults, balancing, and crashed servers.
Mogulus Doesn't Own a Single Server and has $1.2 million in funding, 15,000 People Creating Channels
Scoble the Ubiquitous has a fascinating post on how Mogulus, a live video channel startup, uses S3/EC2 and doesn't own a single server. The trends that have been happening for a while now are going mainstream. To do great things you no longer need to start by creating a huge war chest. You can forage off the land, like any good mobile, light weight fighting unit. For a strategy hit he mentions the same needed change in perspective as Beau Lebens talked about when making FeedBlendr: One tip he gave us is that when using Amazon’s services you have to design your systems with the assumption that they will never be up and running. What he means by that is services are “volatile” and can go up and down without notice. So, he’s designed his systems to survive that. He told me that it meant his engineering teams had to be quite disciplined in designing their architecture.
I'm moving this from the forum section to the front page. Just FYI, any registered user can Submit a Link to this blog. You don't have to use the forums. In The Architectures You've Always Wondered About track at the Qcon conference, Second Life, eBay, Yahoo, LinkedIn and Orbitz presented how they dealt with different aspects of their applications, such as scalability. There were quite a few lessons that I learned that day that I thought were worth sharing.
In The Architectures You've Always Wondered About track at the Qcon conference, Second Life, eBay, Yahoo, LinkedIn and Orbitz presented how they dealt with different aspects of their applications, such as scalability. There were quite a few lessons that I learned that day that I thought were worth sharing. The details are provided below: Lessons from Yahoo, eBay, Orbitz, LinkedIn architecture
Dryad is Microsoft's answer to Google's map-reduce. What's the question: How do you process really large amounts of data? My initial impression of Dryad is it's like a giant Unix command line filter on steroids. There are lots of inputs, outputs, tees, queues, and merge sorts all connected together by a master exec program. What else does Dryad have to offer the scalable infrastructure wars? Dryad models programs as the execution of a directed acyclic graph. Each vertex is a program and edges are typed communication channels (files, TCP pipes, and shared memory channels within a process). Map-reduce uses a different model. It's more like a large distributed sort where the programmer defines functions for mapping, partitioning, and reducing. Each approach seems to borrow from the spirit of its creating organization. The graph approach seems a bit too complicated and map-reduce seems a bit too simple. How ironic, in the Alanis Morissette sense. Dryad is a middleware layer that executes graphs for you, automatically taking care of scheduling, distribution, and fault tolerance. It's written in C++, but apparently few write directly to this layer, most people use higher layer interfaces. A Job Manager runs the program. It's a library you link in and it loads and executes the graph. A daemon runs on each machine to run jobs. A name server provides access to cluster resources. The DAG is a multigraph so you can have multiple edges between vertices. A DAG was chosen because it's not too cold, or too hot, the porridge is just right. Cycles are too hard. Simpler isn't as useful. DAGs support relational algebra and can split multiple inputs and outputs nicely. One interesting aspect is a a channel is a sequence of structure items that are C++ objects. This means pointers can be passed directly so you don't have to worry about serialization overhead. No restrictions are put on the data model. Graphs are dynamically changeable at runtime which allows for a lot of optimizations. Several case studies were provided. It's probably just me, but I didn't really understand what was going on. Google's example is much better. Everyone can relate to counting words in a document. My thoughts while watching is that the graph stuff sounds cool and general, but it's hard to map it efficiently to solutions when the problems have large numbers of inputs. You have to manually optimize for available RAM and CPUs. The system should do all this work for you. But the graph approach is powerful. The programmer provide the bits of atomic behaviour and the system can then try various optimizations. The code doesn't have to change because the graph can be manipulated abstractly on its own. So you can write something like a SQL query. Then something like a query planner figures out how to execute the query on Dryad.
Update: Flickr hits 2 Billion photos served. That's a lot of hamburgers. Flickr is both my favorite bird and the web's leading photo sharing site. Flickr has an amazing challenge, they must handle a vast sea of ever expanding new content, ever increasing legions of users, and a constant stream of new features, all while providing excellent performance. How do they do it? Site: http://www.flickr.com/
Hey, this scaling stuff might just be important. Jim Scheinman, former Bebo and Friendster exec, puts the blame squarely on Friendster's inability to scale as why they lost the social networking race: VB: Can you tell me a bit about what you learned in your time at Friendster? JS: For me, it basically came down to failed execution on the technology side — we had millions of Friendster members begging us to get the site working faster so they could log in and spend hours social networking with their friends. I remember coming in to the office for months reading thousands of customer service emails telling us that if we didn’t get our site working better soon, they’d be ‘forced to join’ a new social networking site that had just launched called MySpace…the rest is history. To be fair to Friendster’s technology team at the time, they were on the forefront of many new scaling and database issues that web sites simply hadn’t had to deal with prior to Friendster. As is often the case, the early pioneer made critical mistakes that enabled later entrants to the market, MySpace, Facebook & Bebo to learn and excel. As a postscript to the story, it’s interesting to note that Kent Lindstrom (CEO of Friendster) and the rest of the team have done an outstanding job righting that ship. Hopefully with all the quality information out now on the intertubes visionaries can concentrate on making good stuff instead of always fighting the plumbing. When you think about, is there any industry or group that gives so much value away for free as the software community? I don't think so. We are an amazingly giving group and the world has benefited greatly from that impulse. A thought for Thanksgiving.