advertise
Monday
Nov192007

Tailrank Architecture - Learn How to Track Memes Across the Entire Blogosphere

Ever feel like the blogosphere is 500 million channels with nothing on? Tailrank finds the internet's hottest channels by indexing over 24M weblogs and feeds per hour. That's 52TB of raw blog content (no, not sewage) a month and requires continuously processing 160Mbits of IO. How do they do that? This is an email interview with Kevin Burton, founder and CEO of Tailrank.com. Kevin was kind enough to take the time to explain how they scale to index the entire blogosphere.

Sites

  • Tailrank - We track the hottest news in the blogosphere!
  • Spinn3r - A blog spider you can specialize with your own behavior instead of creating your own.
  • Kevin Burton's Blog - his blog is an indexing mix of politics and technical talk. Both are always interesting.

    Platform

  • MySQL
  • Java
  • Linux (Debian)
  • Apache
  • Squid
  • PowerDNS
  • DAS storage.
  • Federated database.
  • ServerBeach hosting.
  • Job scheduling system for work distribution.

    Interview

  • What is your system is for? Tailrank originally a memetracker to track the hottest news being discussed within the blogosphere. We started having a lot of requests to license our crawler and we shipped that in the form of Spinn3r about 8 months ago. Spinn3r is self contained crawler for companies that want to index the full blogosphere and consumer generated media. Tailrank is still a very important product alongside Spinn3r and we're working on Tailrank 3.0 which should be available in the future. No ETA at the moment but it's actively being worked on.
  • What particular design/architecture/implementation challenges does your system have? The biggest challenge we have is the sheer amount of data we have to process and keeping that data consistent within a distributed system. For example, we process 52TB of content per month. this has to be indexed in a highly available storage architecture so the normal distributed database problems arise.
  • What did you do to meet these challenges? We've spent a lot of time in building out a distributed system that can scale and handle failure. For example, we've built a tool called Task/Queue that is analogous to Google's MapReduce. It has a centralized queue server which hands out units of work to robots which make requests. It works VERY well for crawlers in that slower machines just fetch work at a slower rate while more modern machines (or better tuned machines) request work at a higher rate. This ends up easily solving one of the main distributed computing fallacies that the network is homogeneous. Task/Queue is generic enough that we could actually use it to implement MapReduce on top of the system. We'll probably open source it at some point. Right now it has too many tentacles wrapped into other parts of our system.
  • How big is your system? We index 24M weblogs and feeds per hour and process content at about 160-200Mbps. At the raw level we're writing to our disks at about 10-15MBps continuously.
  • How many documents, do you serve? How many images? How much data? Right now the database is about 500G. We're expecting it to grow well beyond this in 2008 as we expand our product offering.
  • What is your rate of growth? It's mostly a function of customer feature requests. If our customers want more data we sell it to them. In 2008 we're planning on expanding our cluster to index larger portions of the web and consumer generated media.
  • What is the architecture of your system? We use Java, MySQL and Linux for our cluster. Java is a great language for writing crawlers. The library support is pretty solid (though it seems like Java 7 is going to be killer when they add closures). We use MySQL with InnoDB. We're mostly happy with it though it seems I end up spending about 20% of my time fixing MySQL bugs and limitations. Of course nothing is perfect. MySQL for example was really designed to be used on single core systems. The MySQL 5.1 release goes a bit farther to fix multi-core scalability locks. I recently blogged about how these the new multi-core machines should really be considered N machines instead of one logical unit: Distributed Computing Fallacy #9.
  • How is your system architected to scale? We use a federated database system so that we can split the write load as we see more IO. We've released a lot of our code as Open Source a lot of our infrastructure and this will probably be released as Open Source as well. We've already opened up a lot of our infrastructure code:
  • http://code.tailrank.com/lbpool - load balancing JDBC driver for use with DB connection pools.
  • http://code.tailrank.com/feedparser - Java RSS/Atom parser designed to elegantly support all versions of RSS
  • http://code.google.com/p/benchmark4j/ - Java (and UNIX) equivalent of Windows' perfmon
  • http://code.google.com/p/spinn3r-client/ - Client bindings to access the Spinn3r web service
  • http://code.google.com/p/mysqlslavesync/ - Clone a MySQL installation and setup replication.
  • http://code.google.com/p/log5j/ - Logger facade that supports printf style message format for both performance and ease of use.
  • How many servers do you have? About 15 machines so far. We've spent a lot of time tuning our infrastructure so it's pretty efficient. That said, building a scalable crawler is not an easy task so it does take a lot of hardware. We're going to be expanding FAR past this in 2008 and will probably hit about 2-3 racks of machines (~120 boxes).
  • What operating systems do you use? Linux via Debian Etch on 64 bit Opterons. I'm a big Debian fan. I don't know why more hardware vendors don't support Debian. Debian is the big secret in the valley that no one talks about. Most of the big web 2.0 shops like Technorati, Digg, etc use Debian.
  • Which web server do you use? Apache 2.0. Lighttpd is looking interesting as well.
  • Which reverse proxy do you use? About 95% of the pages of Tailrank are served from Squid.
  • How is your system deployed in data centers? We use ServerBeach for hosting. It's a great model for small to medium sized startups. They rack the boxes, maintain inventory, handle network, etc. We just buy new machines and pay a flat markup. I wish Dell, SUN, HP would sell directly to clients in this manner. One right now. We're looking to expand into two for redundancy.
  • What is your storage strategy? Directly attached storage. We buy two SATA drives per box and set them up in RAID 0. We use the redundant array of inexpensive databases solution so if an individual machine fails there's another copy of the data on another box. Cheap SATA disks rule for what we do. They're cheap, commodity, and fast.
  • Do you have a standard API to your website? Tailrank has RSS feeds for every page. The Spinn3r service is itself an API and we have extensive documentation on the protocol. It's also free to use for researchers so if any of your readers are pursuing a Ph.D and generally doing research work and needs access to blog data we'd love to help them out. We already have the Ph.D students at the University of Washington and University of Maryland (my Alma Matter) using Spinn3r.
  • Which DNS service do you use? PowerDNS. It's a great product. We only use the recursor daemon but it's FAST. It uses async IO though so it doesn't really scale across processors on multicore boxes. Apparenty there's a hack to get it to run across cores but it isn't very reliable. AAA caching might be broken though. I still need to look into this.
  • Who do you admire? Donald Knuth is the man!
  • How are you thinking of changing your architecture in the future? We're still working on finishing up a fully sharded database. MySQL fault tolerance and autopromotion is also an issue.

    Click to read more ...

  • Sunday
    Nov182007

    Reverse Proxy

    Hi, I saw an year ago that Netapp sold netcache to blu-coat, my site is a heavy NetCache user and we cached 83% of our site. We tested with Blue-coat and F5 WA and we are not getting same performce as NetCache. Any of you guys have the same issue? or somebody knows another product can handle much traffic? Thanks Rodrigo

    Click to read more ...

    Saturday
    Nov172007

    Can How Bees Solve their Load Balancing Problems Help Build More Scalable Websites?

    Bees have a similar problem to website servers: how to do a lot of work with limited resources in an ever changing environment. Usually lessons from biology are hard to apply to computer problems. Nature throws hardware at problems. Billions and billions of cells cooperate at different levels of organizations to find food, fight lions, and make sure your DNA is passed on. Nature's software is "simple," but her hardware rocks. We do the opposite. For us hardware is in short supply so we use limited hardware and leverage "smart" software to work around our inability to throw hardware at problems. But we might be able to borrow some load balancing techniques from bees. What do bees do that we can learn from? Bees do a dance to indicate the quality and location of a nectar source. When a bee finds a better source they do a better dance and resources shift to the new location. This approach may seem inefficient, but it turns out to be "optimal for the unpredictable nectar world." Craig Tovey and Sunil Nakrani are trying to apply these lessons to more efficiently allocate work to servers: Tovey and Nakrani set to work translating the bee strategy for these idle Internet servers. They developed a virtual “dance floor” for a network of servers. When one server receives a user request for a certain Web site, an internal advertisement (standing in a little less colorfully for the dance) is placed on the dance floor to attract any available servers. The ad’s duration depends on the demand on the site and how much revenue its users may generate. The longer an ad remains on the dance floor, the more power available servers devote to serving the Web site requests advertised. Sounds like an open source project that could get a lot of good buzz. You can imagine lots of cool logos and sweet project names. Maybe it could be sponsored by the Honey council?

    Click to read more ...

    Friday
    Nov162007

    Product: lbpool - Load Balancing JDBC Pool

    From the website: The lbpool project provides a load balancing JDBC driver for use with DB connection pools. It wraps a normal JDBC driver providing reconnect semantics in the event of additional hardware availability, partial system failure, or uneven load distribution. It also evenly distributes all new connections among slave DB servers in a given pool. Each time connect() is called it will attempt to use the best server with the least system load. The biggest scalability issue with large applications that are mostly READ bound is the number of transactions per second that the disks in your cluster can handle. You can generally solve this in two ways. 1. Buy bigger and faster disks with expensive RAID controllers. 2. Buy CHEAP hardware on CHEAP disks but lots of machines. We prefer the cheap hardware approach and lbpool allows you to do this. Even if you *did* manage to use cheap hardware most load balancing hardware is expensive, requires a redundant balancer (if it were to fail), and seldom has native support for MySQL. The lbpool driver addresses all these needs. The original solution was designed for use within MySQL replication clusters. This generally involves a master server handling all writes with a series of slaves which handle all reads. In this situation we could have hundreds of slaves and lbpool would load balance queries among the boxes. If you need more read performance just buy more boxes. If any of them fail it won't hurt your application because lbpool will simply block for a few seconds and move your queries over to a new production server. In this post Kevin Burton of Spinn3r mentions they've been using this product to good effect for handling MySQL replication faults, balancing, and crashed servers.

    Click to read more ...

    Friday
    Nov162007

    Mogulus Doesn't Own a Single Server and has $1.2 million in funding, 15,000 People Creating Channels

    Scoble the Ubiquitous has a fascinating post on how Mogulus, a live video channel startup, uses S3/EC2 and doesn't own a single server. The trends that have been happening for a while now are going mainstream. To do great things you no longer need to start by creating a huge war chest. You can forage off the land, like any good mobile, light weight fighting unit. For a strategy hit he mentions the same needed change in perspective as Beau Lebens talked about when making FeedBlendr: One tip he gave us is that when using Amazon’s services you have to design your systems with the assumption that they will never be up and running. What he means by that is services are “volatile” and can go up and down without notice. So, he’s designed his systems to survive that. He told me that it meant his engineering teams had to be quite disciplined in designing their architecture.

    Click to read more ...

    Thursday
    Nov152007

    Lessons from Yahoo, eBay, Orbitz, LinkedIn architecture

    I'm moving this from the forum section to the front page. Just FYI, any registered user can Submit a Link to this blog. You don't have to use the forums. In The Architectures You've Always Wondered About track at the Qcon conference, Second Life, eBay, Yahoo, LinkedIn and Orbitz presented how they dealt with different aspects of their applications, such as scalability. There were quite a few lessons that I learned that day that I thought were worth sharing.

    Click to read more ...

    Thursday
    Nov152007

    Lessons from Yahoo, eBay, Orbitz, LinkedIn architecture

    In The Architectures You've Always Wondered About track at the Qcon conference, Second Life, eBay, Yahoo, LinkedIn and Orbitz presented how they dealt with different aspects of their applications, such as scalability. There were quite a few lessons that I learned that day that I thought were worth sharing. The details are provided below: Lessons from Yahoo, eBay, Orbitz, LinkedIn architecture

    Click to read more ...

    Thursday
    Nov152007

    Video: Dryad: A general-purpose distributed execution platform

    Dryad is Microsoft's answer to Google's map-reduce. What's the question: How do you process really large amounts of data? My initial impression of Dryad is it's like a giant Unix command line filter on steroids. There are lots of inputs, outputs, tees, queues, and merge sorts all connected together by a master exec program. What else does Dryad have to offer the scalable infrastructure wars? Dryad models programs as the execution of a directed acyclic graph. Each vertex is a program and edges are typed communication channels (files, TCP pipes, and shared memory channels within a process). Map-reduce uses a different model. It's more like a large distributed sort where the programmer defines functions for mapping, partitioning, and reducing. Each approach seems to borrow from the spirit of its creating organization. The graph approach seems a bit too complicated and map-reduce seems a bit too simple. How ironic, in the Alanis Morissette sense. Dryad is a middleware layer that executes graphs for you, automatically taking care of scheduling, distribution, and fault tolerance. It's written in C++, but apparently few write directly to this layer, most people use higher layer interfaces. A Job Manager runs the program. It's a library you link in and it loads and executes the graph. A daemon runs on each machine to run jobs. A name server provides access to cluster resources. The DAG is a multigraph so you can have multiple edges between vertices. A DAG was chosen because it's not too cold, or too hot, the porridge is just right. Cycles are too hard. Simpler isn't as useful. DAGs support relational algebra and can split multiple inputs and outputs nicely. One interesting aspect is a a channel is a sequence of structure items that are C++ objects. This means pointers can be passed directly so you don't have to worry about serialization overhead. No restrictions are put on the data model. Graphs are dynamically changeable at runtime which allows for a lot of optimizations. Several case studies were provided. It's probably just me, but I didn't really understand what was going on. Google's example is much better. Everyone can relate to counting words in a document. My thoughts while watching is that the graph stuff sounds cool and general, but it's hard to map it efficiently to solutions when the problems have large numbers of inputs. You have to manually optimize for available RAM and CPUs. The system should do all this work for you. But the graph approach is powerful. The programmer provide the bits of atomic behaviour and the system can then try various optimizations. The code doesn't have to change because the graph can be manipulated abstractly on its own. So you can write something like a SQL query. Then something like a query planner figures out how to execute the query on Dryad.

    Click to read more ...

    Tuesday
    Nov132007

    Flickr Architecture

    Update: Flickr hits 2 Billion photos served. That's a lot of hamburgers. Flickr is both my favorite bird and the web's leading photo sharing site. Flickr has an amazing challenge, they must handle a vast sea of ever expanding new content, ever increasing legions of users, and a constant stream of new features, all while providing excellent performance. How do they do it? Site: http://www.flickr.com/

    Information Sources

  • Flickr and PHP (an early document)
  • Capacity Planning for LAMP
  • Federation at Flickr: Doing Billions of Queries a Day by Dathan Pattishall.
  • Building Scalable Web Sites by Cal Henderson from Flickr.
  • Database War Stories #3: Flickr by Tim O'Reilly
  • Cal Henderson's Talks. A lot of useful PowerPoint presentations.

    Platform

  • PHP
  • MySQL
  • Shards
  • Memcached for a caching layer.
  • Squid in reverse-proxy for html and images.
  • Linux (RedHat)
  • Smarty for templating
  • Perl
  • PEAR for XML and Email parsing
  • ImageMagick, for image processing
  • Java, for the node service
  • Apache
  • SystemImager for deployment
  • Ganglia for distributed system monitoring
  • Subcon stores essential system configuration files in a subversion repository for easy deployment to machines in a cluster.
  • Cvsup for distributing and updating collections of files across a network.

    The Stats

  • More than 4 billion queries per day.
  • ~35M photos in squid cache (total)
  • ~2M photos in squid’s RAM
  • ~470M photos, 4 or 5 sizes of each
  • 38k req/sec to memcached (12M objects)
  • 2 PB raw storage (consumed about ~1.5TB on Sunday
  • Over 400,000 photos being added every day

    The Architecture

  • A pretty picture of Flickr's architecture can be found on this slide . A simple depiction is: -- Pair of ServerIron's ---- Squid Caches ------ Net App's ---- PHP App Servers ------ Storage Manager ------ Master-master shards ------ Dual Tree Central Database ------ Memcached Cluster ------ Big Search Engine - The Dual Tree structure is a custom set of changes to MySQL that allows scaling by incrementally adding masters without a ring architecture. This allows cheaper scaling because you need less hardware as compared to master-master setups which always requires double the hardware. - The central database includes data like the 'users' table, which includes primary user keys (a few different IDs) and a pointer to which shard a users' data can be found on.
  • Use dedicated servers for static content.
  • Talks about how to support Unicode.
  • Use a share nothing architecture.
  • Everything (except photos) are stored in the database.
  • Statelessness means they can bounce people around servers and it's easier to make their APIs.
  • Scaled at first by replication, but that only helps with reads.
  • Create a search farm by replicating the portion of the database they want to search.
  • Use horizontal scaling so they just need to add more machines.
  • Handle pictures emailed from users by parsing each email is it's delivered in PHP. Email is parsed for any photos.
  • Earlier they suffered from Master-Slave lag. Too much load and they had a single point of failure.
  • They needed the ability to make live maintenance, repair data, and so forth, without taking the site down.
  • Lots of excellent material on capacity planning. Take a look in the Information Sources for more details.
  • Went to a federated approach so they can scale far into the future: - Shards: My data gets stored on my shard, but the record of performing action on your comment, is on your shard. When making a comment on someone else's’ blog - Global Ring: Its like DNS, you need to know where to go and who controls where you go. Every page view, calculate where your data is, at that moment of time. - PHP logic to connect to the shards and keep the data consistent (10 lines of code with comments!)
  • Shards: - Slice of the main database - Active Master-Master Ring Replication: a few drawbacks in MySQL 4.1, as honoring commits in Master-Master. AutoIncrement IDs are automated to keep it Active Active. - Shard assignments are from a random number for new accounts - Migration is done from time to time, so you can remove certain power users. Needs to be balanced if you have a lot of photos… 192,000 photos, 700,000 tags, will take about 3-4 minutes. Migration is done manually.
  • Clicking a Favorite: - Pulls the Photo owners Account from Cache, to get the shard location (say on shard-5) - Pulls my Information from cache, to get my shard location (say on shard-13) - Starts a “distributed transaction” - to answer the question: Who favorited the photo? What are my favorites?
  • Can ask question from any shard, and recover data. Its absolutely redundant.
  • To get rid of replication lag… - every page load, the user is assigned to a bucket - if host is down, go to next host in the list; if all hosts are down, display an error page. They don’t use persistent connections, they build connections and tear it down. Every page load thus, tests the connection.
  • Every users reads and writes are kept in one shard. Notion of replication lag is gone.
  • Each server in shard is 50% loaded. Shut down 1/2 the servers in each shard. So 1 server in the shard can take the full load if a server of that shard is down or in maintenance mode. To upgrade you just have to shut down half the shard, upgrade that half, and then repeat the process.
  • Periods of time when traffic spikes, they break the 50% rule though. They do something like 6,000-7,000 queries per second. Now, its designed for at most 4,000 queries per second to keep it at 50% load.
  • Average queries per page, are 27-35 SQL statements. Favorites counts are real time. API access to the database is all real time. Achieved the real time requirements without any disadvantages.
  • Over 36,000 queries per second - running within capacity threshold. Burst of traffic, double 36K/qps.
  • Each Shard holds 400K+ users data. - A lot of data is stored twice. For example, a comment is part of the relation between the commentor and the commentee. Where is the comment stored? How about both places? Transactions are used to prevent out of sync data: open transaction 1, write commands, open transaction 2, write commands, commit 1st transaction if all is well, commit 2nd transaction if 1st committed. but there still a chance for failure when a box goes down during the 1st commit.
  • Search: - Two search back-ends: shards 35k qps on a few shards and Yahoo!’s (proprietary) web search - Owner’s single tag search or a batch tag change (say, via Organizr) goes to the Shards due to real-time requirements, everything else goes to Yahoo!’s engine (probably about 90% behind the real-time goodness) - Think of it such that you’ve got Lucene-like search
  • Hardware: - EMT64 w/RHEL4, 16GB RAM - 6-disk 15K RPM RAID-10. - Data size is at 12 TB of user metadata (these are not photos, this is just innodb ibdata files - the photos are a lot larger). - 2U boxes. Each shard has~120GB of data.
  • Backup procedure: - ibbackup on a cron job, that runs across various shards at different times. Hotbackup to a spare. - Snapshots are taken every night across the entire cluster of databases. - Writing or deleting several huge backup files at once to a replication filestore can wreck performance on that filestore for the next few hours as it replicates the backup files. Doing this to an in-production photo storage filer is a bad idea. - However much it costs to keep multiple days of backups of all of your data, it's worth it. Keeping staggered backups is good for when you discover something gone wrong a few days later. something like 1, 2, 10 and 30 day backups.
  • Photos are stored on the filer. Upon upload, it processes the photos, gives you different sizes, then its complete. Metadata and points to the filers, are stored in the database.
  • Aggregating the data: Very fast, because its a process per shard. Stick it into a table, or recover data from another copy from other users shards.
  • max_connections = 400 connections per shard, or 800 connections per server & shard. Plenty of capacity and connections. Thread cache is set to 45, because you don’t have more than 45 users having simultaneous activity.
  • Tags: - Tags do not fit well with traditional normalized RDBMs schema design. Denormalization or heavy caching is the only way to generate a tag cloud in milliseconds for hundreds of millions of tags. - Some of their data views are calculated offline by dedicated processing clusters which save the results into MySQL because some relationships are so complicated to calculate it would absorb all the database CPU cycles.
  • Future Direction: - Make it faster with real-time BCP, so all data centers can receive writes to the data layer (db, memcache, etc) all at the same time. Everything is active nothing will ever be idle.

    Lessons Learned

  • Think of your application as more than just a web application. You'll have REST APIs, SOAP APIs, RSS feeds, Atom feeds, etc.
  • Go stateless. Statelessness makes for a simpler more robust system that can handle upgrades without flinching.
  • Re-architecting your database sucks.
  • Capacity plan. Bring capacity planning into the product discussion EARLY. Get buy-in from the $$$ people (and engineering management) that it’s something to watch.
  • Start slow. Don’t buy too much equipment just because you’re scared/happy that your site will explode.
  • Measure reality. Capacity planning math should be based on real things, not abstract ones.
  • Build in logging and metrics. Usage stats are just as important as server stats. Build in custom metrics to measure real-world usage to server-based stats.
  • Cache. Caching and RAM is the answer to everything.
  • Abstract. Create clear levels of abstraction between database work, business logic, page logic, page mark-up and the presentation layer. This supports quick turn around iterative development.
  • Layer. Layering allows developers to create page level logic which designers can use to build the user experience. Designers can ask for page logic as needed. It's a negotiation between the two parties.
  • Release frequently. Even every 30 minutes.
  • Forget about small efficiencies, about 97% of the time. Premature optimization is the root of all evil.
  • Test in production. Build into the architecture mechanisms (config flags, load balancing, etc.) with which you can deploy new hardware easily into (and out of) production.
  • Forget benchmarks. Benchmarks are fine for getting a general idea of capabilities, but not for planning. Artificial tests give artificial results, and the time is better used with testing for real.
  • Find ceilings. - What is the maximum something that every server can do ? - How close are you to that maximum, and how is it trending ? - MySQL (disk IO ?) - SQUID (disk IO ? or CPU ?) - memcached (CPU ? or network ?)
  • Be sensitive to the usage patterns for your type of application. - Do you have event related growth? For example: disaster, news event. - Flickr gets 20-40% more uploads on first work day of the year than any previous peak the previous year. - 40-50% more uploads on Sundays than the rest of the week, on average
  • Be sensitive to the demands of exponential growth. More users means more content, more content means more connections, more connections mean more usage.
  • Plan for peaks. Be able to handle peak loads up and down the stack.

    Click to read more ...

  • Tuesday
    Nov132007

    Friendster Lost Lead Because of a Failure to Scale

    Hey, this scaling stuff might just be important. Jim Scheinman, former Bebo and Friendster exec, puts the blame squarely on Friendster's inability to scale as why they lost the social networking race: VB: Can you tell me a bit about what you learned in your time at Friendster? JS: For me, it basically came down to failed execution on the technology side — we had millions of Friendster members begging us to get the site working faster so they could log in and spend hours social networking with their friends. I remember coming in to the office for months reading thousands of customer service emails telling us that if we didn’t get our site working better soon, they’d be ‘forced to join’ a new social networking site that had just launched called MySpace…the rest is history. To be fair to Friendster’s technology team at the time, they were on the forefront of many new scaling and database issues that web sites simply hadn’t had to deal with prior to Friendster. As is often the case, the early pioneer made critical mistakes that enabled later entrants to the market, MySpace, Facebook & Bebo to learn and excel. As a postscript to the story, it’s interesting to note that Kent Lindstrom (CEO of Friendster) and the rest of the team have done an outstanding job righting that ship. Hopefully with all the quality information out now on the intertubes visionaries can concentrate on making good stuff instead of always fighting the plumbing. When you think about, is there any industry or group that gives so much value away for free as the software community? I don't think so. We are an amazingly giving group and the world has benefited greatly from that impulse. A thought for Thanksgiving.

    Click to read more ...