Tailrank Architecture - Learn How to Track Memes Across the Entire Blogosphere

Todd Hoff's picture

Ever feel like the blogosphere is 500 million channels with nothing on? Tailrank finds the internet's hottest channels by indexing over 24M weblogs and feeds per hour. That's 52TB of raw blog content (no, not sewage) a month and requires continuously processing 160Mbits of IO. How do they do that?

This is an email interview with Kevin Burton, founder and CEO of Tailrank.com. Kevin was kind enough to take the time to explain how they scale to index the entire blogosphere.

Sites

  • Tailrank - We track the hottest news in the blogosphere!
  • Spinn3r - A blog spider you can specialize with your own behavior instead of creating your own.
  • Kevin Burton's Blog - his blog is an indexing mix of politics and technical talk. Both are always interesting.

    Platform

  • MySQL
  • Java
  • Linux (Debian)
  • Apache
  • Squid
  • PowerDNS
  • DAS storage.
  • Federated database.
  • ServerBeach hosting.
  • Job scheduling system for work distribution.

    Interview

  • What is your system is for?

    Tailrank originally a memetracker to track the hottest news being discussed
    within the blogosphere.

    We started having a lot of requests to license our crawler and we shipped that
    in the form of Spinn3r about 8 months ago.

    Spinn3r is self contained crawler for companies that want to index the full
    blogosphere and consumer generated media.

    Tailrank is still a very important product alongside Spinn3r and we're working
    on Tailrank 3.0 which should be available in the future. No ETA at the moment
    but it's actively being worked on.

  • What particular design/architecture/implementation challenges does your system have?

    The biggest challenge we have is the sheer amount of data we have to process and
    keeping that data consistent within a distributed system.

    For example, we process 52TB of content per month. this has to be indexed in a
    highly available storage architecture so the normal distributed database
    problems arise.

  • What did you do to meet these challenges?

    We've spent a lot of time in building out a distributed system that can scale
    and handle failure.

    For example, we've built a tool called Task/Queue that is analogous to Google's
    MapReduce. It has a centralized queue server which hands out units of work to
    robots which make requests.

    It works VERY well for crawlers in that slower machines just fetch work at a
    slower rate while more modern machines (or better tuned machines) request work
    at a higher rate.

    This ends up easily solving one of the main distributed computing fallacies that
    the network is homogeneous.

    Task/Queue is generic enough that we could actually use it to implement
    MapReduce on top of the system.

    We'll probably open source it at some point. Right now it has too many
    tentacles wrapped into other parts of our system.

  • How big is your system?

    We index 24M weblogs and feeds per hour and process content at about
    160-200Mbps.

    At the raw level we're writing to our disks at about 10-15MBps continuously.

  • How many documents, do you serve? How many images? How much data?

    Right now the database is about 500G. We're expecting it to grow well beyond
    this in 2008 as we expand our product offering.

  • What is your rate of growth?

    It's mostly a function of customer feature requests. If our customers want more data we sell it to them.

    In 2008 we're planning on expanding our cluster to index larger portions of the
    web and consumer generated media.

  • What is the architecture of your system?

    We use Java, MySQL and Linux for our cluster.

    Java is a great language for writing crawlers. The library support is pretty
    solid (though it seems like Java 7 is going to be killer when they add
    closures).

    We use MySQL with InnoDB. We're mostly happy with it though it seems I end up
    spending about 20% of my time fixing MySQL bugs and limitations.

    Of course nothing is perfect. MySQL for example was really designed to be used
    on single core systems.

    The MySQL 5.1 release goes a bit farther to fix multi-core scalability locks.

    I recently blogged about how these the new multi-core machines should really be
    considered N machines instead of one logical unit: Distributed Computing Fallacy #9.

  • How is your system architected to scale?

    We use a federated database system so that we can split the write load as we see
    more IO.

    We've released a lot of our code as Open Source a lot of our infrastructure and
    this will probably be released as Open Source as well.

    We've already opened up a lot of our infrastructure code:

  • http://code.tailrank.com/lbpool - load balancing JDBC driver for use with DB connection pools.
  • http://code.tailrank.com/feedparser - Java RSS/Atom parser designed to elegantly support all versions of RSS
  • http://code.google.com/p/benchmark4j/ - Java (and UNIX) equivalent of Windows' perfmon
  • http://code.google.com/p/spinn3r-client/ - Client bindings to access the Spinn3r web service
  • http://code.google.com/p/mysqlslavesync/ - Clone a MySQL installation and setup replication.
  • http://code.google.com/p/log5j/ - Logger facade that supports printf style message format for both performance and ease of use.

  • How many servers do you have?

    About 15 machines so far. We've spent a lot of time tuning our infrastructure
    so it's pretty efficient. That said, building a scalable crawler is not an easy
    task so it does take a lot of hardware.

    We're going to be expanding FAR past this in 2008 and will probably hit about
    2-3 racks of machines (~120 boxes).

  • What operating systems do you use?

    Linux via Debian Etch on 64 bit Opterons. I'm a big Debian fan. I don't know
    why more hardware vendors don't support Debian.

    Debian is the big secret in the valley that no one talks about. Most of the big
    web 2.0 shops like Technorati, Digg, etc use Debian.

  • Which web server do you use?

    Apache 2.0. Lighttpd is looking interesting as well.

  • Which reverse proxy do you use?

    About 95% of the pages of Tailrank are served from Squid.

  • How is your system deployed in data centers?

    We use ServerBeach for hosting. It's a great model for small to medium sized
    startups. They rack the boxes, maintain inventory, handle network, etc. We
    just buy new machines and pay a flat markup.

    I wish Dell, SUN, HP would sell directly to clients in this manner.

    One right now. We're looking to expand into two for redundancy.

  • What is your storage strategy?

    Directly attached storage. We buy two SATA drives per box and set them up in
    RAID 0.

    We use the redundant array of inexpensive databases solution so if an individual
    machine fails there's another copy of the data on another box.

    Cheap SATA disks rule for what we do. They're cheap, commodity, and fast.

  • Do you have a standard API to your website?

    Tailrank has RSS feeds for every page.

    The Spinn3r service is itself an API and we have extensive documentation on the
    protocol.

    It's also free to use for researchers so if any of your readers are pursuing a
    Ph.D and generally doing research work and needs access to blog data we'd love
    to help them out.

    We already have the Ph.D students at the University of Washington and University
    of Maryland (my Alma Matter) using Spinn3r.

  • Which DNS service do you use?

    PowerDNS. It's a great product. We only use the recursor daemon but it's FAST.
    It uses async IO though so it doesn't really scale across processors on
    multicore boxes. Apparenty there's a hack to get it to run across cores but it
    isn't very reliable.

    AAA caching might be broken though. I still need to look into this.

  • Who do you admire?

    Donald Knuth is the man!

  • How are you thinking of changing your architecture in the future?

    We're still working on finishing up a fully sharded database. MySQL fault
    tolerance and autopromotion is also an issue.

  • Comments

    just swallow hard

    MySQL fault
    tolerance and autopromotion is also an issue.

    When a database is that important to you, you should treat it like any other infrastructure commodity where existing viable alternatives already exist:

    swallow hard and pony up for the best.

    "best" is in the eye of the beholder, so I leave that to the reader.

    another meme tracker

    Chinese memeTracker Site: http://www.onejoo.com/
    Monitoring more than 5 million bloggers,forums and news.

    Re: Tailrank Architecture - Learn How to Track Memes Across the

    How do you support fulltext search with InnoDB?

    -Ivo

    http://www.web20friends.net

    Re: Tailrank Architecture - Learn How to Track Memes Across the

    Hey Ivo.

    We don't use full text search on innodb.... we actually load up a snapshot of the data into myisam and use that for fulltext search.

    For a small document set MySQL full text search is somewhat respectable.

    Re: Tailrank Architecture - Learn How to Track Memes Across the

    How come the tailrank feedparser wiki is full of porn links in the form of tickets?
    http://code.tailrank.com/feedparser/report/6

    Comment viewing options

    Select your preferred way to display the comments and click "Save settings" to activate your changes.

    Post new comment

    The content of this field is kept private and will not be shown publicly.
    • Web page addresses and e-mail addresses turn into links automatically.
    • Allowed HTML tags: <a> <em> <strong> <cite> <code> <ul> <ol> <li> <dl> <dt> <dd><div ?=?><p ?=?> <img ?=?> <embed ?=?> <h1 ?=?><h2 ?=?><h3 ?=?>
    • Lines and paragraphs break automatically.
    • Glossary terms will be automatically marked with links to their descriptions
    • You may link to webpages through the weblinks registry

    More information about formatting options

    To combat spam, please enter the code in the image.