
Build your own Twitter-like real-time analytics - a step-by-step guide

Major social networking platforms like Facebook and Twitter have developed their own architectures for handling the need for real-time analytics on huge amounts of data. However, not every company has the need or resources to build their own Twitter-like solution.

In this example we have taken the same Twitter/Facebook-like blueprint, and made it simple enough for developers to implement. We have taken the following approach in our implementation: 

  1. Use an In-Memory Data Grid (XAP) for handling the real-time stream processing.
  2. Use a Big Data database (Cassandra) for storing the historical data and managing the trend analytics.
  3. Use Cloudify (cloudifysource.org) for managing and automating the deployment on a private or public cloud.

The example demonstrates a simple case of word-count analytics. It uses Spring Social to plug in to real Twitter feeds. The solution is designed to cope efficiently with receiving and processing a large volume of tweets. First, we partition the tweets so that we can process them in parallel, which means deciding how to partition them efficiently. Partitioning by user might not be sufficiently balanced, because tweet volume can vary widely from one user to another. We therefore decided to partition by tweet ID, which we assume to be globally unique. Next, we need to persist and process the data with low latency, and for this we store the tweets in memory.
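To make the partitioning idea concrete, here is a minimal sketch in plain Python. It is not the XAP or Spring Social code from the example: the names `partition_for`, `Partition`, `process_stream`, and `global_counts`, the modulo routing rule, the tokenizer, and the partition count of 4 are all assumptions chosen for illustration. It routes each tweet to a partition by its ID, keeps the tweets and per-partition word counts in memory, and merges the partial counts on demand.

```python
import re
from collections import Counter

NUM_PARTITIONS = 4  # hypothetical partition count, chosen for illustration


def partition_for(tweet_id, num_partitions=NUM_PARTITIONS):
    """Route a tweet to a partition using its (assumed globally unique) ID."""
    return tweet_id % num_partitions


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class Partition:
    """One in-memory partition: holds its tweets and a running word count."""

    def __init__(self):
        self.tweets = {}              # tweet_id -> text, kept in memory
        self.word_counts = Counter()  # word -> occurrences in this partition

    def process(self, tweet_id, text):
        if tweet_id in self.tweets:   # skip tweets we have already seen
            return
        self.tweets[tweet_id] = text
        self.word_counts.update(tokenize(text))


def process_stream(tweets, num_partitions=NUM_PARTITIONS):
    """Distribute (tweet_id, text) pairs across partitions and count words."""
    partitions = [Partition() for _ in range(num_partitions)]
    for tweet_id, text in tweets:
        index = partition_for(tweet_id, num_partitions)
        partitions[index].process(tweet_id, text)
    return partitions


def global_counts(partitions):
    """Merge the per-partition counts into a single word count."""
    total = Counter()
    for partition in partitions:
        total.update(partition.word_counts)
    return total


sample = [
    (101, "Real time analytics is fun"),
    (102, "Analytics at scale"),
    (103, "Scale out in real time"),
]
partitions = process_stream(sample)
print(global_counts(partitions).most_common(3))
```

Because each partition only touches its own tweets and counters, the partitions could be processed in parallel on separate nodes. In this sketch they simply run one after another in a single process.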

The example was designed to run on a single desktop. With a slight configuration change, the same example can run on EC2, OpenStack (HP or Rackspace), or a set of bare-metal, non-virtualized machines.

The entire source code and a step-by-step guide are provided here

Reader Comments (4)

Thanks for the article. Might want to correct "pubic cloud" into "public cloud" though.

May 24, 2012 | Unregistered Commenter Dev

What would "deployment into the pubic cloud" (quote) mean?

May 25, 2012 | Unregistered Commenter tobi

@dev fixed, Thanks for spotting this!
@tobi - deployment for public cloud means that you could provision the entire stack on Amazon, OpenStack (HP, Rackspace), or Azure using Cloudify and get the entire environment set up through a single install command. By environment I'm also referring to management, monitoring, and post-deployment aspects such as fail-over and scaling.

You can see how this setup works in this specific example here

For more information on Cloudify and how it works with different cloud environments, I'd suggest that you look at cloudifysource.org - you can see some of the live demos there.

May 27, 2012 | Registered Commenter Nati Shalom

The link to the tutorial and source code is broken.

July 1, 2012 | Unregistered Commenter Dan
