Major social networking platforms such as Facebook and Twitter have developed their own architectures for real-time analytics on huge amounts of data. However, not every company has the need or the resources to build its own Twitter-like solution.
In this example we have taken the same Twitter/Facebook-like blueprint and made it simple enough for developers to implement. Our implementation takes the following approach:
- Use an In-Memory Data Grid (XAP) to handle the real-time stream processing.
- Use a Big Data database (Cassandra) to store the historical data and manage the trend analytics.
- Use Cloudify (cloudifysource.org) to manage and automate the deployment on a private or public cloud.
The example demonstrates a simple case of word-count analytics. It uses Spring Social to plug in to real Twitter feeds. The solution is designed to cope efficiently with receiving and processing a large volume of tweets. First, we partition the tweets so that we can process them in parallel, which means choosing a partitioning key that spreads the load evenly. Partitioning by user might not be sufficiently balanced, so we partition by tweet ID, which we assume to be globally unique. Next, we need to persist and process the data with low latency, so we store the tweets in memory.
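The partition-then-count idea can be sketched in plain Java. This is only an illustration under stated assumptions, not the example's actual code: it does not use the XAP or Spring Social APIs, and the names `TweetWordCount`, `Tweet`, `partitionFor`, `countPartition`, and `countWords` are hypothetical. Plain in-memory lists stand in for the grid partitions, and routing by tweet ID modulo the partition count is an assumed routing rule.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical in-memory sketch; it does not use the XAP or Spring Social APIs.
final class TweetWordCount {

    // Minimal tweet holder; the field names are illustrative.
    static final class Tweet {
        final long id;
        final String text;

        Tweet(long id, String text) {
            this.id = id;
            this.text = text;
        }
    }

    // Map a tweet ID to a partition index in [0, partitions).
    // floorMod keeps the index non-negative even for negative IDs.
    static int partitionFor(long tweetId, int partitions) {
        return (int) Math.floorMod(tweetId, (long) partitions);
    }

    // Group tweets into one in-memory bucket per partition.
    static List<List<Tweet>> partition(List<Tweet> tweets, int partitions) {
        List<List<Tweet>> buckets = new ArrayList<>();
        for (int i = 0; i < partitions; i++) {
            buckets.add(new ArrayList<>());
        }
        for (Tweet tweet : tweets) {
            buckets.get(partitionFor(tweet.id, partitions)).add(tweet);
        }
        return buckets;
    }

    // Count words within a single partition, using only local state.
    static Map<String, Long> countPartition(List<Tweet> bucket) {
        Map<String, Long> counts = new HashMap<>();
        for (Tweet tweet : bucket) {
            // Split on runs of non-word characters; a simplification for real tweets.
            for (String word : tweet.text.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    // Count each partition in parallel, then merge the partial results sequentially.
    static Map<String, Long> countWords(List<Tweet> tweets, int partitions) {
        List<Map<String, Long>> partials = partition(tweets, partitions)
                .parallelStream()
                .map(TweetWordCount::countPartition)
                .collect(Collectors.toList());
        Map<String, Long> total = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((word, count) -> total.merge(word, count, Long::sum));
        }
        return total;
    }
}
```

Calling `TweetWordCount.countWords(tweets, 4)` spreads the tweets over four buckets and returns a single word-to-count map. Because each partition is counted with only its own local state, the per-partition step needs no locking; only the final merge touches shared data, and it runs sequentially.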
The example was designed to run on a single desktop. With a slight configuration change, the same example can run on EC2, on OpenStack (HP or Rackspace), or on a set of bare-metal, non-virtualized machines.
The entire source code and a step-by-step guide are provided here