Entries in big-data (8)


What is NASA Doing with Big Data? Check this Out


In the time it took you to read the sentence above, NASA could have collected 1.73 gigabytes of data from the roughly 100 missions that are currently active. The collection never stops, and its rate is growing exponentially, so managing this data is an uphill task. Yet the data is precious and central to NASA’s science and research. NASA is working hard to make it as accessible as possible for daily operations, for predictions about the universe, and for human well-being through its innovation and creativity.

In version 2.0 of its “Open Government Plan”, published in 2012, NASA touched on its “Big Data” work without going into depth, and said it believed there was much more to explore in this field.

We all know what big data is and what it is used for, so there is no need to define it here. Let’s move on.

NASA’s Big Data Challenge



How to Remove Duplicates in a Large Dataset Reducing Memory Requirements by 99%

This is a guest repost by Suresh Kondamudi from CleverTap.

Dealing with large datasets is often daunting. With limited computing resources, particularly memory, it can be challenging to perform even basic tasks like counting distinct elements, checking membership, filtering duplicate elements, finding the minimum, maximum, or top-n elements, or performing set operations like union, intersection, and similarity.

Probabilistic Data Structures to the Rescue

Probabilistic data structures can come in pretty handy in these cases, in that they dramatically reduce memory requirements while still providing acceptable accuracy. Moreover, you get time efficiencies, as lookups (and adds) rely on multiple independent hash functions, which can be parallelized. We use structures like Bloom filters, MinHash, Count-min sketch, and HyperLogLog extensively to solve a variety of problems. One fairly straightforward example is presented below.

The Problem

We at CleverTap manage mobile push notifications for our customers, and one of the things we need to guard against is sending multiple notifications to the same user for the same campaign. Push notifications are routed to individual devices/users based on push notification tokens generated by the mobile platforms. Because of their size (anywhere from 32 bytes to 4 KB), it is not performant for us to index push tokens or use them as the primary user key.

On certain mobile platforms, when a user uninstalls and subsequently re-installs the same app, we lose our primary user key and create a new user profile for that device. Typically, in that case, the mobile platform will generate a new push notification token for that user on the reinstall. However, that is not always guaranteed. So, in a small number of cases we can end up with multiple user records in our system having the same push notification token.

As a result, to prevent sending multiple notifications to the same user for the same campaign, we need to filter out a relatively small number of duplicate push tokens from a total dataset that runs from hundreds of millions to billions of records. To give you a sense of proportion, the memory required to filter just 100 million push tokens, at 256 bytes per token, is 100M * 256 bytes ≈ 25 GB!
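As a rough check on those numbers, the sketch below compares holding every token exactly with the standard Bloom filter sizing formula m = -n ln p / (ln 2)^2. The 256 bytes per token comes from the figure above; the 1% false-positive rate and the function names are assumptions chosen for illustration, not values from CleverTap.

```python
import math


def exact_set_bytes(n_items: int, bytes_per_item: int) -> int:
    """Memory needed to hold every item verbatim (ignoring container overhead)."""
    return n_items * bytes_per_item


def bloom_filter_bits(n_items: int, false_positive_rate: float) -> int:
    """Standard Bloom filter sizing: m = -n * ln(p) / (ln 2)^2 bits."""
    return math.ceil(-n_items * math.log(false_positive_rate) / (math.log(2) ** 2))


n_tokens = 100_000_000          # 100 million push tokens, as in the text
token_bytes = 256               # per-token size used in the text's arithmetic
assumed_fp_rate = 0.01          # ASSUMPTION: 1% false positives, for illustration

exact_bytes = exact_set_bytes(n_tokens, token_bytes)
bloom_bytes = bloom_filter_bits(n_tokens, assumed_fp_rate) / 8
reduction = 1 - bloom_bytes / exact_bytes

print(f"exact set:    {exact_bytes / 1e9:.1f} GB")
print(f"bloom filter: {bloom_bytes / 1e6:.1f} MB")
print(f"reduction:    {reduction:.2%}")
```

Under these assumptions the filter needs on the order of 120 MB rather than roughly 25 GB, which is in line with the reduction of about 99% claimed in the title.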

The Solution – Bloom filter
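To make the idea concrete, here is a minimal Bloom filter sketch used to drop repeated push tokens from a stream. It is not CleverTap's implementation: the class, the `drop_duplicate_tokens` helper, the sample tokens, and the choice of deriving the hash positions from one SHA-256 digest (double hashing) are all assumptions made for illustration.

```python
import hashlib
from typing import Iterable, Iterator


class BloomFilter:
    """A small Bloom filter backed by a bytearray (illustrative, not tuned)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive k bit positions from one digest via double hashing:
        # position_i = (h1 + i * h2) mod m.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force h2 to be odd
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def drop_duplicate_tokens(tokens: Iterable[str], seen: BloomFilter) -> Iterator[str]:
    """Yield each token the filter has not (probably) seen before."""
    for token in tokens:
        if token in seen:
            continue  # probably a duplicate; could also be a false positive
        seen.add(token)
        yield token


# Hypothetical sample data; real push tokens are much longer.
seen = BloomFilter(num_bits=10_000, num_hashes=7)
tokens = ["tok-a", "tok-b", "tok-a", "tok-c", "tok-b"]
unique = list(drop_duplicate_tokens(tokens, seen))
print(unique)
```

The trade-off in this sketch is that a false positive causes a genuinely new token to be skipped, so the filter size and number of hashes have to be chosen so that this happens rarely enough to be acceptable.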



Making Hadoop Run Faster

One of the challenges in processing data is that the speed at which we can input data is quite often much faster than the speed at which we can process it. This problem becomes even more pronounced in the context of Big Data, where the volume of data keeps on growing, along with a corresponding need for more insights, and thus the need for more complex processing also increases.

Batch Processing to the Rescue

Hadoop was designed to deal with this challenge in the following ways:

1. Use a distributed file system: This enables us to spread the load and grow our system as needed.

2. Optimize for write speed: The Hadoop architecture was designed so that writes are first logged and then processed, which enables fairly fast writes.

3. Use batch processing (Map/Reduce) to balance the speed of the data feeds with the processing speed.
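The Map/Reduce step in the list above can be sketched in a few lines. This is a single-process illustration of the map, shuffle, and reduce phases over a batch of already-logged records; it does not use any Hadoop API, and the function names and sample records are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(record: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework does between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def run_batch(records: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical batch of records that were logged first and are processed later.
batch = ["error disk full", "warning disk slow", "error network down"]
counts = run_batch(batch)
print(counts)
```

In a real cluster the map and reduce calls run on different nodes and the shuffle moves data across the network, which is where much of the batch latency comes from.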

Batch Processing Challenges



Build your own Twitter-like real-time analytics - a step-by-step guide

Major social networking platforms like Facebook and Twitter have developed their own architectures for handling the need for real-time analytics on huge amounts of data. However, not every company has the need or resources to build their own Twitter-like solution.

In this example we have taken the same Twitter/Facebook-like blueprint, and made it simple enough for developers to implement. We have taken the following approach in our implementation: 

  1. Use an In-Memory Data Grid (XAP) for handling the real-time stream data processing.
  2. Use a Big Data database (Cassandra) for storing the historical data and managing the trend analytics.
  3. Use Cloudify for managing and automating the deployment on a private or public cloud.

The example demonstrates a simple case of word count analytics. It uses Spring Social to plug in to real Twitter feeds. The solution is designed to cope efficiently with receiving and processing a large volume of tweets. First, we partition the tweets so that we can process them in parallel, but we have to decide how to partition them efficiently. Partitioning by user might not be sufficiently balanced, so we decided to partition by tweet ID, which we assume to be globally unique. Then we need to persist and process the data with low latency, and for this we store the tweets in memory.
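The partition-then-count approach described above can be sketched as follows. This is plain Python rather than XAP or Spring Social code: the modulo routing, the partition count, the function names, and the sample tweets are all assumptions for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple

NUM_PARTITIONS = 4  # ASSUMPTION: hypothetical number of in-memory partitions


def partition_for(tweet_id: int, num_partitions: int = NUM_PARTITIONS) -> int:
    """Route a tweet by its ID; unique IDs spread evenly under modulo."""
    return tweet_id % num_partitions


def route(
    tweets: Iterable[Tuple[int, str]], num_partitions: int = NUM_PARTITIONS
) -> List[List[str]]:
    """Place each tweet's text in the in-memory partition chosen by its ID."""
    partitions: List[List[str]] = [[] for _ in range(num_partitions)]
    for tweet_id, text in tweets:
        partitions[partition_for(tweet_id, num_partitions)].append(text)
    return partitions


def count_words(texts: Iterable[str]) -> Counter:
    """Word count for one partition; each partition can run independently."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


# Hypothetical (tweet_id, text) pairs standing in for a live feed.
tweets = [
    (101, "big data is big"),
    (102, "data grids are fast"),
    (103, "big clusters"),
    (104, "fast data"),
]
partitions = route(tweets)
per_partition = [count_words(texts) for texts in partitions]
total = sum(per_partition, Counter())  # merge the partial counts
print(total.most_common(3))
```

Routing by user ID instead would send every tweet from a very active account to the same partition, which is the imbalance the text is trying to avoid.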



Big Data In the Cloud Using Cloudify

Edd Dumbill wrote an interesting article on O’Reilly Radar covering the current solutions for running Big Data in the Cloud:

Big data and cloud technology go hand-in-hand. Big data needs clusters of servers for processing, which clouds can readily provide.

Big PaaS

Edd touched briefly on the role of PaaS for delivering Big Data applications in the cloud:

Beyond IaaS, several cloud services provide application layer support for big data work. Sometimes referred to as managed solutions, or platform as a service (PaaS), these services remove the need to scale things such as databases or MapReduce, reducing your workload and maintenance burden. Additionally, PaaS providers can realize great efficiencies by hosting at the application level, and pass those savings on to the customer.



The Real News is Not that Facebook Serves Up 1 Trillion Pages a Month…

It’s how much load that really generates and how it scales to meet the challenge.

There’s some debate over whether Facebook really crossed the threshold of one trillion page views per month. While one report says it did, another respected firm says it did not, and that its page views are a mere 467 billion per month.

In the big scheme of things, the discrepancy is somewhat irrelevant, as neither figure shows the true load on Facebook’s infrastructure, which is a far more impressive set of numbers than its externally measured “page view” metric.



Big Data Application Platform

It's time to think about the architecture and application platforms surrounding "Big Data" databases. Big Data is often centered around new database technologies, mostly from the emerging NoSQL world. The main challenge these databases solve is how to handle massive amounts of data at a reasonable cost and without sacrificing performance. Distributed databases emerged to address this challenge, and today we're seeing a high adoption rate and quite impressive success stories, such as Netflix's use of the Cassandra/DataStax solution. All of this indicates the speed at which this market is evolving.

The need for a Big Data Application Platform



Building your own Facebook Realtime Analytics System  

Recently, I was reading Todd Hoff's write-up on Facebook's real-time analytics system. As usual, Todd did an excellent job of summarizing this video from Alex Himel, Engineering Manager at Facebook.

In the first post, I’d like to summarize the case study and consider some things that weren't mentioned in the summaries. This will lead to an architecture for building your own Real Time Analytics for Big Data that might be easier to implement, using Facebook's experience as a starting point and guide, as well as the experience gathered through recent work with a few GigaSpaces customers. The second post provides a summary of that new approach, as well as a pattern and a demo for building your own Real Time Analytics system.
