Making Hadoop Run Faster
One of the challenges in processing data is that the speed at which we can input data is quite often much faster than the speed at which we can process it. This problem becomes even more pronounced in the context of Big Data, where the volume of data keeps on growing, along with a corresponding need for more insights, and thus the need for more complex processing also increases.
Batch Processing to the Rescue
Hadoop was designed to deal with this challenge in the following ways:
1. Use a distributed file system: This enables us to spread the load and grow our system as needed.
2. Optimize for write speed: To enable fast writes the Hadoop architecture was designed so that writes are first logged, and then processed. This enables fairly fast write speeds.
3. Use batch processing (Map/Reduce) to balance the speed for the data feeds with the processing speed.
Batch Processing Challenges
Major social networking platforms like Facebook and Twitter have developed their own architectures for handling the need for real-time analytics on huge amounts of data. However, not every company has the need or resources to build their own Twitter-like solution.
In this example we have taken the same Twitter/Facebook-like blueprint, and made it simple enough for developers to implement. We have taken the following approach in our implementation:
- Use In Memory Data Grid (XAP) for handling the real time stream data-processing.
- BigData data-base (Cassandra) for storing the historical data and manage the trend analytics
- Use Cloudify (cloudifysource.org) for managing and automating the deployment on private or pubic cloud
The example demonstrate a simple case of word count analytics. It uses Spring Social to plug-in to real twitter feeds. The solution is designed to efficiently cope with getting and processing the large volume of tweets. First, we partition the tweets so that we can process them in parallel, but we have to decide on how to partition them efficiently. Partitioning by user might not be sufficiently balanced, therefore we decided to partition by the tweet ID, which we assume to be globally unique. Then we need persist and process the data with low latency, and for this we store the tweets in memory.
Big data and cloud technology go hand-in-hand. Big data needs clusters of servers for processing, which clouds can readily provide.
Edd touched briefly on the role of PaaS for delivering Big Data applications in the cloud
Beyond IaaS, several cloud services provide application layer support for big data work. Sometimes referred to as managed solutions, or platform as a service (PaaS), these services remove the need to ucale things such as databases or MapReduce, reducing your workload and maintenance burden. Additionally, PaaS providers can realize great efficiencies by hosting at the application level, and pass those savings on to the customer.
The InfoQ article Finding the Right Data Solution for Your Application in the Data Storage Haystack makes a series of concrete recommendations for a user who wants to find the right storage solution for his application.
Few years back, there was a time SQL RDBMS were solution for almost all storage needs, but we all know how scaling came along and shattered the perfect dream. Then NoSQL happened, and now we are end up with a Haystack of solutions. For example, Local memory, Relational, Files, Distributed Cache, Column Family Storage, Document Storage, Name value pairs, Graph DBs, Service Registries, Queue, and Tuple Space etc. are some classes of such solutions.
We discuss about how to find the right storage solution, and we make choices often when we design. But, when comes to describe how to select the right one, we often end up giving very high-level guideline. The article argues that the way to make more concrete recommendations is to drill down into bit more detail and consider them case by case.
To that end the article takes four parameters about an application/usecase (Scale, Consistency, Type of Data, and Queries needed), then take some 40+ cases that arises from different value combination of those parameters and make one or more concrete recommendations on right storage solution for that case.
What follows are the four parameters and potential values they can take and the recommendations for structured, semi-structured, and unstructured data:
It's time to think of the architecture and application platforms surrounding "Big Data" databases. Big Data is often centered around new database technologies mostly from the emerging NoSQL world. The main challenge that these databases solve is how to handle massive amount of data at a reasonable cost and without poor performanc - distributed databases emerged to address this challenge and today we're seeing high adoption rate and quite impressive success stories such as the Netflix use of Cassandra/DataStax solution. All that indicate the speed in which this market evolves.
The need for a Big Data Application Platform
In the first post, I’d like to summarize the case study, and consider some things that weren't mentioned in the summaries. This will lead to an architecture for building your own Realtime Time Analytics for Big-Data that might be easier to implement, using Facebook's experience as a starting point and guide as well as the experience gathered through a recent work with few of GigaSpaces customers. The second post provide a summary of that new approach as well as a pattern and a demo for building your own Real Time Analytics system..
In this post i wanted to spend sometime on the CAP theorem and clarify some of the confusion that i often see when people associate CAP with scalability without fully understanding the implications that comes with it and the alternative approaches
You can read the full article here
The last six months have been exciting for Digg's engineering team. We're working on a soup-to-nuts rewrite. Not only are we rewriting all our application code, but we're also rolling out a new client and server architecture. And if that doesn't sound like a big enough challenge, we're replacing most of our infrastructure components and moving away from LAMP.
Perhaps our most significant infrastructure change is abandoning MySQL in favor of a NoSQL alternative. To someone like me who's been building systems almost exclusively on relational databases for almost 20 years, this feels like a bold move.
What's Wrong with MySQL?
Our primary motivation for moving away from MySQL is the increasing difficulty of building a high performance, write intensive, application on a data set that is growing quickly, with no end in sight. This growth has forced us into horizontal and vertical partitioning strategies that have eliminated most of the value of a relational database, while still incurring all the overhead.
Relational database technology can be a blunt instrument and we're motivated to find a tool that matches our specific needs closely. Our domain area, news, doesn't exact strict consistency requirements, so (according to Brewer's theorem) relaxing this allows gains in availability and partition tolerance (i.e. operations completing, even in degraded system states). We're confident that our engineers can implement application level consistency controls much more efficiently than MySQL does generically.
As our system grows, it's important for us to span multiple data centers for redundancy and network performance and to add capacity or replace failed nodes with no downtime. We plan to continue using commodity hardware, and to continue assuming that it will fail regularly. All of this is increasingly difficult with MySQL.
Terrastore is a new-born document store which provides advanced scalability and elasticity features without sacrificing consistency.
Here are a few highlights:
- Ubiquitous: based on the universally supported HTTP protocol.
- Distributed: nodes can run and live everywhere on your network.
- Elastic: you can add and remove nodes dynamically to/from your running cluster with no downtime and no changes at all to your configuration.
- Scalable at the data layer: documents are partitioned and distributed among your nodes, with automatic and transparent re-balancing when nodes join and leave.
- Scalable at the computational layer: query and update operations are distributed to the nodes which actually holds the queried/updated data, minimizing network traffic and spreading computational load.
- Consistent: providing per-document consistency, you're guaranteed to always get the latest value of a single document, with read committed isolation for concurrent modifications.
- Schemaless: providing a collection-based interface holding JSON documents with no pre-defined schema, you can just create your collections and put everything you want into.
- Easy operations: install a fully working cluster in just a few commands and no XML to edit.
- Features rich: support for push-down predicates, range queries and server-side update functions.
Digg has been researching ways to scale our database infrastructure for some time now. We’ve adopted a traditional vertically partitioned master-slave configuration with MySQL, and also investigated sharding MySQL with IDDB. Ultimately, these solutions left us wanting. In the case of the traditional architecture, the lack of redundancy on the write masters is painful, and both approaches have significant management overhead to keep running.
Since it was already necessary to abandon data normalization and consistency to make these approaches work, we felt comfortable looking at more exotic, non-relational data stores. After considering HBase, Hypertable, Cassandra, Tokyo Cabinet/Tyrant, Voldemort, and Dynomite, we settled on Cassandra.
Each system has its own strengths and weaknesses, but Cassandra has a good blend of everything. It offers column-oriented data storage, so you have a bit more structure than plain key/value stores. It operates in a distributed, highly available, peer-to-peer cluster. While it’s currently lacking some core features, it gets us closer to where we want to be than the other solutions.