Dr. Daniel Abadi, author of the DBMS Musings blog and Cofounder of Hadapt, which offers a product improving Hadoop performance by 50x on relational data, is now taking his talents to graph data in Hadoop's tremendous inefficiency on graph data management (and how to avoid it), which shares the secrets of getting Hadoop to perform 1000x better on graph data.
- Analysing graph data is at the heart of important data mining problems.
- Hadoop is the tool of choice for many of these problems.
- Hadoop style MapReduce works best on KeyValue processing, not graph processing, and can be well over a factor of 1000 less efficient than it needs to be.
- Hadoop inefficiency has consequences in real world. Inefficiencies on graph data problems like improving power utilization, minimizing carbon emissions, and improving product designs, leads to a lot value being left on the table in the form of negative environmental consequences, increased server costs, increased data center space, and increased energy costs.
- 10x improvement by using a clustering algorithm to graph partition data across nodes in the Hadoop cluster. By default in Hadoop data is distributed randomly around a cluster, which means data that's close together in the graph can be very far apart on disk. This is very slow for common operations like sub-graph pattern matching, which prefers neighbors to be stored on the same machine.
- 10x improvement by replicating data on the edges of partitions so that vertexes are stored on the same physical machine as their neighbors. By default Hadoop replicates data 3 times, treating all data equally is inefficient.
- 10x improvement by replacing the physical storage system with graph-optimized storage. HDFS, which is a distributed file system, and HBase, which is an unstructured data storage system, are not optimal data stores for graph data.
Voila! That's a 10x * 10x * 10x = 1000x performance improvement on graph problems using techniques that make a lot of sense. What may be less obvious is the whole idea of keeping the Hadoop shell and making the component parts more efficient for graph problems. Hadoop stays Hadoop externally, but internally has graph super powers. These are strategies you can use.
What I found most intriguing is thinking about the larger consequences of Hadoop being inefficient. There's more in play than I had previously considered. From the most obvious angle, money, we are used to thinking this way about mass produced items. If a widget can be cost reduced by 10 cents and millions of them are made, we are talking real money. If Hadoop is going to be used for the majority of data mining problem, then making it more efficient adds up to real effects. Going to the next level, the more efficient Hadoop becomes, the quicker important problems facing the world will be solved. Interesting.
- Running Large Graph Algorithms - Evaluation Of Current State-Of-The-Art And Lessons Learned
- Scalable SPARQL Querying of Large RDF Graphs by Jiewen Huang, Daniel J. Abadi, Kun Ren.
- Efﬁcient Processing of Data Warehousing Queriesin a Split Execution Environment by Kamil Bajda-Pawlikowski, Daniel J. Abadi, Avi Silberschatz, Erik Paulson
- Golden Orb - open-source implementation of Pregel.
- GraphLab - A New Parallel Framework for Machine Learning
- Apples and oranges: a comparison of RDF benchmarks and real RDF datasets by Songyun Duan, Anastasios Kementsietsidis, Kavitha Srinivas, Octavian Udrea
- Clause-Iteration with MapReduce to Scalably Query Data Graphs in the SHARD Graph-Store by Kurt Rohloff, Richard E. Schantz