Peregrine - A Map Reduce Framework for Iterative and Pipelined Jobs
The Peregrine falcon is a bird of prey, famous for its high-speed diving attacks, feeding primarily on much slower Hadoops. Wait, sorry, it is Kevin Burton of Spinn3r's new Peregrine project--a new FAST, modern map reduce framework optimized for iterative and pipelined map reduce jobs--that feeds on Hadoops.
If you don't know Kevin, he does a lot of excellent technical work that he's kind enough to share on his blog. Only he hasn't been blogging much lately; he's been heads down working on Peregrine. Now that Peregrine has been released, here's a short email interview with Kevin on why you might want to take up falconry, the ancient sport of MapReduce.
What does Spinn3r do that makes Peregrine important to you?
It was designed primarily to execute PageRank, but many iterative applications that we deploy and WANT to deploy (k-means, for example) would be horribly inefficient under Hadoop, as it has no support for merging and joining IO between tasks. It also doesn't support pipelining jobs from one MR task to the next without first writing the intermediate data to the filesystem.
Peregrine also makes a number of other design decisions that improve total system performance and throughput.
Some of these include the use of mmap, combined with mlock to pin pages in memory during certain phases of the maps and reduces to improve performance. This lets us do some zero-copy IO, and we also fadvise pages out of the VFS page cache so that Linux doesn't become confused and start swapping, which is especially annoying when those pages will never be read again.
There are other optimizations too, including support for fallocate, so that map chunks are extent-based and the data is allocated contiguously on disk. Even with an extent-based filesystem like ext4 or XFS you can still end up with fragmentation and degraded performance.
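Roughly, the mmap side of this looks something like the Java sketch below. It is only an illustration under assumptions: the chunk path and size are made up, and mlock, fadvise, and fallocate have no pure-JDK equivalent, so they would be reached through native code; here MappedByteBuffer.load() and simple pre-sizing of the file stand in for them.

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class ChunkMapper {

        // Hypothetical chunk size; Peregrine's real extent size may differ.
        private static final long CHUNK_SIZE = 128L * 1024 * 1024;

        public static MappedByteBuffer mapChunk(String path) throws Exception {
            try (RandomAccessFile raf = new RandomAccessFile(path, "rw");
                 FileChannel channel = raf.getChannel()) {

                // Pre-size the file up front so the filesystem can lay it out as
                // one extent; a true fallocate() would need a native call.
                raf.setLength(CHUNK_SIZE);

                // mmap the chunk instead of copying it through the JVM heap.
                MappedByteBuffer buf =
                        channel.map(FileChannel.MapMode.READ_WRITE, 0, CHUNK_SIZE);

                // Touch the pages so they are resident while the map phase runs.
                // Actually pinning them (mlock) or dropping them afterwards
                // (fadvise DONTNEED) would require native calls via JNI/JNA.
                buf.load();
                return buf;
            }
        }
    }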
We also avoid using the JVM heap for bulk data at all and allocate memory directly from the OS... there is just a core 64-128MB of memory used for smaller data structures in the JVM (and classes, etc.), but the bulk of the memory used for caches and data structures is used directly in the OS via file-backed and anonymous mmap.
This is done to avoid excessive copying of data between the JVM and the OS which just wastes CPU and is unnecessary.
Further, the JVM cannot shrink the heap once it has been set or expanded. This means that if you have two algorithms, one that uses mmap and another that uses the heap, you have to ditch the efficient mmap version and move all of your data into the heap (which is not going to improve performance).
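As a minimal sketch of the off-heap idea, here is what keeping bulk data outside the JVM heap can look like, with the JDK's direct buffers standing in for the anonymous mmap allocations described above (the sizes are illustrative, not Peregrine's actual numbers):

    import java.nio.ByteBuffer;

    public class OffHeapCache {

        public static void main(String[] args) {
            // Small on-heap structures only: indexes, bookkeeping, classes.
            int[] index = new int[1024];

            // Bulk data lives outside the JVM heap. allocateDirect() is the JDK
            // stand-in here; Peregrine allocates natively via anonymous mmap,
            // so the garbage collector never has to scan or copy this memory.
            ByteBuffer cache = ByteBuffer.allocateDirect(256 * 1024 * 1024);

            cache.putLong(0, 42L);                // write straight into OS memory
            System.out.println(cache.getLong(0)); // read it back: 42
            System.out.println("on-heap index entries: " + index.length);
        }
    }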
You mention Peregrine initially started off to test some ideas, what were those ideas?
A few actually... in no particular order:
- That a functional MR runtime can be small and tight... around 20k lines of code
- That shuffling IO directly to the reducer nodes is a better design than Hadoop's intermediate shuffling. This also mandates using fully async IO to get past the C10k problem.
- That support for merge() alongside map and reduce (a la MapReduceMerge) was important and that we wanted it integrated (a rough sketch of what such an interface might look like follows this list).
- That one should NOT shy away from using the OS (mmap, fallocate, fadvise, sendfile, etc).
- That support for a cost-based job scheduler needs to be integrated into the core (a la FlumeJava, Cascading, etc.).
- That a MR runtime can and should also support higher level runtimes like Pregel without massive changes.
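To give a rough idea of what the merge() bullet means, here is what such an operation might look like sitting alongside map and reduce. These interfaces are purely illustrative and are not Peregrine's actual API:

    import java.util.Iterator;
    import java.util.List;

    // Illustrative interfaces only; Peregrine's real API may differ.
    interface Emitter<K, V> {
        void emit(K key, V value);
    }

    interface Mapper<K1, V1, K2, V2> {
        void map(K1 key, V1 value, Emitter<K2, V2> out);
    }

    interface Reducer<K, V> {
        void reduce(K key, Iterator<V> values, Emitter<K, V> out);
    }

    // merge() joins two reduced, key-partitioned datasets by key,
    // a la the MapReduceMerge paper.
    interface Merger<K, V1, V2, V3> {
        void merge(K key, List<V1> left, List<V2> right, Emitter<K, V3> out);
    }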
My goal after getting a core Peregrine 1.0 done is to start working on a Pregel runtime backed by a FlumeJava-style cost-based optimizer.
What is lacking in the current way of doing things? It sounds a little like using in-memory Hadoop as a pure compute cluster.
I wouldn't say 'in-memory' as much as 'avoid writing to disk' for a number of the optimizations in Peregrine.
For example, our direct shuffling means that there's a whole IO phase that is removed... so I would have to be amazingly inefficient to merely break EVEN with the indirect shuffling that Hadoop currently does.
Essentially Hadoop takes all output from a map job, then writes it to the local filesystem.
Back when Hadoop was initially designed this was probably the right course of action, but not so much now that it's nearly 2012 and async IO implementations like Netty are available (Peregrine is based on Netty for IO).
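A very rough sketch of the direct-shuffle idea: partition each emitted record by key and write it straight to the socket of the reducer node that owns that partition, instead of spilling to the local filesystem first. Peregrine does this with Netty and fully async IO; the blocking SocketChannels and made-up host list below are only to keep the illustration short:

    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SocketChannel;
    import java.nio.charset.StandardCharsets;

    public class DirectShuffle {

        private final SocketChannel[] reducers;

        // One open connection per reducer node; the hosts are hypothetical.
        public DirectShuffle(String[] reducerHosts, int port) throws Exception {
            reducers = new SocketChannel[reducerHosts.length];
            for (int i = 0; i < reducerHosts.length; i++) {
                reducers[i] = SocketChannel.open(new InetSocketAddress(reducerHosts[i], port));
            }
        }

        // Map output goes straight over the wire to the node that owns the
        // key's partition; nothing is spilled to the local filesystem.
        public void emit(String key, String value) throws Exception {
            int partition = (key.hashCode() & 0x7fffffff) % reducers.length;
            byte[] record = (key + "\t" + value + "\n").getBytes(StandardCharsets.UTF_8);
            ByteBuffer buf = ByteBuffer.wrap(record);
            while (buf.hasRemaining()) {
                reducers[partition].write(buf);
            }
        }
    }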
What does "running iterative jobs across partitions of data" mean?
If you have two MR jobs, say MR0 and MR1, and MR1 needs to join against the output from MR0, there is essentially no efficient way to do this with Hadoop because the HDFS blocks aren't stored on disk in a deterministic location.
You basically have to merge the output from MR0 and MR1 into MR' and then sort the records, which is amazingly inefficient.
With Peregrine, you just have two files that are split by key for MR0 and MR1... so all the data is sitting on disk pre-sorted on the SAME physical server. Peregrine just does a fast on-disk merge join of the two files and then runs reduce() over the joined output.
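A simplified sketch of that merge join, assuming each job's partition is a file of tab-separated key/value lines already sorted by key, with one record per key per side (the file format is made up for the example; Peregrine's on-disk format is its own):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class MergeJoin {

        // Stream two key-sorted partition files (one record per key per side,
        // "key<TAB>value" per line) and emit joined records for keys present in
        // both, in a single sequential pass over each file.
        public static void join(String mr0Path, String mr1Path) throws IOException {
            try (BufferedReader a = Files.newBufferedReader(Paths.get(mr0Path));
                 BufferedReader b = Files.newBufferedReader(Paths.get(mr1Path))) {

                String left = a.readLine();
                String right = b.readLine();

                while (left != null && right != null) {
                    String[] l = left.split("\t", 2);
                    String[] r = right.split("\t", 2);
                    int cmp = l[0].compareTo(r[0]);

                    if (cmp == 0) {
                        // Both sides have this key: this pair is what reduce() sees.
                        System.out.println(l[0] + "\t" + l[1] + "\t" + r[1]);
                        left = a.readLine();
                        right = b.readLine();
                    } else if (cmp < 0) {
                        left = a.readLine();   // advance whichever side is behind
                    } else {
                        right = b.readLine();
                    }
                }
            }
        }
    }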
This also means that you can take some invariant/static data, keep it in Peregrine, and constantly join against it across hundreds of jobs without having to copy the IO every time.
If you're running PageRank across 100 iterations this is a major performance improvement. Or even better... say you have DOZENS of PageRank variants you want to compute! Or say you have 100 k-means clusterings you want to compute, each with different parameters.
Merging against this static, pre-sorted data can be a massive performance advantage.
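In PageRank terms, the driver loop might look roughly like the following, with the link graph partitioned and sorted once and only the rank file changing between iterations. The joinAndReduce() helper and the file names are hypothetical stand-ins, not Peregrine's API:

    public class PageRankDriver {

        // joinAndReduce() is a hypothetical stand-in for one Peregrine iteration:
        // a merge join of two key-partitioned files followed by a reduce() whose
        // output is written back out with the same partitioning.
        static void joinAndReduce(String staticInput, String ranksIn, String ranksOut) {
            // ... one merge join + reduce pass, as in the sketch above ...
        }

        public static void main(String[] args) {
            // The link graph is invariant: partitioned and sorted by key once,
            // then joined against on every iteration without being re-shuffled.
            String links = "links.sorted";   // hypothetical partition file name
            String ranks = "ranks.iter0";    // initial ranks, same partitioning

            for (int i = 0; i < 100; i++) {
                String next = "ranks.iter" + (i + 1);
                joinAndReduce(links, ranks, next);
                ranks = next;   // only the rank file changes between iterations
            }
        }
    }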
This isn't to say at all that Hadoop isn't an amazingly useful and powerful tool. Hive and Pig are awesome, and if you're dealing with text files and batch processing them, this is probably the right toolchain.
But if you're doing an iterative computation over discrete data structures, you need to join across iterations, and performance is a must, then you should investigate using Peregrine.