Map-Reduce With Ruby Using Hadoop

A demonstration, with repeatable steps, of how to quickly fire-up a Hadoop cluster on Amazon EC2, load data onto the HDFS (Hadoop Distributed File-System), write map-reduce scripts in Ruby and use them to run a map-reduce job on your Hadoop cluster. You will not need to ssh into the cluster, as all tasks are run from your local machine. Below I am using my MacBook Pro as my local machine, but the steps I have provided should be reproducible on other platforms running bash and Java.

Fire-Up Your Hadoop Cluster

I chose the Cloudera distribution of Hadoop, which is still 100% Apache licensed but has some additional benefits. One of these is that it is released by Doug Cutting, who started Hadoop and drove its development at Yahoo! He also started Lucene, another of my favourite Apache projects, so I have good faith that he knows what he is doing. Another benefit, as you will see, is that it is simple to fire up a Hadoop cluster.

I am going to use Cloudera’s Whirr script, which will allow me to fire up a production-ready Hadoop cluster on Amazon EC2 directly from my laptop. Whirr is built on jclouds, which means other cloud providers should be supported, though only Amazon EC2 has been tested. Once we have Whirr installed, we will configure a hadoop.properties file with our Amazon EC2 credentials and the details of our desired Hadoop cluster. Whirr will use this hadoop.properties file to build the cluster.

If you are on Debian or Red Hat you can use apt-get or yum to install Whirr, but since I’m on Mac OS X, I’ll need to download and build Whirr myself.
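For the package-manager route, the commands would be something along these lines. This assumes Cloudera’s CDH3 repository is already configured on the machine and that the package is simply named whirr; check Cloudera’s documentation for the exact package name in your release.

sudo apt-get install whirr   # Debian/Ubuntu, assuming the CDH3 repository is enabled
sudo yum install whirr       # Red Hat/CentOS, assuming the CDH3 repository is enabled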

The current version of Whirr, 0.2.0, hosted on the Apache Incubator site, is not compatible with Cloudera’s Distribution for Hadoop (CDH), so I am downloading version 0.1.0+23.

mkdir ~/src/cloudera
cd ~/src/cloudera
wget http://archive.cloudera.com/cdh/3/whirr-0.1.0+23.tar.gz
tar -xvzf whirr-0.1.0+23.tar.gz

To build Whirr you’ll need Java (version 1.6), Maven (>= 2.2.1) and Ruby (>= 1.8.7) installed. If you’re running the latest Mac OS X then you should have a recent Java, and I’ll assume, given the title of this post, that you can manage the Ruby version. If you are not familiar with Maven, you can install it via Homebrew on Mac OS X using the brew commands below. On Debian use apt-get install maven2.

sudo brew update
sudo brew install maven
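If you want to sanity-check the toolchain before building, the usual version flags will do it:

java -version     # expect 1.6.x
mvn --version     # expect 2.2.1 or later
ruby --version    # expect 1.8.7 or later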

Once the dependencies are installed we can build the whirr tool.

cd whirr-0.1.0+23
mvn clean install
mvn package -Ppackage

In true Maven style, it will download a long list of dependencies the first time you build this. Be patient.

Ok, it should be built now and, if you’re anything like me, you will have used the time to get a nice cuppa tea or a sandwich. Let’s sanity check the whirr script…

bin/whirr version

You should see something like “Apache Whirr 0.1.0+23” output to the terminal.

Create a hadoop.properties file with the following content.

whirr.service-name=hadoop
whirr.cluster-name=myhadoopcluster
whirr.instance-templates=1 jt+nn,1 dn+tt
whirr.provider=ec2
whirr.identity=<cloud-provider-identity>
whirr.credential=<cloud-provider-credential>
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub
whirr.hadoop-install-runurl=cloudera/cdh/install
whirr.hadoop-configure-runurl=cloudera/cdh/post-configure

Replace <cloud-provider-identity> and <cloud-provider-credential> with your Amazon EC2 Access Key ID and Amazon EC2 Secret Access Key (I will not tell you what mine is).

This configuration is a little boring with only two machines: one for the master and one for the worker. You can get more creative once you are up and running. Let’s fire up our “cluster”.

bin/whirr launch-cluster --config hadoop.properties

This is another good time to put the kettle on, as it takes a few minutes to get up and running. If you are curious, or worried that things have come to a halt, Whirr writes a whirr.log in the current directory. Fire up another terminal window and tail the log.

cd ~/src/cloudera/whirr-0.1.0+23
tail -F whirr.log

16 minutes (and several cups of tea) later the cluster is up and running. Here is the output I saw in my terminal.

Launching myhadoopcluster cluster
Configuring template
Starting master node
Master node started: [[id=us-east-1/i-561d073b, providerId=i-561d073b, tag=myhadoopcluster, name=null, location=[id=us-east-1d, scope=ZONE, description=us-east-1d, parent=us-east-1], uri=null, imageId=us-east-1/ami-d59d6bbc, os=[name=null, family=amzn-linux, version=2010.11.1-beta, arch=paravirtual, is64Bit=false, description=amzn-ami-us-east-1/amzn-ami-2010.11.1-beta.i386.manifest.xml], userMetadata={}, state=RUNNING, privateAddresses=[10.113.23.123], publicAddresses=[72.44.45.199], hardware=[id=m1.small, providerId=m1.small, name=m1.small, processors=[[cores=1.0, speed=1.0]], ram=1740, volumes=[[id=null, type=LOCAL, size=10.0, device=/dev/sda1, durable=false, isBootDevice=true], [id=null, type=LOCAL, size=150.0, device=/dev/sda2, durable=false, isBootDevice=false]], supportsImage=Not(is64Bit())]]]
Authorizing firewall
Starting 1 worker node(s)
Worker nodes started: [[id=us-east-1/i-98100af5, providerId=i-98100af5, tag=myhadoopcluster, name=null, location=[id=us-east-1d, scope=ZONE, description=us-east-1d, parent=us-east-1], uri=null, imageId=us-east-1/ami-d59d6bbc, os=[name=null, family=amzn-linux, version=2010.11.1-beta, arch=paravirtual, is64Bit=false, description=amzn-ami-us-east-1/amzn-ami-2010.11.1-beta.i386.manifest.xml], userMetadata={}, state=RUNNING, privateAddresses=[10.116.147.148], publicAddresses=[184.72.179.36], hardware=[id=m1.small, providerId=m1.small, name=m1.small, processors=[[cores=1.0, speed=1.0]], ram=1740, volumes=[[id=null, type=LOCAL, size=10.0, device=/dev/sda1, durable=false, isBootDevice=true], [id=null, type=LOCAL, size=150.0, device=/dev/sda2, durable=false, isBootDevice=false]], supportsImage=Not(is64Bit())]]]
Completed launch of myhadoopcluster
Web UI available at http://ec2-72-44-45-199.compute-1.amazonaws.com
Wrote Hadoop site file /Users/phil/.whirr/myhadoopcluster/hadoop-site.xml
Wrote Hadoop proxy script /Users/phil/.whirr/myhadoopcluster/hadoop-proxy.sh
Started cluster of 2 instances
HadoopCluster{instances=[Instance{roles=[jt, nn], publicAddress=ec2-72-44-45-199.compute-1.amazonaws.com/72.44.45.199, privateAddress=/10.113.23.123}, Instance{roles=[tt, dn], publicAddress=/184.72.179.36, privateAddress=/10.116.147.148}], configuration={fs.default.name=hdfs://ec2-72-44-45-199.compute-1.amazonaws.com:8020/, mapred.job.tracker=ec2-72-44-45-199.compute-1.amazonaws.com:8021, hadoop.job.ugi=root,root, hadoop.rpc.socket.factory.class.default=org.apache.hadoop.net.SocksSocketFactory, hadoop.socks.server=localhost:6666}}

Whirr has created a directory with some files in our home directory…

~/.whirr/myhadoopcluster/hadoop-proxy.sh
~/.whirr/myhadoopcluster/hadoop-site.xml

The hadoop-proxy.sh script is used to access the web interface of Hadoop securely. When we run it, it tunnels through to the cluster and gives us access in the web browser via a SOCKS proxy.
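If you are curious what the generated script is actually doing: judging by the hadoop.socks.server=localhost:6666 setting written into hadoop-site.xml above, it is essentially SSH dynamic port forwarding to the master node. A rough sketch of the idea, not necessarily the exact options or user that Whirr generates, would look like this:

# Sketch only: open a SOCKS proxy on localhost:6666 that tunnels traffic through the master node
ssh -i ~/.ssh/id_rsa -N -D 6666 root@ec2-72-44-45-199.compute-1.amazonaws.com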

You need to configure the SOCKS proxy in either your web browser or, in my case, the Mac OS X settings menu.

Hadoop SOCKS Proxy Configuration for Mac OS X

Now start the proxy in your terminal…
(Note: there has still been no need to ssh into the cluster. Everything in this post is done from our local machine.)

sh ~/.whirr/myhadoopcluster/hadoop-proxy.sh
Running proxy to Hadoop cluster at ec2-72-44-45-199.compute-1.amazonaws.com. Use Ctrl-c to quit.

The above will output the hostname through which you can access the cluster. On Amazon EC2 it looks something like http://ec2-72-44-45-199.compute-1.amazonaws.com:50070/dfshealth.jsp. Use this hostname to view the cluster in your web browser.

http://<hostname>:50070/dfshealth.jsp

HDFS Health Dashboard

If you click on the link to “Browse the filesystem” you will notice that the hostname changes. Due to HDFS’s distributed nature, this jumps around the data-nodes in your cluster, although you currently only have one data-node. On Amazon EC2 this new hostname will be the internal hostname of the data-node server, which is reachable because you are tunnelling through the SOCKS proxy.

HDFS File Browser

Ok! It looks as though our Hadoop cluster is up and running. Let’s upload our data.

Setting Up Your Local Hadoop Client

To run a map-reduce job on your data, your data needs to be on the Hadoop Distributed File-System (HDFS). You can interact with Hadoop and HDFS using the hadoop command. We do not have Hadoop installed on our local machine, so we can either log into one of our Hadoop cluster machines and run the hadoop command from there, or install Hadoop on our local machine. I’m going to opt for installing Hadoop locally (recommended), as it makes it easier to interact with HDFS and to start map-reduce jobs directly from my laptop.

Cloudera does not, unfortunately, provide a release of Hadoop for Mac OS X, only Debian packages and RPMs. They do, however, provide a .tar.gz download, which we are going to use to install Hadoop locally. Hadoop is built with Java and its scripts are written in bash, so there should not be too many compatibility problems on platforms that can run Java and bash.

Visit the Cloudera CDH Release webpage and select the CDH3 Patched Tarball. I downloaded the same version, hadoop-0.20.2+737.tar.gz, that Whirr installed on the cluster.

tar -xvzf hadoop-0.20.2+737.tar.gz
sudo mv hadoop-0.20.2+737 /usr/local/
cd /usr/local
sudo ln -s hadoop-0.20.2+737 hadoop
echo 'export HADOOP_HOME=/usr/local/hadoop' >> ~/.profile
echo 'export PATH=$PATH:$HADOOP_HOME/bin' >> ~/.profile
source ~/.profile
which hadoop # should output "/usr/local/hadoop/bin/hadoop"
hadoop version # should output "Hadoop 0.20.2+737 ..."
cp ~/.whirr/myhadoopcluster/hadoop-site.xml /usr/local/hadoop/conf/

Now run your first command from your local machine to interact with HDFS. The following command is similar to “ls -l /” in bash.

hadoop fs -ls /

You should see the following output, which lists the root of the Hadoop filesystem.

10/12/30 18:19:59 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
Found 4 items
drwxrwxrwx   - hdfs supergroup          0 2010-12-28 10:33 /hadoop
drwxrwxrwx   - hdfs supergroup          0 2010-12-28 10:33 /mnt
drwxrwxrwx   - hdfs supergroup          0 2010-12-28 10:33 /tmp
drwxrwxrwx   - hdfs supergroup          0 2010-12-28 10:33 /user

Yes, you will see a deprecation warning, since the hadoop-site.xml configuration has been split into multiple files. We will not worry about this here.

Defining The Map-Reduce Task

We are going to write a map-reduce job that scans all the files in a given directory, takes the words found in those files and counts how many words begin with each two-character prefix.

For this we’re going to use a dictionary file found on my Mac OS X /usr/share/dict/words. It contains 234936 words, each on a newline. Linux has a similar dictionary file.

Uploading Your Data To HDFS (Hadoop Distributed FileSystem)

hadoop fs -mkdir input
hadoop fs -put /usr/share/dict/words input/
hadoop fs -ls input

You should see output similar to the following, which lists the words file on the remote HDFS. Since my local user is “phil”, Hadoop has added the file under /user/phil on HDFS.

Found 1 items
-rw-r--r--   3 phil supergroup    2486813 2010-12-30 18:43 /user/phil/input/words
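If you are curious whether the contents arrived intact, you can peek at the first few lines of the file straight from HDFS:

hadoop fs -cat input/words | head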

Congratulations! You have just uploaded your first file to the Hadoop Distributed File-System on your cluster in the cloud. Now we can write our Map-Reduce Ruby scripts. Continue reading...