Tuesday, January 4, 2011

Map-Reduce With Ruby Using Hadoop

A demonstration, with repeatable steps, of how to quickly fire up a Hadoop cluster on Amazon EC2, load data onto the HDFS (Hadoop Distributed File System), write map-reduce scripts in Ruby and use them to run a map-reduce job on your Hadoop cluster. You will not need to ssh into the cluster, as all tasks are run from your local machine. Below I am using my MacBook Pro as my local machine, but the steps I have provided should be reproducible on other platforms running bash and Java.

    Fire-Up Your Hadoop Cluster

    I chose the Cloudera distribution of Hadoop, which is still 100% Apache licensed but has some additional benefits. One of these is that it is released by Doug Cutting, who started Hadoop and drove its development at Yahoo!. He also started Lucene, another of my favourite Apache projects, so I have good faith that he knows what he is doing. Another benefit, as you will see, is that it is simple to fire up a Hadoop cluster.

    I am going to use Cloudera’s Whirr script, which will allow me to fire up a production ready Hadoop cluster on Amazon EC2 directly from my laptop. Whirr is built on jclouds, meaning other cloud providers should be supported, but only Amazon EC2 has been tested. Once we have Whirr installed, we will configure a hadoop.properties file with our Amazon EC2 credentials and the details of our desired Hadoop cluster. Whirr will use this hadoop.properties file to build the cluster.

    If you are on Debian or Red Hat you can use apt-get or yum to install Whirr, but since I'm on Mac OS X, I'll need to download the Whirr script.

    The current version of Whirr, 0.2.0, hosted on the Apache Incubator site, is not compatible with Cloudera's Distribution for Hadoop (CDH), so I'm downloading version 0.1.0+23.

    mkdir -p ~/src/cloudera
    cd ~/src/cloudera
    wget http://archive.cloudera.com/cdh/3/whirr-0.1.0+23.tar.gz
    tar -xvzf whirr-0.1.0+23.tar.gz

    To build Whirr you'll need Java (version 1.6), Maven (>= 2.2.1) and Ruby (>= 1.8.7) installed. If you're running the latest Mac OS X then you should have a recent Java, and I'll assume, given the title of this post, that you can manage the Ruby version. If you are not familiar with Maven, you can install it via Homebrew on Mac OS X using the brew commands below. On Debian use apt-get install maven2.

    brew update
    brew install maven

    Once the dependencies are installed we can build the whirr tool.

    cd whirr-0.1.0+23
    mvn clean install
    mvn package -Ppackage

    In true Maven style, it will download a long list of dependencies the first time you build this. Be patient.

    Ok, it should be built now and, if you're anything like me, you will have used the time to get a nice cup of tea or a sandwich. Let's sanity-check the whirr script…

    bin/whirr version

    You should see something like "Apache Whirr 0.1.0+23" output to the terminal.

    Create a hadoop.properties file with the following content.

    whirr.service-name=hadoop
    whirr.cluster-name=myhadoopcluster
    whirr.instance-templates=1 jt+nn,1 dn+tt
    whirr.provider=ec2
    whirr.identity=<cloud-provider-identity>
    whirr.credential=<cloud-provider-credential>
    whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
    whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub
    whirr.hadoop-install-runurl=cloudera/cdh/install
    whirr.hadoop-configure-runurl=cloudera/cdh/post-configure

    Replace <cloud-provider-identity> and <cloud-provider-credential> with your Amazon EC2 Access Key ID and Amazon EC2 Secret Access Key (I will not tell you what mine is).

    This configuration is a little boring with only two machines: one master running the JobTracker and NameNode (jt+nn) and one worker running the DataNode and TaskTracker (dn+tt). You can get more creative once you are up and running. Let's fire up our "cluster".

    bin/whirr launch-cluster --config hadoop.properties

    This is another good time to put the kettle on, as it takes a few minutes to get up and running. If you are curious, or worried that things have ground to a halt, Whirr writes a whirr.log in the current directory. Fire up another terminal window and tail the log.

    cd ~/src/cloudera/whirr-0.1.0+23
    tail -F whirr.log

    16 minutes (and several cups of tea) later the cluster is up and running. Here is the output I saw in my terminal.

    Launching myhadoopcluster cluster
    Configuring template
    Starting master node
    Master node started: [[id=us-east-1/i-561d073b, providerId=i-561d073b, tag=myhadoopcluster, name=null, location=[id=us-east-1d, scope=ZONE, description=us-east-1d, parent=us-east-1], uri=null, imageId=us-east-1/ami-d59d6bbc, os=[name=null, family=amzn-linux, version=2010.11.1-beta, arch=paravirtual, is64Bit=false, description=amzn-ami-us-east-1/amzn-ami-2010.11.1-beta.i386.manifest.xml], userMetadata={}, state=RUNNING, privateAddresses=[10.113.23.123], publicAddresses=[72.44.45.199], hardware=[id=m1.small, providerId=m1.small, name=m1.small, processors=[[cores=1.0, speed=1.0]], ram=1740, volumes=[[id=null, type=LOCAL, size=10.0, device=/dev/sda1, durable=false, isBootDevice=true], [id=null, type=LOCAL, size=150.0, device=/dev/sda2, durable=false, isBootDevice=false]], supportsImage=Not(is64Bit())]]]
    Authorizing firewall
    Starting 1 worker node(s)
    Worker nodes started: [[id=us-east-1/i-98100af5, providerId=i-98100af5, tag=myhadoopcluster, name=null, location=[id=us-east-1d, scope=ZONE, description=us-east-1d, parent=us-east-1], uri=null, imageId=us-east-1/ami-d59d6bbc, os=[name=null, family=amzn-linux, version=2010.11.1-beta, arch=paravirtual, is64Bit=false, description=amzn-ami-us-east-1/amzn-ami-2010.11.1-beta.i386.manifest.xml], userMetadata={}, state=RUNNING, privateAddresses=[10.116.147.148], publicAddresses=[184.72.179.36], hardware=[id=m1.small, providerId=m1.small, name=m1.small, processors=[[cores=1.0, speed=1.0]], ram=1740, volumes=[[id=null, type=LOCAL, size=10.0, device=/dev/sda1, durable=false, isBootDevice=true], [id=null, type=LOCAL, size=150.0, device=/dev/sda2, durable=false, isBootDevice=false]], supportsImage=Not(is64Bit())]]]
    Completed launch of myhadoopcluster
    Web UI available at http://ec2-72-44-45-199.compute-1.amazonaws.com
    Wrote Hadoop site file /Users/phil/.whirr/myhadoopcluster/hadoop-site.xml
    Wrote Hadoop proxy script /Users/phil/.whirr/myhadoopcluster/hadoop-proxy.sh
    Started cluster of 2 instances
    HadoopCluster{instances=[Instance{roles=[jt, nn], publicAddress=ec2-72-44-45-199.compute-1.amazonaws.com/72.44.45.199, privateAddress=/10.113.23.123}, Instance{roles=[tt, dn], publicAddress=/184.72.179.36, privateAddress=/10.116.147.148}], configuration={fs.default.name=hdfs://ec2-72-44-45-199.compute-1.amazonaws.com:8020/, mapred.job.tracker=ec2-72-44-45-199.compute-1.amazonaws.com:8021, hadoop.job.ugi=root,root, hadoop.rpc.socket.factory.class.default=org.apache.hadoop.net.SocksSocketFactory, hadoop.socks.server=localhost:6666}}

    Whirr has created a directory with some files in our home directory…

    ~/.whirr/myhadoopcluster/hadoop-proxy.sh
    ~/.whirr/myhadoopcluster/hadoop-site.xml

    The hadoop-proxy.sh script is used to access the web interface of Hadoop securely. When we run it, it tunnels through to the cluster and gives us access in the web browser via a SOCKS proxy.

    You need to configure the SOCKS proxy in either your web browser or, in my case, the Mac OS X settings menu.

    Hadoop SOCKS Proxy Configuration for Mac OS X

    Now start the proxy in your terminal…
    (Note: There has still been no need to ssh into the cluster. Everything in this post is done on our local machine)

    sh ~/.whirr/myhadoopcluster/hadoop-proxy.sh

    Running proxy to Hadoop cluster at ec2-72-44-45-199.compute-1.amazonaws.com. Use Ctrl-c to quit.

    The proxy script outputs the hostname at which you can access the cluster. On Amazon EC2 it looks something like http://ec2-72-44-45-199.compute-1.amazonaws.com:50070/dfshealth.jsp. Use this hostname to view the cluster in your web browser.

    http://<hostname>:50070/dfshealth.jsp

    HDFS Health Dashboard

    If you click on the link to "Browse the filesystem", you will notice the hostname changes. Due to HDFS's distributed nature, this jumps around the data-nodes in your cluster; you currently have only one data-node. On Amazon EC2 this new hostname will be the internal hostname of the data-node server, which is reachable because you are tunnelling through the SOCKS proxy.

    HDFS File Browser

    Ok! It looks as though our Hadoop cluster is up and running. Let’s upload our data.

    Setting Up Your Local Hadoop Client

    To run a map-reduce job on your data, your data needs to be on the Hadoop Distributed File System, otherwise known as HDFS. You can interact with Hadoop and HDFS via the hadoop command. We do not have Hadoop installed on our local machine, so we can either log into one of our Hadoop cluster machines and run the hadoop command from there, or install Hadoop locally. I'm going to opt for installing Hadoop on my local machine (recommended), as it will be easier to interact with HDFS and start map-reduce jobs directly from my laptop.

    Cloudera does not, unfortunately, provide a release of Hadoop for Mac OS X, only Debs and RPMs. They do, however, provide a .tar.gz download, which we are going to use to install Hadoop locally. Hadoop is built with Java and its scripts are written in bash, so there should not be too many compatibility problems on platforms that can run Java and bash.

    Visit the Cloudera CDH Release webpage and select the CDH3 Patched Tarball. I downloaded the same version, hadoop-0.20.2+737.tar.gz, that Whirr installed on the cluster.

    tar -xvzf hadoop-0.20.2+737.tar.gz
    sudo mv hadoop-0.20.2+737 /usr/local/
    cd /usr/local
    sudo ln -s hadoop-0.20.2+737 hadoop
    echo 'export HADOOP_HOME=/usr/local/hadoop' >> ~/.profile
    echo 'export PATH=$PATH:$HADOOP_HOME/bin' >> ~/.profile
    source ~/.profile
    which hadoop # should output "/usr/local/hadoop/bin/hadoop"
    hadoop version # should output "Hadoop 0.20.2+737 ..."
    cp ~/.whirr/myhadoopcluster/hadoop-site.xml /usr/local/hadoop/conf/

    Now run your first command from your local machine to interact with HDFS. The following command is similar to "ls -l /" in bash.

    hadoop fs -ls /

    You should see the following output, which lists the root of the Hadoop filesystem.

    10/12/30 18:19:59 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
    Found 4 items
    drwxrwxrwx   - hdfs supergroup          0 2010-12-28 10:33 /hadoop
    drwxrwxrwx   - hdfs supergroup          0 2010-12-28 10:33 /mnt
    drwxrwxrwx   - hdfs supergroup          0 2010-12-28 10:33 /tmp
    drwxrwxrwx   - hdfs supergroup          0 2010-12-28 10:33 /user

    Yes, you will see a deprecation warning, since the hadoop-site.xml configuration has been split into multiple files. We will not worry about that here.

    Defining The Map-Reduce Task

    We are going to write a map-reduce job that scans all the files in a given directory, takes the words found in those files, and counts how many words begin with each two-character prefix.

    For this we’re going to use a dictionary file found on my Mac OS X /usr/share/dict/words. It contains 234936 words, each on a newline. Linux has a similar dictionary file.

    Uploading Your Data To HDFS (Hadoop Distributed FileSystem)

    hadoop fs -mkdir input
    hadoop fs -put /usr/share/dict/words input/
    hadoop fs -ls input

    You should see output similar to the following, which lists the words file on the remote HDFS. Since my local user is "phil", Hadoop has put the file under /user/phil on HDFS.

    Found 1 items
    -rw-r--r--   3 phil supergroup    2486813 2010-12-30 18:43 /user/phil/input/words

    Congratulations! You have just uploaded your first file to the Hadoop Distributed File-System on your cluster in the cloud.
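    If you end up scripting against HDFS from Ruby, one low-tech option is to shell out to the hadoop command and parse its output. The helper below is a hypothetical sketch of my own (not part of the Hadoop toolchain), parsing the 0.20-era `hadoop fs -ls` line format shown above:

```ruby
# Hypothetical helper: parse one line of 0.20-era `hadoop fs -ls` output.
# Fields: permissions, replication ("-" for directories), owner, group,
# size in bytes, modification date, modification time, path.
HdfsEntry = Struct.new(:permissions, :replication, :owner, :group, :size, :path)

def parse_hdfs_ls_line(line)
  perms, rep, owner, group, size, _date, _time, path = line.split
  HdfsEntry.new(perms, rep == "-" ? nil : rep.to_i, owner, group, size.to_i, path)
end

entry = parse_hdfs_ls_line(
  "-rw-r--r--   3 phil supergroup    2486813 2010-12-30 18:43 /user/phil/input/words"
)
puts entry.path  # /user/phil/input/words
puts entry.size  # 2486813
```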

    Now we can write our Map-Reduce Ruby scripts. Continue reading...

  • Coding Your Map And Reduce Scripts in Ruby


Reader Comments (1)

Hi,

Thanks very much for putting this post together! I am running into a problem with firing up a cluster though and am wondering if there is something obvious that I have not done. I've asked a question on stack overflow about this as well - and this has the full stack trace of the error. In summary I am seeing:


[~/src/cloudera/whirr-0.1.0+23]$ bin/whirr version rvm:ruby-1.8.7-p299
Apache Whirr 0.1.0+23
[~/src/cloudera/whirr-0.1.0+23]$ bin/whirr launch-cluster --config hadoop.properties rvm:ruby-1.8.7-p299
Launching myhadoopcluster cluster
Exception in thread "main" com.google.inject.CreationException: Guice creation errors:

1) No implementation for java.lang.String annotated with @com.google.inject.name.Named(value=jclouds.credential) was bound.
while locating java.lang.String annotated with @com.google.inject.name.Named(value=jclouds.credential)
for parameter 2 at org.jclouds.aws.filters.FormSigner.<init>(FormSigner.java:91)
at org.jclouds.aws.config.AWSFormSigningRestClientModule.provideRequestSigner(AWSFormSigningRestClientModule.java:66)

Yes, hadoop.properties has my AWS Access Key ID and Secret Access Key (although I did just sign up for them today).

Any help much appreciated!

Thanks,
Navin

January 16, 2011 | Unregistered CommenterNavin
