Map-Reduce With Ruby Using Hadoop

Map-Reduce With Hadoop Using Ruby A demonstration, with repeatable steps, of how to quickly fire-up a Hadoop cluster on Amazon EC2, load data onto the HDFS (Hadoop Distributed File-System), write map-reduce scripts in Ruby and use them to run a map-reduce job on your Hadoop cluster. You will not need to ssh into the cluster, as all tasks are run from your local machine. Below I am using my MacBook Pro as my local machine, but the steps I have provided should be reproducible on other platforms running bash and Java.


Sponsored Post: Newrelic, Cloudkick, Strata, EA, Joyent, CloudSigma, ManageEngine, Site24x7

Who's Hiring?

Fun and Informative Events

  • A new round of Membase meetups have been planned for January 2011 for San Diego, Denver, Seattle, Vancouver and Chicago.
  • O'Reilly' Strata Making Data Work Conference on February 1-3, 2011 Santa Clara, CA. Strata is a new conference from O'Reilly, focusing on the business and practice of data.

Cool Products and Services

  • Newrelic - What are you doing to ensure the performance of your apps?
  • Cloudkick - monitor & manage your servers better with a FREE Cloudkick developer account.
  • Join two game developers in a Joyent-sponsored webinar, You’ve Got Game: Planning for the Successful Scale and Performance of Your Cloud-based Game
  • CloudSigma. Instantly scalable European cloud servers.
  • ManageEngine Applications Manager : Monitor physical, virtual and Cloud Applications.
  • : Monitor End User Experience from a global monitoring network.

Click to read more ...


Stuff The Internet Says On Scalability For January 3, 2010

 Submitted for your reading pleasure...

  • Quotable Quotes
    • @hofmanndavid: Performance and scalability anxiety makes developers want to catch the flying butterflies
    • @tivrfoa: "Scalability solutions aren't magic. They involve partitioning, indexing and replication." Twitter engineer
    • Alan Perlis: Fools ignore complexity; pragmatists suffer it; experts avoid it; geniuses remove it.
  • CIO update: Post-mortem on the Skype outage. Interesting tale of a cascading collapse in complex, distributed, interactive systems. For more background see the highly illuminating Explaining Supernodes by Dan York.
  • RethinkDB and SSD Databases. SSD was not a revolution by Kevin Burton. What’s really shocking to me, is that while SSD and flash storage is very exciting, it wasn’t as revolutionary in 2010 as I would have liked to have seen.
  • The case for Datastore-Side-Scripting. Russell Sullivan predicts real-time web applications are going in the direction of being entirely event driven, from client (WebSockets) to web-server (Node.js) to datastore (Redisql). And to complete the even driven chain is datastore-side-scripting.
  • Developments that could change everything...

    Click to read more ...


Facebook in 20 Minutes: 2.7M Photos, 10.2M Comments, 4.6M Messages

To celebrate the new year Facebook has shared the results of a little end of the year introspection. It has been a fecund year for Facebook:

  • 43,869,800 changed their status to single
  • 3,025,791 changed their status to "it's complicated"
  • 28,460,516 changed their status to in a relationship
  • 5,974,574 changed their status to engaged
  • 36,774,801 changes their status to married

If these numbers are simply to large to grasp, it doesn't get any better when you look at happens in a mere 20 minutes:

Click to read more ...

Dec292010 Architecture - Pay to Play to Keep a System Small  

How do you keep a system small enough, while still being successful, that a simple scale-up strategy becomes the preferred architecture? StackOverflow, for example, could stick with a tool chain they were comfortable with because they had a natural brake on how fast they could grow: there are only so many programmers in the world. If this doesn't work for you, here's another natural braking strategy to consider: charge for your service

This interesting point, one I hadn't properly considered before, was brought up by Maciej Ceglowski, co-founder of, in an interview with Leo Laporte and Amber MacArthur on their their net@night show.

Pinboard is a lean, mean, pay for bookmarking machine, a timely replacement for the nearly departed Delicious. And as a self professed anti-social bookmarking site, it emphasizes speed over socializing. Maciej considers Pinboard a personal archive, where you can keep a history of what you are reading: forever. When the demise of Delicious was announced, if Pinboard had been a free site they'd have been down immediately, but being a paid site helped flatten out their growth curve.

Bookmarking sites used to about sharing links with your friends, but Twitter has largely taken over that role. Twitter, however, is infamous for presenting only a small slice of your tweet history. What you really want is a big server sucking down your bookmarks from wherever you might bookmark them, and that's just what Pinboard does.

A few points struck me as particularly cool about Pinboard:

Click to read more ...


Netflix: Continually Test by Failing Servers with Chaos Monkey

In 5 Lessons We’ve Learned Using AWS, Netflix's John Ciancutti says the best way to avoid failure is to fail constantly. In the cloud it's expected instances can fail at any time, so you always have to be prepared. In the real world we prepare by running drills. Remember all those exciting fire drills? It's not just fire drills of course. The military, football teams, fire fighters, beach rescue, virtually any entity that must react quickly and efficiently to disaster hones their responsiveness by running drills.

Netflix aggressively moves this strategy into the cloud by randomly failing servers using a tool they built called Chaos Monkey. The idea is:

If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.

They respond to failures by degrading service, but they always respond:

Click to read more ...


Paper: CRDTs: Consistency without concurrency control

For a great Christmas read forget The Night Before Christmas, a heart warming poem written by Clement Moore for his children, that created the modern idea of Santa Clause we all know and anticipate each Christmas eve. Instead, curl up with a some potent eggnog, nog being any drink made with rum, and read CRDTs: Consistency without concurrency control by Mihai Letia, Nuno Preguiça, and Marc Shapiro, which talks about CRDTs (Commutative Replicated Data Type), a data type whose operations commute when they are concurrent.

From the introduction, which also serves as a nice concise overview of distributed consistency issues:

Click to read more ...


SQL + NoSQL = Yes !


This is a guest post by Frédéric Faure (architect at Ysance), you can follow him on twitter.

Data storage has always been one of the most difficult problems to address, especially as the quantity of stored data is constantly increasing. This is not simply due to the growing numbers of people regularly using the Internet, particularly with all the social networks, games and gizmos now available. Companies are also amassing more and more meticulous information relevant to their business, in order to optimize productivity and ROI (Return On Investment). I find the positioning of SQL and NoSQL (Not Only SQL) as opposites rather a shame: it’s true that the marketing wave of NoSQL has enabled the renewed promotion of a system that’s been around for quite a while, but which was only rarely considered in most cases, as after all, everything could be fitted into the « good old SQL model ». The reverse trend of wanting to make everything fit the NoSQL model is not very profitable either.

So, what’s new … and what isn’t?

Click to read more ...


Sponsored Post: Electronic Arts, Joyent, Membase, CloudSigma, ManageEngine, Site24x7 

Who's Hiring?

Fun and Informative Events

  • A new round of Membase meetups have been planned for January 2011 for San Diego, Denver, Seattle, Vancouver and Chicago.

Cool Products and Services

Click to read more ...


Netflix: Use Less Chatty Protocols in the Cloud - Plus 26 Fixes

Updated on Friday, February 11, 2011 at 11:26AM by Registered CommenterTodd Hoff

In 5 Lessons We’ve Learned Using AWS, Netflix's John Ciancutti says one of the big lessons they've learned is to create less chatty protocols:

In the Netflix data centers, we have a high capacity, super fast, highly reliable network. This has afforded us the luxury of designing around chatty APIs to remote systems. AWS networking has more variable latency. We’ve had to be much more structured about “over the wire” interactions, even as we’ve transitioned to a more highly distributed architecture.

There's not a lot of advice out there on how to create protocols. Combine that with a rush to the cloud and you have a perfect storm for chatty applications crushing application performance. Netflix is far from the first to be surprised by the less than stellar networks inside AWS. 

A chatty protocol is one where a client makes a series of requests to a server and the client must wait on each reply before sending the next request. On a LAN this can work great. LAN's are typically fast, wide, and drop few packets.

Move that same application to a different network, one where round trip times can easily be an order of magnitude or larger because either the network is slow, lossy or poorly designed, and if a protocol takes many requests to complete a transaction, then it will make a dramatic difference in performance.

My WAN acceleration friends says Microsoft's Common Internet File System (CIFS) is infamous for being chatty. Transferring a 30MB file could tally something like 300msecs of latency on a LAN. On a WAN that could stretch to 7 minutes. Very unexpected results. What is key here is how the quality characteristics of the pipe interacts with the protocol design.

OK, chatty protocols are bad. What can you do about it?

Click to read more ...