Iron.io Moved From Ruby to Go: 28 Servers Cut and Colossal Clusterf**ks Prevented

For the last few months I've been programming a system in Go, so I'm always on the lookout for information to feed my confirmation bias. An opportunity popped up when Iron.io wrote about their experience using Go to rewrite IronWorker, their ever-busy job execution system, originally coded in Ruby.

The result:

  • Dropped from 30 servers to 2, with the second server used only for redundancy.
  • CPU utilization dropped to less than 5%.
  • Memory usage dropped: only a "few hundred KB's of memory (on startup) vs our Rails apps which were ~50MB (on startup)".
  • Cascading failures are now a thing of the past.
  • New services running on hundreds of servers are all written in Go.
  • They believe using Go allows them to "build great products, to grow and scale, and attract grade A talent. And I believe it will continue to help us grow for the foreseeable future." Picking a language based on the size of the talent pool is a common recommendation, but they've found that selecting Go helps them attract top talent.
  • Deployment is easy because Go compiles to a single static image.
  • Minor drawbacks with Go: learning a new language and limited libraries.
  • Go is a good option for servers that will get a lot of traffic, or when you want to prepare for sudden growth.

Sure, rewrites untouched by the Second System Effect can be a lot faster, but you may recall LinkedIn had a similar experience: LinkedIn Moved From Rails To Node: 27 Servers Cut And Up To 20x Faster

Here's an explanation of the problem Go fixed:

  • With Ruby, sustained server CPU usage ranged between 50% and 60%. Servers were added to keep CPU usage at 50% so traffic spikes could be handled gracefully. The downside of this approach is that it requires scaling out horizontally with expensive servers.
  • They had a very interesting failure mode. When traffic spiked, a Rails server would hit 100% CPU. This caused the server to appear failed, which caused the load balancer to route traffic to the remaining servers, which caused more servers to hit 100% CPU. The end result was cascading failure.
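
The failure mode in the second bullet can be sketched as a toy simulation. This is a rough model under stated assumptions, not Iron.io's actual setup: the function name survivors, the even redistribution of traffic, and the rule that a server at or above 100% counts as failed are all hypothetical simplifications.

```go
package main

import "fmt"

// survivors models a naive load balancer (a hypothetical simplification).
// loads holds each server's CPU utilization as a fraction (1.0 == 100%).
// Any server at or above 100% is treated as failed, and its traffic is
// split evenly across the servers that are still up. The function returns
// how many servers remain once no further failures occur.
func survivors(loads []float64) int {
	alive := append([]float64(nil), loads...) // copy; leave the caller's slice alone
	for {
		var keep []float64
		var orphaned float64 // traffic from servers that failed this round
		for _, l := range alive {
			if l >= 1.0 {
				orphaned += l
			} else {
				keep = append(keep, l)
			}
		}
		if orphaned == 0 || len(keep) == 0 {
			return len(keep)
		}
		share := orphaned / float64(len(keep))
		for i := range keep {
			keep[i] += share
		}
		alive = keep
	}
}

func main() {
	// One server spikes while the others sit at 50%: the spike is absorbed.
	fmt.Println(survivors([]float64{1.0, 0.5, 0.5, 0.5})) // prints 3
	// Same spike with the others already at 75%: every server goes down.
	fmt.Println(survivors([]float64{1.0, 0.75, 0.75, 0.75})) // prints 0
}
```

In this model the only difference between riding out a spike and losing the whole fleet is how much headroom the other servers had, which is why the 50% target mattered so much.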

The time-to-market argument for using Ruby can make a lot of sense. Performance isn't everything, but here we see the value of performance, especially outside the web tier. A well-performing service is a huge win in both robustness and cost.

Usually weakness is covered up by quantity and quantity is expensive and doesn't always work.

Performance acts as a buffer, allowing a system to absorb traffic without breaking, enduring sucker punch after sucker punch. Just-in-time allocation, spinning up new instances on demand, takes time: long enough that traffic spikes can cause cascading failures before the new capacity is ready. By coding for performance instead of time-to-market, you prevent these problems from ever arising.

Go is Not Perfect

If you look at Go's Google Group, Go is not without performance concerns. Often these issues can be coded around. For example, using the bufio.Reader method ReadSlice instead of ReadString removes a data copy and magically your code is X times faster.

It takes time to learn these tricks, especially with such a new language. 

Where Go is definitely at a disadvantage is the many years the JVM and the V8 JavaScript Engine have had to optimize garbage collection and code generation. It may take Go a while to catch up.

Performance is never free. You have to code smart. Minimize shared state, don't churn memory, profile like hell, know your language and do the right thing by it.
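
One way to read "minimize shared state" in Go, as a small sketch rather than a prescription: each goroutine accumulates into its own local variable and reports once over a channel, instead of every goroutine contending on a shared counter. The function name parallelSum and the chunking scheme are illustrative assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// parallelSum splits nums into at most `workers` chunks. Each goroutine
// sums its chunk into a local variable and sends a single result, so no
// memory is written by more than one goroutine.
func parallelSum(nums []int, workers int) int {
	if workers < 1 {
		workers = 1
	}
	results := make(chan int, workers) // buffered so senders never block
	var wg sync.WaitGroup
	chunk := (len(nums) + workers - 1) / workers // ceiling division
	for start := 0; start < len(nums); start += chunk {
		end := start + chunk
		if end > len(nums) {
			end = len(nums)
		}
		wg.Add(1)
		go func(part []int) {
			defer wg.Done()
			local := 0 // private to this goroutine
			for _, n := range part {
				local += n
			}
			results <- local
		}(nums[start:end])
	}
	wg.Wait()
	close(results)
	total := 0
	for r := range results {
		total += r
	}
	return total
}

func main() {
	nums := make([]int, 100)
	for i := range nums {
		nums[i] = i + 1
	}
	fmt.Println(parallelSum(nums, 4)) // prints 5050
}
```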

Reader Comments (7)

Why Go? Why not Erlang?

March 13, 2013 | Unregistered Commenter asd

Go is irrelevant in this story. The root cause was Ruby. When Ruby is gone and replaced by any industrial language -- could be Java, C# or anything proven -- the problems are gone too.

March 13, 2013 | Unregistered Commenter Enterprise Java Coder

For those who have decided to move to Go, there is a highly optimized library for in-process caching, a fast memcache client library, and a memcache server optimized for SSDs, all written in Go :)

March 13, 2013 | Unregistered Commenter valyala

I still don't understand whether your problem comes from Ruby or Rails.

March 13, 2013 | Unregistered Commenter Steven Yue

Why not Python?

March 14, 2013 | Unregistered Commenter Save

Waaat? Moved from interpreted to compiled code and performance improved?! These "engineers" must be geniuses!!!

March 15, 2013 | Unregistered Commenter billg

The fact that they went from 30 servers to 2 is a testament to just how poor a choice Ruby was in the first place. But now they're moving to Go -- an even more obscure language? This makes me question whether they have their customers' needs in mind or whether they simply want to brag about how they're on the bleeding edge.

March 16, 2013 | Unregistered Commenter jbsiemens
