Iron.io Moved From Ruby to Go: 28 Servers Cut and Colossal Clusterf**ks Prevented

For the last few months I've been programming a system in Go, so I'm always on the lookout for information to feed my confirmation bias. An opportunity popped up when Iron.io wrote about their experience using Go to rewrite IronWorker, their ever-busy job execution system, originally coded in Ruby.

The result:

  • Dropped from 30 servers to 2, and the second server was used only for redundancy.
  • CPU utilization dropped to less than 5%.
  • Memory usage dropped. Only a "few hundred KB's of memory (on startup) vs our Rails apps which were ~50MB (on startup)".
  • Cascading failures are now a thing of the past.
  • New services running on hundreds of servers are all written in Go.
  • They believe using Go allows them to "build great products, to grow and scale, and attract grade A talent. And I believe it will continue to help us grow for the foreseeable future." Picking a language based on the size of the talent pool is the usual recommendation; they've found that choosing Go instead helps them attract top talent.
  • Deployment is easy because Go compiles to a single static binary (a minimal build sketch follows this list).
  • Minor drawbacks with Go: learning a new language and limited libraries.
  • Go is a good option for servers that will get a lot of traffic and if you want to prepare for sudden growth.
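
To make the single-binary deployment point concrete, here is a minimal sketch, not from the Iron.io article: a trivial HTTP worker whose entire deployable artifact is the one executable the build produces (the file name and build flags are illustrative assumptions).

    // main.go - building this produces one self-contained binary.
    // A typical static build (illustrative):
    //   CGO_ENABLED=0 go build -o worker .
    // The resulting ./worker file can be copied to a bare server and run
    // directly; there is no interpreter, gem set, or framework to install.
    package main

    import (
        "fmt"
        "log"
        "net/http"
    )

    func main() {
        http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            fmt.Fprintln(w, "worker up")
        })
        log.Fatal(http.ListenAndServe(":8080", nil))
    }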

Sure, rewrites untouched by the Second System Effect can be a lot faster, but you may recall LinkedIn had a similar experience: LinkedIn Moved From Rails To Node: 27 Servers Cut And Up To 20x Faster.

Here's an explanation of the problem Go fixed:

  • With Ruby, sustained server CPU usage ranged between 50% and 60%. Servers were added to keep CPU usage at 50% so traffic spikes could be handled gracefully. The downside of this approach is that it requires horizontally scaling out with expensive servers.
  • They had a very interesting failure mode. When traffic spiked, a Rails server would hit 100% CPU. That made the server appear to have failed, so the load balancer routed its traffic to the remaining servers, which pushed those servers to 100% CPU as well. The end result was a cascading failure.

The time-to-market argument for using Ruby can make a lot of sense. Performance isn't everything, but here we see the value of performance, especially outside the web tier. A well-performing service is a huge win in both robustness and cost.

Usually weakness is covered up by quantity, but quantity is expensive and doesn't always work.

Performance acts as a buffer, allowing a system to absorb traffic without breaking, enduring sucker punch after sucker punch. Just-in-time allocation, that is, spinning up new instances on demand, takes time, long enough that a traffic spike can cause a cascading failure before the new capacity arrives. By coding for performance instead of time to market, you prevent these problems from ever arising.

Go is Not Perfect

A look at Go's Google Group shows that Go is not without performance concerns. Often these issues can be coded around. For example, using bufio.ReadSlice instead of bufio.ReadString removes a data copy and suddenly your code is X times faster.
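
Here is a minimal sketch of that trade-off (the input is illustrative, not from the mailing list thread): both loops read the same lines, but ReadString copies each line into a new string while ReadSlice hands back a slice of the reader's internal buffer.

    package main

    import (
        "bufio"
        "fmt"
        "strings"
    )

    func main() {
        data := "alpha\nbeta\ngamma\n"

        // ReadString allocates a fresh string for every line.
        r := bufio.NewReader(strings.NewReader(data))
        for {
            line, err := r.ReadString('\n')
            if len(line) > 0 {
                fmt.Print("copy:  ", line)
            }
            if err != nil {
                break
            }
        }

        // ReadSlice returns a slice into the reader's internal buffer: no copy,
        // but the slice is only valid until the next read, and a line longer
        // than the buffer comes back with bufio.ErrBufferFull.
        r = bufio.NewReader(strings.NewReader(data))
        for {
            line, err := r.ReadSlice('\n')
            if len(line) > 0 {
                fmt.Print("slice: ", string(line))
            }
            if err != nil {
                break
            }
        }
    }

The win comes from skipping the per-line allocation; the cost is a lifetime rule (the slice is invalid after the next read) that is easy to get wrong.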

It takes time to learn these tricks, especially with such a new language.

Where Go is definitely at a disadvantage is the many years of head start the JVM and the V8 JavaScript engine have had in optimizing garbage collection and code generation. It may take Go a while to catch up.

Performance is never free. You have to code smart. Minimize shared state, don't churn memory, profile like hell, know your language and do the right thing by it.
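
"Profile like hell" is straightforward to act on in Go. Here is a minimal sketch of the standard approach using the net/http/pprof package (the address and port are illustrative assumptions):

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers the /debug/pprof/ handlers on the default mux
    )

    func main() {
        // Serve the profiling endpoints on a local, non-public port.
        go func() {
            log.Println(http.ListenAndServe("localhost:6060", nil))
        }()

        // ... the real service would run here ...
        select {}
    }

With that in place, go tool pprof http://localhost:6060/debug/pprof/profile captures a CPU profile from the running process, and /debug/pprof/heap does the same for memory.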