
Paper: Large-scale cluster management at Google with Borg

Joe Beda (@jbeda): Borg paper is finally out. Lots of reasoning for why we made various decisions in #kubernetes. Very exciting.

The hints and allusions are over. We now have everything about Google's long-rumored Borg project in one iconic Google-style paper: Large-scale cluster management at Google with Borg.

When Google blew our minds by audaciously treating the Datacenter as a Computer, it did not go unnoticed that, by analogy, there must be an operating system for that datacenter/computer.

Now we have the story behind a critical part of that OS:

Google's Borg system is a cluster manager that runs hundreds of thousands of jobs, from many thousands of different applications, across a number of clusters each with up to tens of thousands of machines.

It achieves high utilization by combining admission control, efficient task-packing, over-commitment, and machine sharing with process-level performance isolation. It supports high-availability applications with runtime features that minimize fault-recovery time, and scheduling policies that reduce the probability of correlated failures. Borg simplifies life for its users by offering a declarative job specification language, name service integration, real-time job monitoring, and tools to analyze and simulate system behavior.

We present a summary of the Borg system architecture and features, important design decisions, a quantitative analysis of some of its policy decisions, and a qualitative examination of lessons learned from a decade of operational experience with it.

Virtually all of Google’s cluster workloads have switched to use Borg over the past decade. We continue to evolve it, and have applied the lessons we learned from it to Kubernetes.
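The abstract's mention of admission control, efficient task-packing, and over-commitment is easier to picture with a toy example. What follows is a minimal sketch under stated assumptions, not Borg's scheduler: the paper describes a considerably more elaborate system. The names `Machine`, `admit`, `best_fit`, and `schedule`, the single per-user CPU `quota`, and the `overcommit` factor are all hypothetical, chosen only to show the three ideas side by side. Admission control rejects work the user has no quota for, best-fit packing places each task on the machine it fills most tightly, and over-commitment lets the CPU promised on a machine exceed what is physically there.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Machine:
    name: str
    cpu_cores: float          # physical CPU capacity
    overcommit: float = 1.0   # hypothetical factor: 1.25 lets placements total 125% of capacity
    placed: list = field(default_factory=list)  # (task_name, cpu) pairs

    def headroom(self) -> float:
        """CPU still schedulable on this machine, counting over-commitment."""
        used = sum(cpu for _, cpu in self.placed)
        return self.cpu_cores * self.overcommit - used


def admit(cpu: float, quota_left: float) -> bool:
    """Admission control: refuse work the submitting user has no quota for."""
    return 0 < cpu <= quota_left


def best_fit(machines: list, cpu: float) -> Optional[Machine]:
    """Task packing: among machines that fit, pick the one left with the least slack."""
    feasible = [m for m in machines if m.headroom() >= cpu]
    return min(feasible, key=lambda m: m.headroom() - cpu, default=None)


def schedule(machines: list, tasks: list, quota: float) -> list:
    """Place (name, cpu) tasks in order; return the names left pending."""
    pending = []
    for name, cpu in tasks:
        if not admit(cpu, quota):
            pending.append(name)          # rejected by admission control
            continue
        target = best_fit(machines, cpu)
        if target is None:
            pending.append(name)          # admitted, but nowhere to put it
            continue
        target.placed.append((name, cpu))
        quota -= cpu                      # charge the task against the quota


    return pending


if __name__ == "__main__":
    # m2 has 4 physical cores but, with over-commitment, accepts 5 cores of tasks.
    cell = [Machine("m1", 8.0), Machine("m2", 4.0, overcommit=1.25)]
    tasks = [("web", 3.0), ("batch", 4.5), ("cache", 2.0), ("big", 6.0)]
    print(schedule(cell, tasks, quota=12.0))  # prints ['big']
```

In this run `web` and `cache` are packed onto the smaller, over-committed `m2`, leaving the large `m1` free for `batch`, and `big` never reaches the packer because it exceeds the remaining quota. Real over-commitment only works if the system can reclaim resources when tasks actually use what they were promised, which this sketch does not attempt.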

Borg's offspring, Omega, followed, and the lessons from both are being rolled into Kubernetes (Greek for steersman, helmsman, sailing master), which has been open-sourced as part of Google's Cloud initiative.

Note how the world has changed. A decade ago, when Google published its industry-changing Bigtable and MapReduce papers, those papers launched a thousand open source projects in response. Now, instead of others simply copying the ideas, Google is open sourcing its own software, and that software was released well in advance of the paper describing it.

The future is still in the balance. There's a huge fight going on over what software will look like, how it is built, how it is distributed, and who makes the money. In the search business, keeping software closed was a competitive advantage. In the age of AWS, the only way to capture hearts and minds is by opening up your software. Interesting times.


Reader Comments (2)

Google's lessons learned from Borg (and increasingly Omega) are fundamental to understanding how datacenters will, going forward, be leveraged to innovate faster and out-think the competition. BTW, one doesn't have to wait years to benefit from this new development. Have a look at what the folks at Mesosphere are doing ;)


April 17, 2015 | Unregistered CommenterMichael Hausenblas

I agree with Michael's comments. It's very exciting to see Kubernetes gain so much traction as a system modeled directly after Borg, but built for the masses by the same engineers who implemented Borg and Omega. If I had to choose between a third-party implementation of Google's bespoke systems and Google implementing a system itself for everyone, I think that decision would be pretty easy to make.

June 5, 2015 | Unregistered CommenterJoseph Jacks
