We Finally Cracked the 10K Problem - This Time for Managing Servers with 2000x Servers Managed Per Sysadmin

High Scalability

Nov 19, 2013 — 2 min read

In 1999 Dan Kegel issued a big hairy audacious challenge to web servers:

It's time for web servers to handle ten thousand clients simultaneously, don't you think? After all, the web is a big place now.

This became known as the C10K problem. Engineers solved the C10K scalability problems by fixing OS kernels and moving away from threaded servers like Apache to event-driven servers like Nginx and Node.

Today we are considering an even bigger goal: how to support 10 Million Concurrent Connections, which requires even more radical techniques.

No similar challenge was issued for managing servers in a datacenter, but according to Dave Neary from Red Hat, in a recent FLOSS Weekly episode, we have passed the 10K barrier for server management with 10,000 or more servers managed per sysadmin.

Should we let this milestone pass without mention?

Absolutely not! It’s a stunning accomplishment with 200x-2000x increases in productivity. Dave said he remembered in the 1990s it took one sysadmin to manage 4 or 5 Windows servers. A Linux sysadmin could manage 50 to 60 servers.
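The multipliers follow directly from the quoted numbers. Here is a minimal sketch of the arithmetic; the constant and function names are made up for illustration, and the 10,000 figure is the "10,000 or more" lower bound taken as an exact value.

```python
# Servers-per-sysadmin figures quoted in the article.
SERVERS_TODAY = 10_000      # "10,000 or more", treated here as exactly 10,000
WINDOWS_1990S = (4, 5)      # one sysadmin per 4 or 5 Windows servers
LINUX_1990S = (50, 60)      # one sysadmin per 50 to 60 Linux servers


def productivity_gain(then: int, now: int = SERVERS_TODAY) -> float:
    """Return how many times more servers one sysadmin manages now."""
    return now / then


if __name__ == "__main__":
    for label, (low, high) in (("Windows", WINDOWS_1990S), ("Linux", LINUX_1990S)):
        # The larger 1990s figure gives the smaller multiplier, so print it first.
        print(f"{label}: {productivity_gain(high):.0f}x to {productivity_gain(low):.0f}x")
```

The 200x end of the range corresponds to the 50-server Linux figure and the 2000x end to the 5-server Windows figure.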

Now companies are managing over 10,000 servers per sysadmin. This huge change is rooted both in IaaS, treating a datacenter as an elastic programmable resource, divorcing operations from infrastructure deployment, and in the DevOps revolution, with its emphasis on tools, culture, automation, metrics, sharing of resources, and infrastructure as code.

What will it take to manage 10 million servers per sysadmin?

Who might know? Google of course.

As James Hamilton says, Counting Servers is Hard, but Microsoft says they have 1 million servers, and Google is planning for 10 million servers, so it may take a while before we can get to 10 million servers per sysadmin.

But when it does happen the base will be built on:

Treating The Datacenter As A Computer.

And within the datacenter, Multiplex Multiple Workloads On Computers To Increase Machine Utilization And Save Money.
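To see why multiplexing raises utilization, here is a toy sketch that packs workloads onto machines with a first-fit-decreasing heuristic. It is only an illustration, not how Google or any real cluster scheduler works; the function names, the single CPU-percentage dimension, and the sample demands are all assumptions.

```python
def pack_first_fit_decreasing(demands, capacity=100):
    """Place workloads on machines using the first-fit-decreasing heuristic.

    demands: CPU demand of each workload, as a percentage of one machine.
    Returns a list of machines, each a list of the demands placed on it.
    """
    machines = []
    for demand in sorted(demands, reverse=True):
        if demand > capacity:
            raise ValueError(f"workload needing {demand} exceeds machine capacity")
        for machine in machines:
            # Put the workload on the first machine with enough room left.
            if sum(machine) + demand <= capacity:
                machine.append(demand)
                break
        else:
            # No existing machine fits, so bring up a new one.
            machines.append([demand])
    return machines


def mean_utilization(machines, capacity=100):
    """Fraction of total machine capacity that is actually in use."""
    return sum(sum(m) for m in machines) / (len(machines) * capacity)


if __name__ == "__main__":
    demands = [60, 50, 40, 30, 20]             # hypothetical workloads
    dedicated = [[d] for d in demands]         # one workload per machine
    shared = pack_first_fit_decreasing(demands)
    print(len(dedicated), mean_utilization(dedicated))
    print(len(shared), mean_utilization(shared))
```

With these made-up demands, five dedicated machines run at 40% average utilization, while the packed layout fits the same work on two fully used machines. Real schedulers also have to account for memory, I/O, interference between workloads, and failures.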

But that’s just a single datacenter. That doesn’t get you to 10 million servers. For 10 million servers you have to exploit many datacenters, so you build a system like Spanner that can scale up to millions of machines across hundreds of datacenters and trillions of database rows.

Then of course you’ll need to create an amazing world-spanning network to connect it all together.

But to really get 10 million servers per sysadmin you’ll probably need a huge dose of Deep Learning to make sense of it all.

At a high level, the approaches to scaling to 10 million connections per server and to 10 million servers per sysadmin are the same: scalability is specialization.

But at a lower level they differ completely. Scaling to 10 million connections is about removing layers and doing the work yourself. Scaling to 10 million servers is all about putting the intelligence into smarter and smarter layers, a lot like how the human body utilizes trillions of individual components mediated by many autonomous systems, all directed by a parallelized and decentralized brain.