« New Website Design Considerations | Main | YouTube Architecture »

Problem: Mobbing the Least Used Resource Error

A thoughtful reader recently suggested creating a series of posts based on real-life problems people have experienced and the solutions they've created to slay the little beasties. It's a great idea. Often we learn best from great trials and tribulations. I'll start off the new "Problem Report"
feature with a diabolical little problem I dubbed the "Mobbing the Least Used Resource Error." Please post your own. And if you know someone with an interesting problem report, please tag them too. It could be a lot of fun. Of course, feel free to scrub your posts of all embarrassing details, but be sure to keep the heroic parts in :-)

The Problem

There's an unexpected and frequently fatal type of error that can happen when new resources are added to a horizontally scaled architecture. Because the new resource has the least of something, load or connections or whatever, a load balancer configured with a least metric will instantaneously direct all new traffic to that new resource. And bam! Your system dies. All the traffic that was meant to be spread across your entire cluster is now directed like a laser beam to one small part of it.

I love this problem because it's such a Heisenberg. Everyone is screaming for more storage space so you bring up a new filer. All new data streams flow to the new filer and it crumbles and crawls because it can't handle the load for the entire system. It's in the very act of turning up more storage you bring your system down. How "cruel world the universe hates me" is that?

Let's say you add database slaves to handle load. Your load balancer redirects traffic to the new slaves, but the slaves are trying to sync, yet they can't sink because they are getting hammered by the new traffic. Down goes Frazier.

This is the dark side of partitioning. You partition data to get high performance via parallelization. For example, you hash on the user name to a cluster dedicated to handle those users. Unless your system is very flexible you can't scale anymore by adding resources because you can't repartition the data. All users are handled by their cluster. If you want a different organization you would have to redistribute data across all the clusters. Most systems can't handle that and you end not being able to scale out as easily as you hoped.

The Solution

The solution depends of course on the resource in question. Butting knowing a potential problem is present gives you the heads up you need to avoid destruction.

  • For filers migrate storage from existing filers to the new filers so storage is evened out. Then new storage will be allocated evenly across all the filers.
  • For services have a life cycle state machine indicating when a service is up and ready for work. Simply being alive doesn't mean it's ready.
  • Consistent Hashing to assign resources to a pool of servers in a scalable fashion.
  • For servers use random or round-robin balancing when the load balancer can receive incorrect feedback from pool servers.

    The Thundering Herd Problem is supposedly the same problem described here, but it doesn't seem the same to me.
  • Reader Comments (10)

    Hi, when you talk about filer soluction you say to move old documents from filled filers to empty and new filler, because load balancer try to divide request to new "empty" filers ?


    November 29, 1990 | Unregistered Commenterundol

    actually, for web services, its funnier than that.

    If your servers are doing something slow, they have a queue. Then, if one of the servers starts failing (say it stops being able to talk to the database, and returns 500+SOAPFAult on every request, it ends up with the shortest queue. The load balancer then directs more work that way. So instead of 1 request in, say, 8 failing (a random distro), suddenly 1 in 4 fails. This not only gives the customers a harder time, it makes it initially harder to identify the problem (why is 1 in four failing? we have 8 machines?)

    This is why Apache Axis 1.x SOAPFaults always include the hostname in its exceptions (you can turn this off or change the name). That way when callers complain your service is broken, the bug report can identify the problem for you.

    -steve loughran (ex Axis team)

    November 29, 1990 | Unregistered CommenterSteve Loughran

    My favorite is that a sort of broken system causes more issues than a truly broken one sometimes. Much like Steve's example, we've had instances where one of our application servers was online, but never responded. So all our clients were queuing up on that server because they hadn't timed out yet. This rose our total connections past the point that our memcache servers could handle, and down goes the site. Had the server just exploded and not responded at all, everything would've been fine.

    November 29, 1990 | Unregistered CommenterUltimateBrent

    When dealing with TCP-based services (say http), my solution is not to use this load balancing mode at all precisely for this reason. When backend servers change their state very fast, load balancers are always trying to catch up. The higher the load, the more likely they will fall behind - if that happens, they will be distributing connections based on state information from some point in the past (and you never know how far in the past). Round robin (possibly weighted) is much simpler and more reliable, imho.

    November 29, 1990 | Unregistered CommenterDmitriy

    I've also heard this called "The Thundering Herd" problem. I'm actually working on an article for my website right now about one way you can bring resources up slowly to avoid this.

    November 29, 1990 | Unregistered CommenterTom Kleinpeter

    +1 for round-robin balancing in cases where quiet =! not busy. hardware balancing MySQL machines comes immediately to mind. Zawodny covers that in his book with a great example, and at Friendster (when I was there) we saw the same thing. Random, round-robin...whatever it is, the balancing algorithm can't expect to get any correct 'feedback' from the members of its pool in those sitations.

    As for the Herd-Changing-Direction issue, this is a very common occurrence in my experience. Speed one segment of you architecture up, and that bottleneck moves to a different segment. I've seen some db queries sped up by as much as 50ms, and since pages returned faster, clicking went faster, and therefore that 50ms database difference turned into a 10% bump in busy apache processes. Some more dramatic examples come to mind, too many to list. :)

    IMHO, there are lessons in these behaviors, and experience just reinforces them. If you put a 450 engine in a Chevette, you might expect that your transmission and your tires will see some effects.

    November 29, 1990 | Unregistered Commenterallspaw

    > here are lessons in these behaviors, and experience just reinforces them

    That one needs a good name. The Weakest Link Problem? Any other ideas? I see this a lot in writing servers. Speed up one thing so now it fills up queues faster. The queues start dropping requests. The client starts retrying. And it all bombs in fiery retry hell.

    November 29, 1990 | Unregistered CommenterTodd Hoff

    A well configured load balancer should elevate most of these problems with correctly configured healthchecks, warmup times etc.

    Before we had a flexible load balancer we use to manually load the new web/app servers up until it was aligned with the others in the farm.

    new_server.sh servernames
    build/replicate/config/startup/prefetch/monitor/e-mail admin (to turn on the tap of death)

    What really sux is when your beautiful healthchecks get pwned by a content check changed by your developers or CMS and all your farm goes offline while your phoneturns into a sex toy on vibrating accross your desk!! :D

    November 29, 1990 | Unregistered CommenterThomas

    We pretty much route everything back thru our load-balancers to manage connection limits.

    Would rather a tail spin created a steady slow site than a self created DOS. Working closely with developers you can then create 'waiting rooms' where the app in simple terms, sleeps() and tries again. Organizing a queue for these attempts is not solved easily and is still something we are looking at.

    Control and limits are needed, don't kill yourself with your own systems :)

    November 29, 1990 | Unregistered CommenterThomas

    Ideally, you're adding a server back to the 'farm' in a controlled environment so a little hand-holding isn't necessarily out of the question. With a load balancer like LVS configured to do weighted RR/LC, you could add the server in by hand versus your automated config scripts with a very low weight. As the server ramps up to handle that traffic, you can increase the weight so that it handles more and more traffic. Then eventually its resources will have ramped up enough that you can set the weight equal to the other servers. A script could be developed to monitor these newly added servers and slowly ramp up their weight. I've noticed that RR handles this better than LC.

    November 29, 1990 | Unregistered Commentercollidr

    PostPost a New Comment

    Enter your information below to add a new comment.
    Author Email (optional):
    Author URL (optional):
    Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>