Weblinks: Problems

Categories

Links in this category and its subcategories

Todd Hoff's picture

Problem: Mobbing the Least Used Resource Error

A thoughtful reader recently suggested creating a series of posts based on real-life problems people have experienced and the solutions they've created to slay the little beasties. It's a great idea. Often we learn best from great trials and tribulations. I'll start off the new "Problem Report"
feature with a diabolical little problem I dubbed the "Mobbing the Least Used Resource Error." Please post your own. And if you know someone with an interesting problem report, please tag them too. It could be a lot of fun. Of course, feel free to scrub your posts of all embarrassing details, but be sure to keep the heroic parts in :-)

The Problem


There's an unexpected and frequently fatal type of error that can happen when new resources are added to a horizontally scaled architecture. Because the new resource has the least of something, load or connections or whatever, a load balancer configured with a least metric will instantaneously direct all new traffic to that new resource. And bam! Your system dies. All the traffic that was meant to be spread across your entire cluster is now directed like a laser beam to one small part of it.

I love this problem because it's such a Heisenberg. Everyone is screaming for more storage space so you bring up a new filer. All new data streams flow to the new filer and it crumbles and crawls because it can't handle the load for the entire system. It's in the very act of turning up more storage you bring your system down. How "cruel world the universe hates me" is that?

Todd Hoff's picture

Skype Failed the Boot Scalability Test: Is P2P fundamentally flawed?

Skype's 220 millions users lost service for a stunning two days. The primary cause for Skype's nightmare (can you imagine the beeper storm that went off?) was a massive global roll-out of a Window's patch triggering the simultaneous reboot of millions of machines across the globe. The secondary cause was a bug in Skype's software that prevented "self-healing" in the face of such attacks. The flood of log-in requests and a lack of "peer-to-peer resources" melted their system.

Who's fault is it? Is Skype to blame? Is Microsoft to blame? Or is the peer-to-peer model itself fundamentally flawed in some way?