Problem: Mobbing the Least Used Resource Error

A thoughtful reader recently suggested creating a series of posts based on real-life problems people have experienced and the solutions they've created to slay the little beasties. It's a great idea. Often we learn best from great trials and tribulations. I'll start off the new "Problem Report" feature with a diabolical little problem I dubbed the "Mobbing the Least Used Resource Error." Please post your own. And if you know someone with an interesting problem report, please tag them too. It could be a lot of fun. Of course, feel free to scrub your posts of all embarrassing details, but be sure to keep the heroic parts in :-)

The Problem

There's an unexpected and frequently fatal type of error that can happen when new resources are added to a horizontally scaled architecture. Because the new resource has the least of something, load or connections or whatever, a load balancer configured with a least metric will instantaneously direct all new traffic to that new resource. And bam! Your system dies. All the traffic that was meant to be spread across your entire cluster is now directed like a laser beam to one small part of it. I love this problem because it's such a Heisenberg. Everyone is screaming for more storage space so you bring up a new filer. All new data streams flow to the new filer and it crumbles and crawls because it can't handle the load for the entire system. It's in the very act of turning up more storage you bring your system down. How "cruel world the universe hates me" is that? Let's say you add database slaves to handle load. Your load balancer redirects traffic to the new slaves, but the slaves are trying to sync, yet they can't sink because they are getting hammered by the new traffic. Down goes Frazier. This is the dark side of partitioning. You partition data to get high performance via parallelization. For example, you hash on the user name to a cluster dedicated to handle those users. Unless your system is very flexible you can't scale anymore by adding resources because you can't repartition the data. All users are handled by their cluster. If you want a different organization you would have to redistribute data across all the clusters. Most systems can't handle that and you end not being able to scale out as easily as you hoped.

The Solution

The solution depends of course on the resource in question. Butting knowing a potential problem is present gives you the heads up you need to avoid destruction.
  • For filers migrate storage from existing filers to the new filers so storage is evened out. Then new storage will be allocated evenly across all the filers.
  • For services have a life cycle state machine indicating when a service is up and ready for work. Simply being alive doesn't mean it's ready.
  • Consistent Hashing to assign resources to a pool of servers in a scalable fashion.
  • For servers use random or round-robin balancing when the load balancer can receive incorrect feedback from pool servers. The Thundering Herd Problem is supposedly the same problem described here, but it doesn't seem the same to me.

    Click to read more ...

  • Wednesday

    YouTube Architecture

    Update 3: 7 Years Of YouTube Scalability Lessons In 30 Minutes and YouTube Strategy: Adding Jitter Isn't A Bug

    Update 2: YouTube Reaches One Billion Views Per Day. That’s at least 11,574 views per second, 694,444 views per minute, and 41,666,667 views per hour. 

    Update: YouTube: The Platform. YouTube adds a new rich set of APIs in order to become your video platform leader--all for free. Upload, edit, watch, search, and comment on video from your own site without visiting YouTube. Compose your site internally from APIs because you'll need to expose them later anyway.

    YouTube grew incredibly fast, to over 100 million video views per day, with only a handful of people responsible for scaling the site. How did they manage to deliver all that video to all those users? And how have they evolved since being acquired by Google?

    Information Sources

  • Google Video


  • Apache
  • Python
  • Linux (SuSe)
  • MySQL
  • psyco, a dynamic python->C compiler
  • lighttpd for video instead of Apache

    What's Inside?

    The Stats

  • Supports the delivery of over 100 million videos per day.
  • Founded 2/2005
  • 3/2006 30 million video views/day
  • 7/2006 100 million video views/day
  • 2 sysadmins, 2 scalability software architects
  • 2 feature developers, 2 network engineers, 1 DBA

    Recipe for handling rapid growth

    while (true) { identify_and_fix_bottlenecks(); drink(); sleep(); notice_new_bottleneck(); } This loop runs many times a day.

    Web Servers

  • NetScalar is used for load balancing and caching static content.
  • Run Apache with mod_fast_cgi.
  • Requests are routed for handling by a Python application server.
  • Application server talks to various databases and other informations sources to get all the data and formats the html page.
  • Can usually scale web tier by adding more machines.
  • The Python web code is usually NOT the bottleneck, it spends most of its time blocked on RPCs.
  • Python allows rapid flexible development and deployment. This is critical given the competition they face.
  • Usually less than 100 ms page service times.
  • Use psyco, a dynamic python->C compiler that uses a JIT compiler approach to optimize inner loops.
  • For high CPU intensive activities like encryption, they use C extensions.
  • Some pre-generated cached HTML for expensive to render blocks.
  • Row level caching in the database.
  • Fully formed Python objects are cached.
  • Some data are calculated and sent to each application so the values are cached in local memory. This is an underused strategy. The fastest cache is in your application server and it doesn't take much time to send precalculated data to all your servers. Just have an agent that watches for changes, precalculates, and sends.

    Video Serving

  • Costs include bandwidth, hardware, and power consumption.
  • Each video hosted by a mini-cluster. Each video is served by more than one machine.
  • Using a a cluster means: - More disks serving content which means more speed. - Headroom. If a machine goes down others can take over. - There are online backups.
  • Servers use the lighttpd web server for video: - Apache had too much overhead. - Uses epoll to wait on multiple fds. - Switched from single process to multiple process configuration to handle more connections.
  • Most popular content is moved to a CDN (content delivery network): - CDNs replicate content in multiple places. There's a better chance of content being closer to the user, with fewer hops, and content will run over a more friendly network. - CDN machines mostly serve out of memory because the content is so popular there's little thrashing of content into and out of memory.
  • Less popular content (1-20 views per day) uses YouTube servers in various colo sites. - There's a long tail effect. A video may have a few plays, but lots of videos are being played. Random disks blocks are being accessed. - Caching doesn't do a lot of good in this scenario, so spending money on more cache may not make sense. This is a very interesting point. If you have a long tail product caching won't always be your performance savior. - Tune RAID controller and pay attention to other lower level issues to help. - Tune memory on each machine so there's not too much and not too little.

    Serving Video Key Points

  • Keep it simple and cheap.
  • Keep a simple network path. Not too many devices between content and users. Routers, switches, and other appliances may not be able to keep up with so much load.
  • Use commodity hardware. More expensive hardware gets the more expensive everything else gets too (support contracts). You are also less likely find help on the net.
  • Use simple common tools. They use most tools build into Linux and layer on top of those.
  • Handle random seeks well (SATA, tweaks).

    Serving Thumbnails

  • Surprisingly difficult to do efficiently.
  • There are a like 4 thumbnails for each video so there are a lot more thumbnails than videos.
  • Thumbnails are hosted on just a few machines.
  • Saw problems associated with serving a lot of small objects: - Lots of disk seeks and problems with inode caches and page caches at OS level. - Ran into per directory file limit. Ext3 in particular. Moved to a more hierarchical structure. Recent improvements in the 2.6 kernel may improve Ext3 large directory handling up to 100 times, yet storing lots of files in a file system is still not a good idea. - A high number of requests/sec as web pages can display 60 thumbnails on page. - Under such high loads Apache performed badly. - Used squid (reverse proxy) in front of Apache. This worked for a while, but as load increased performance eventually decreased. Went from 300 requests/second to 20. - Tried using lighttpd but with a single threaded it stalled. Run into problems with multiprocesses mode because they would each keep a separate cache. - With so many images setting up a new machine took over 24 hours. - Rebooting machine took 6-10 hours for cache to warm up to not go to disk.
  • To solve all their problems they started using Google's BigTable, a distributed data store: - Avoids small file problem because it clumps files together. - Fast, fault tolerant. Assumes its working on a unreliable network. - Lower latency because it uses a distributed multilevel cache. This cache works across different collocation sites. - For more information on BigTable take a look at Google Architecture, GoogleTalk Architecture, and BigTable.


  • The Early Years - Use MySQL to store meta data like users, tags, and descriptions. - Served data off a monolithic RAID 10 Volume with 10 disks. - Living off credit cards so they leased hardware. When they needed more hardware to handle load it took a few days to order and get delivered. - They went through a common evolution: single server, went to a single master with multiple read slaves, then partitioned the database, and then settled on a sharding approach. - Suffered from replica lag. The master is multi-threaded and runs on a large machine so it can handle a lot of work. Slaves are single threaded and usually run on lesser machines and replication is asynchronous, so the slaves can lag significantly behind the master. - Updates cause cache misses which goes to disk where slow I/O causes slow replication. - Using a replicating architecture you need to spend a lot of money for incremental bits of write performance. - One of their solutions was prioritize traffic by splitting the data into two clusters: a video watch pool and a general cluster. The idea is that people want to watch video so that function should get the most resources. The social networking features of YouTube are less important so they can be routed to a less capable cluster.
  • The later years: - Went to database partitioning. - Split into shards with users assigned to different shards. - Spreads writes and reads. - Much better cache locality which means less IO. - Resulted in a 30% hardware reduction. - Reduced replica lag to 0. - Can now scale database almost arbitrarily.

    Data Center Strategy

  • Used manage hosting providers at first. Living off credit cards so it was the only way.
  • Managed hosting can't scale with you. You can't control hardware or make favorable networking agreements.
  • So they went to a colocation arrangement. Now they can customize everything and negotiate their own contracts.
  • Use 5 or 6 data centers plus the CDN.
  • Videos come out of any data center. Not closest match or anything. If a video is popular enough it will move into the CDN.
  • Video bandwidth dependent, not really latency dependent. Can come from any colo.
  • For images latency matters, especially when you have 60 images on a page.
  • Images are replicated to different data centers using BigTable. Code looks at different metrics to know who is closest.

    Lessons Learned

  • Stall for time. Creative and risky tricks can help you cope in the short term while you work out longer term solutions.
  • Prioritize. Know what's essential to your service and prioritize your resources and efforts around those priorities.
  • Pick your battles. Don't be afraid to outsource some essential services. YouTube uses a CDN to distribute their most popular content. Creating their own network would have taken too long and cost too much. You may have similar opportunities in your system. Take a look at Software as a Service for more ideas.
  • Keep it simple! Simplicity allows you to rearchitect more quickly so you can respond to problems. It's true that nobody really knows what simplicity is, but if you aren't afraid to make changes then that's a good sign simplicity is happening.
  • Shard. Sharding helps to isolate and constrain storage, CPU, memory, and IO. It's not just about getting more writes performance.
  • Constant iteration on bottlenecks: - Software: DB, caching - OS: disk I/O - Hardware: memory, RAID
  • You succeed as a team. Have a good cross discipline team that understands the whole system and what's underneath the system. People who can set up printers, machines, install networks, and so on. With a good team all things are possible.

    Click to read more ...

  • Sunday

    Best Practices for Speeding Up Your Web Site

    The Exceptional Performance group at Yahoo! has identified 14 best practices for making web pages faster. These best practices have proven to reduce response times of Yahoo! properties by 25-50%. They focus on the front-end, for example, why it's bad to use "@import" for including stylesheets and why ETags disable browser caching. This google tech talk features these best practices and demonstrate YSlow. Relevant links: 14 Rules for Exceptional Web Performance: YSlow: Check out the book for details: High Performance Web Sites: Essential Knowledge for Front-End Engineers

    Click to read more ...


    DNS-Record TTL on worst case scenarios

    i didnt find a nearly good solution for this problem yet: imagine, you're responsible for a small CDN network (static images), with two different datacenter. the balancing for the two DC is done with a anycast nameservice (a nameserver in every DC, user gets on nearest location). so, one of the scenario is that one of the datacenters goes down completly. you can do a monitoring on the nameserver and only route to the dc which is still alive, no problem. But what about the TTL from the DNS-Records? Tiny TTLs like 2 min. are often ignored by several ISP (e.g. AOL). so, the client doesn't get the IP from the other Datacenter. what could be a solution in this scenario?

    Click to read more ...

    Mar082008 Architecture

    Update 3: Always Refer to Your V1 As a Prototype. You really do have to plan to throw one away. Update 2: Lessons Learned Scaling the Audiogalaxy Search Engine. Things he should have done and fun things he couldn’t justify doing. Update: Design details of’s high performance MySQL search engine. At peak times, the search engine needed to handle 1500-2000 searches every second against a MySQL database with about 200 million rows. Search was one of most interesting problems at Audiogalaxy. It was one of the core functions of the site, and somewhere between 50 to 70 million searches were performed every day. At peak times, the search engine needed to handle 1500-2000 searches every second against a MySQL database with about 200 million rows.

    Click to read more ...


    Product: FAI - Fully Automatic Installation

    From their website: FAI is an automated installation tool to install or deploy Debian GNU/Linux and other distributions on a bunch of different hosts or a Cluster. It's more flexible than other tools like kickstart for Red Hat, autoyast and alice for SuSE or Jumpstart for SUN Solaris. FAI can also be used for configuration management of a running system. You can take one or more virgin PCs, turn on the power and after a few minutes Linux is installed, configured and running on all your machines, without any interaction necessary. FAI it's a scalable method for installing and updating all your computers unattended with little effort involved. It's a centralized management system for your Linux deployment. FAI's target group are system administrators who have to install Linux onto one or even hundreds of computers. It's not only a tool for doing a Cluster installation but a general purpose installation tool. It can be used for installing a Beowulf cluster, a rendering farm, a web server farm, or a linux laboratory or a classroom. Even installing a HPC cluster or a GRID and fabric management can be realized by FAI. Large-scale linux networks with different hardware and different installation requirements are easy to establish using FAI and its class concept. Remote OS installations, Linux rollout, mass unattended installation and automated server provisioning are other topics for FAI. The city of Munich is using the combination of GOsa and FAI for their Limux project. Features: * Boot methods: network boot (PXE), CD-ROM, USB stick, floppy disk * Installs Debian, Ubuntu, SuSe, CentOS, Mandriva, Solaris, ... * Centralized installation and configuration management * Installs XEN domains and Vserver

    Related Articles

  • FAI wiki
  • FAI the fully automated installation framework for linux from Debian Administration
  • Fully Automatic Installation (FAI) Video Interviewby
  • Rolling Out Unattended Debian Installations by Carla Schroder from LinuxPlanet
  • A talk on fai and debian

    Click to read more ...

  • Saturday

    Product: DRBD - Distributed Replicated Block Device

    From their website: DRBD is a block device which is designed to build high availability clusters. This is done by mirroring a whole block device via (a dedicated) network. You could see it as a network raid-1. DRBD takes over the data, writes it to the local disk and sends it to the other host. On the other host, it takes it to the disk there. The other components needed are a cluster membership service, which is supposed to be heartbeat, and some kind of application that works on top of a block device. Examples: A filesystem & fsck. A journaling FS. A database with recovery capabilities. Each device (DRBD provides more than one of these devices) has a state, which can be 'primary' or 'secondary'. On the node with the primary device the application is supposed to run and to access the device (/dev/drbdX). Every write is sent to the local 'lower level block device' and to the node with the device in 'secondary' state. The secondary device simply writes the data to its lower level block device. Reads are always carried out locally. If the primary node fails, heartbeat is switching the secondary device into primary state and starts the application there. (If you are using it with a non-journaling FS this involves running fsck) If the failed node comes up again, it is a new secondary node and has to synchronise its content to the primary. This, of course, will happen whithout interruption of service in the background. And, of course, we only will resynchronize those parts of the device that actually have been changed. DRBD has always done intelligent resynchronization when possible. Starting with the DBRD-0.7 series, you can define an "active set" of a certain size. This makes it possible to have a total resync time of 1--3 min, regardless of device size (currently up to 4TB), even after a hard crash of an active node.

    Related Articles

  • How to build a redundant, high-availability system with DRBD and Heartbeat by Pedro Pla in Linux Journal
  • Linux-HA Press Room with many excellent high availability articles.
  • Sync Data on All Servers thread.
  • MySQL clustering strategies and comparisions
  • Wikipedia on DRBD
  • Using Xen for High Availability Clusters by by Kris Buytaert and Johan Huysmans in
  • DRBD for MySQL High Availability

    Click to read more ...

  • Thursday

    Announce: First Meeting of Boston Scalability User Group

    The first meeting will take place on Wednesday March 26 at 6 p.m. in the IBM Innovation Center (Waltham, MA). The first speaker will be Patrick Peralta of Oracle! Patrick will be presenting: Orchestrating Messaging, Data Grid and Database for Scalable Performance. Important Note: There will be pizza at this meeting! The site is at:

    Click to read more ...


    Oprah is the Real Social Network

    A lot of new internet TV station startups are in the wind these days and there's a question about how they can scale their broadcasts. Today's state of the art shows you can't yet mimic the reach of broadcast TV with internet tech. But as Oprah proves, you can still capture a lot of eyeballs, if you are Oprah... Oprah drew a stunning 500,000 simultaneous viewers for an Eckhart Tolle webcast. Move Networks and Limelight Networks hosted the "broadcast" where traffic peaked at 242Gbps. A variable bitrate scheme was used so depending on their connection, a viewer could have seen 150Kbps or as high as 750Kbps. Dan Rayburn thinks The big take away from this webcast is that it shows proof that the Internet is not built to handle TV like distribution and those who think that live TV shows will be broadcast on the Internet with millions and millions of people watching, it's just not going to happen. To handle more users comments suggested capping the bitrate at 300K, using P2P streaming, or using a CDN more specialized in live streaming. I went to Oprah's website and was a bit shocked to find she didn't have full blown social network available. Can you imagine if she did? Oprah's army would seem to be a highly desirable bunch to monetize.

    Click to read more ...


    Manage Downtime Risk by Connecting Multiple Data Centers into a Secure Virtual LAN

    Update: VcubeV - an OpenVPN-based solution designed to build and operate a multisourced infrastructure. True high availability requires a presence in multiple data centers. The recent downtime of even a high quality operation like Amazon makes this need all the more clear. Typically only the big boys can afford the complexity of operating in two or more data centers. Cloud computing along with utility billing starts to change that equation, leveling the playing field. Even smaller outfits will be in a position to manage risk by spreading machines amongst EC2, 3tera, Slicehost, Mosso and other providers. The question then becomes: given we aren't Angels, how do we walk amongst the clouds? One fascinating answer is exquisitely explained by Dmitriy Samovskiy in his Linux Journal article titled Building a Multisourced Infrastructure Using OpenVPN. Dmitriy's idea is to create a secure UDP tunnel between different data centers over public internet links so your application sees a flat virtual network even though the machines run in different data centers. Your machines think they are on the same local network when in reality clusters of machines are maintained in multiple locations communicating over the internet. This impossible sounding task is well described in his article and involves setting up OpenVPN and a lot of tricky bits of configuration. Your reward? Geographical redundancy, encrypted communications, higher fault tolerance, nearest resource routing, better horizontal scalability, and greater vendor independence. Dmitriy points out there are some potential issues with this architecture:

  • Broadcasting and multicasting will not work over the tunnel.
  • Latency over the public network is higher over the public network than it is with your local Ethernet.
  • Tunnels tend to go up and down more than an Ethernet network. Having used a setup like this before it's quite possible to have very fast backbone links connecting data centers so the latency, bandwidth, and connection quality issues can be a lot less than you might think, or they could be an absolute killer. The broadcast/multicast problem did come up, but there always alternative approaches that don't require this ability.

    A Few Questions for Dmitriy

    I asked Dmitriy a few questions and he was kind enough to respond with the following answers: 1. Why would I want to create a virtual LAN rather than create a service layer and access services over http? This depends on what kind of services we are talking about. With hosts in 2 different datacenters which are operated by different hosting companies, and assuming no private connectivity (like a private T1 which you pay for and support), the only way for hosts to talk to each other is via public Internet. If the data your services will be exchanging do not need to be protected from external eyes and you don't need to restrict access directly to services from Internet, then service layer and access over http would definitely be easier. However, if you don't want public access to those services, the first thing we did was have a firewall and restrict who can access which service by IP. For example, we provision machines as needed at Server Beach, one machine at a time (as I said, our operation is currently relatively small). And we handle user auth from LDAP. Whenever we get a new machine, we adjust its firewall and adjust firewalls on all other machines which it's going to communicate with. In our case, we adjusted firewall on LDAP server so a new host could talk to LDAP. With time this peer-to-peer firewall adjusting became too error prone and time consuming as the number of hosts you have goes up. Besides, it breaks change isolation to a certain extent - when bringing up a new host, I have to adjust existing production. In our example - we set up LDAP replica and now all hosts needed to be reconfigured to failover to replica if the primary was not reachable - which meant a lot of firewall changes on multiple hosts. With more services and more hosts, I was dreading we'd end up with a pile of unmanageable firewall rules. Another aspect missing was data encryption when data pass on public Internet links. Was no big deal for us at the moment, but sooner or later everybody starts worrying about this so I took a preemptive shot. Vanilla OpenVPN helped us kill these 2 birds with one stone. We got encryption and once a server has a virtual IP, it's easier to manage firewalls - I choose to manage it on server side (so in our example, on LDAP server). Our dynamic routing script allowed us to have a pair of active-active OpenVPN servers, lack of which would have been a show stopper for me. There are also 2 key benefits of OpenVPN that I like a lot: a. passes through NAT and firewalls (since it's UDP). I can have a machine behind all sorts of firewalls and on network and I still can ssh to it from anywhere in the cloud (using its virtual IP). Works great for VMs with NAT networking type. b. you can assign static virtual IPs to hosts based on ssl key/cert pairs. This comes very handy when you start thinking about Amazon EC2 and their lack of static IP addrs at the moment. 2. Can I connect more than two data center in a pairwise configuration? Yes you can, provided all your hosts that need to connect to VcubeV have physical network connectivity to at least one OpenVPN server (either over LAN or WAN). Plus, at least one OpenVPN server needs to be accessible by the other OpenVPN server. Please see my terrible diagram within the article at . If you want more than 2 OpenVPN servers, please see my (4) below. 3. You mention the downsides are manageable by making certain architectural choices. Could you please describe these? Sure, it's pretty much what I said in the conclusion section in the article. Primarily it's "don't multisource if an app delivers better value when singlesourced." Term "better value" will vary from architect to architect. All of these solutions would require further experimentation. 1. No broadcast or multicast. Solution: look into using OpenVPN on top of `tap' devices instead of `tun'. I personally would not multisource an app that does broadcast or multicast, since it's too low level and imho is likely to have other issues with being deployed in environment which is drastically different from what its designers had in mind. 2. Latency. One depends on public Internet links, so latency can't be controlled. Solution: anticipate latency, application retry logic, adjustable timeouts. If latency is a key aspect of application (trading, for example), don't multisource or at least think twice. 3. Link flapping. Solution: retry logic, avoid long-running TCP connections, forcefully break and re-establish TCP connections regularly, application level heartbeats, use TCP tunnels instead of UDP tunnels, consider data caches (memcached). 4. No more than 2 OpenVPN servers. It's a design limitation of current version of cube-routed. Solution: rewrite cube-routed to share route information using a more advanced protocol that allows many-to-many sharing.

    What Will the Future Look Like?

    It seems clear to me we are going to need a whole new set of tools and infrastructure for managing, deploying, creating, expanding, upgrading, and monitoring applications across multiple clouds. The advantages of multi-cloud deployment are too great to ignore. We need a Data Center API so we can treat all the different clouds as peers and operate on them like one big exposed object instead of individually specialized niches. Will we see real-time markets develop where clouds bid for your network/CPU/storage business and you can dynamically allocate applications to cloud vendors in order to minimize costs?

    Click to read more ...