Mar082008 Architecture

Update 3: Always Refer to Your V1 As a Prototype. You really do have to plan to throw one away. Update 2: Lessons Learned Scaling the Audiogalaxy Search Engine. Things he should have done and fun things he couldn’t justify doing. Update: Design details of’s high performance MySQL search engine. At peak times, the search engine needed to handle 1500-2000 searches every second against a MySQL database with about 200 million rows. Search was one of most interesting problems at Audiogalaxy. It was one of the core functions of the site, and somewhere between 50 to 70 million searches were performed every day. At peak times, the search engine needed to handle 1500-2000 searches every second against a MySQL database with about 200 million rows.

Click to read more ...


Product: FAI - Fully Automatic Installation

From their website: FAI is an automated installation tool to install or deploy Debian GNU/Linux and other distributions on a bunch of different hosts or a Cluster. It's more flexible than other tools like kickstart for Red Hat, autoyast and alice for SuSE or Jumpstart for SUN Solaris. FAI can also be used for configuration management of a running system. You can take one or more virgin PCs, turn on the power and after a few minutes Linux is installed, configured and running on all your machines, without any interaction necessary. FAI it's a scalable method for installing and updating all your computers unattended with little effort involved. It's a centralized management system for your Linux deployment. FAI's target group are system administrators who have to install Linux onto one or even hundreds of computers. It's not only a tool for doing a Cluster installation but a general purpose installation tool. It can be used for installing a Beowulf cluster, a rendering farm, a web server farm, or a linux laboratory or a classroom. Even installing a HPC cluster or a GRID and fabric management can be realized by FAI. Large-scale linux networks with different hardware and different installation requirements are easy to establish using FAI and its class concept. Remote OS installations, Linux rollout, mass unattended installation and automated server provisioning are other topics for FAI. The city of Munich is using the combination of GOsa and FAI for their Limux project. Features: * Boot methods: network boot (PXE), CD-ROM, USB stick, floppy disk * Installs Debian, Ubuntu, SuSe, CentOS, Mandriva, Solaris, ... * Centralized installation and configuration management * Installs XEN domains and Vserver

Related Articles

  • FAI wiki
  • FAI the fully automated installation framework for linux from Debian Administration
  • Fully Automatic Installation (FAI) Video Interviewby
  • Rolling Out Unattended Debian Installations by Carla Schroder from LinuxPlanet
  • A talk on fai and debian

    Click to read more ...

  • Saturday

    Product: DRBD - Distributed Replicated Block Device

    From their website: DRBD is a block device which is designed to build high availability clusters. This is done by mirroring a whole block device via (a dedicated) network. You could see it as a network raid-1. DRBD takes over the data, writes it to the local disk and sends it to the other host. On the other host, it takes it to the disk there. The other components needed are a cluster membership service, which is supposed to be heartbeat, and some kind of application that works on top of a block device. Examples: A filesystem & fsck. A journaling FS. A database with recovery capabilities. Each device (DRBD provides more than one of these devices) has a state, which can be 'primary' or 'secondary'. On the node with the primary device the application is supposed to run and to access the device (/dev/drbdX). Every write is sent to the local 'lower level block device' and to the node with the device in 'secondary' state. The secondary device simply writes the data to its lower level block device. Reads are always carried out locally. If the primary node fails, heartbeat is switching the secondary device into primary state and starts the application there. (If you are using it with a non-journaling FS this involves running fsck) If the failed node comes up again, it is a new secondary node and has to synchronise its content to the primary. This, of course, will happen whithout interruption of service in the background. And, of course, we only will resynchronize those parts of the device that actually have been changed. DRBD has always done intelligent resynchronization when possible. Starting with the DBRD-0.7 series, you can define an "active set" of a certain size. This makes it possible to have a total resync time of 1--3 min, regardless of device size (currently up to 4TB), even after a hard crash of an active node.

    Related Articles

  • How to build a redundant, high-availability system with DRBD and Heartbeat by Pedro Pla in Linux Journal
  • Linux-HA Press Room with many excellent high availability articles.
  • Sync Data on All Servers thread.
  • MySQL clustering strategies and comparisions
  • Wikipedia on DRBD
  • Using Xen for High Availability Clusters by by Kris Buytaert and Johan Huysmans in
  • DRBD for MySQL High Availability

    Click to read more ...

  • Thursday

    Announce: First Meeting of Boston Scalability User Group

    The first meeting will take place on Wednesday March 26 at 6 p.m. in the IBM Innovation Center (Waltham, MA). The first speaker will be Patrick Peralta of Oracle! Patrick will be presenting: Orchestrating Messaging, Data Grid and Database for Scalable Performance. Important Note: There will be pizza at this meeting! The site is at:

    Click to read more ...


    Oprah is the Real Social Network

    A lot of new internet TV station startups are in the wind these days and there's a question about how they can scale their broadcasts. Today's state of the art shows you can't yet mimic the reach of broadcast TV with internet tech. But as Oprah proves, you can still capture a lot of eyeballs, if you are Oprah... Oprah drew a stunning 500,000 simultaneous viewers for an Eckhart Tolle webcast. Move Networks and Limelight Networks hosted the "broadcast" where traffic peaked at 242Gbps. A variable bitrate scheme was used so depending on their connection, a viewer could have seen 150Kbps or as high as 750Kbps. Dan Rayburn thinks The big take away from this webcast is that it shows proof that the Internet is not built to handle TV like distribution and those who think that live TV shows will be broadcast on the Internet with millions and millions of people watching, it's just not going to happen. To handle more users comments suggested capping the bitrate at 300K, using P2P streaming, or using a CDN more specialized in live streaming. I went to Oprah's website and was a bit shocked to find she didn't have full blown social network available. Can you imagine if she did? Oprah's army would seem to be a highly desirable bunch to monetize.

    Click to read more ...


    Manage Downtime Risk by Connecting Multiple Data Centers into a Secure Virtual LAN

    Update: VcubeV - an OpenVPN-based solution designed to build and operate a multisourced infrastructure. True high availability requires a presence in multiple data centers. The recent downtime of even a high quality operation like Amazon makes this need all the more clear. Typically only the big boys can afford the complexity of operating in two or more data centers. Cloud computing along with utility billing starts to change that equation, leveling the playing field. Even smaller outfits will be in a position to manage risk by spreading machines amongst EC2, 3tera, Slicehost, Mosso and other providers. The question then becomes: given we aren't Angels, how do we walk amongst the clouds? One fascinating answer is exquisitely explained by Dmitriy Samovskiy in his Linux Journal article titled Building a Multisourced Infrastructure Using OpenVPN. Dmitriy's idea is to create a secure UDP tunnel between different data centers over public internet links so your application sees a flat virtual network even though the machines run in different data centers. Your machines think they are on the same local network when in reality clusters of machines are maintained in multiple locations communicating over the internet. This impossible sounding task is well described in his article and involves setting up OpenVPN and a lot of tricky bits of configuration. Your reward? Geographical redundancy, encrypted communications, higher fault tolerance, nearest resource routing, better horizontal scalability, and greater vendor independence. Dmitriy points out there are some potential issues with this architecture:

  • Broadcasting and multicasting will not work over the tunnel.
  • Latency over the public network is higher over the public network than it is with your local Ethernet.
  • Tunnels tend to go up and down more than an Ethernet network. Having used a setup like this before it's quite possible to have very fast backbone links connecting data centers so the latency, bandwidth, and connection quality issues can be a lot less than you might think, or they could be an absolute killer. The broadcast/multicast problem did come up, but there always alternative approaches that don't require this ability.

    A Few Questions for Dmitriy

    I asked Dmitriy a few questions and he was kind enough to respond with the following answers: 1. Why would I want to create a virtual LAN rather than create a service layer and access services over http? This depends on what kind of services we are talking about. With hosts in 2 different datacenters which are operated by different hosting companies, and assuming no private connectivity (like a private T1 which you pay for and support), the only way for hosts to talk to each other is via public Internet. If the data your services will be exchanging do not need to be protected from external eyes and you don't need to restrict access directly to services from Internet, then service layer and access over http would definitely be easier. However, if you don't want public access to those services, the first thing we did was have a firewall and restrict who can access which service by IP. For example, we provision machines as needed at Server Beach, one machine at a time (as I said, our operation is currently relatively small). And we handle user auth from LDAP. Whenever we get a new machine, we adjust its firewall and adjust firewalls on all other machines which it's going to communicate with. In our case, we adjusted firewall on LDAP server so a new host could talk to LDAP. With time this peer-to-peer firewall adjusting became too error prone and time consuming as the number of hosts you have goes up. Besides, it breaks change isolation to a certain extent - when bringing up a new host, I have to adjust existing production. In our example - we set up LDAP replica and now all hosts needed to be reconfigured to failover to replica if the primary was not reachable - which meant a lot of firewall changes on multiple hosts. With more services and more hosts, I was dreading we'd end up with a pile of unmanageable firewall rules. Another aspect missing was data encryption when data pass on public Internet links. Was no big deal for us at the moment, but sooner or later everybody starts worrying about this so I took a preemptive shot. Vanilla OpenVPN helped us kill these 2 birds with one stone. We got encryption and once a server has a virtual IP, it's easier to manage firewalls - I choose to manage it on server side (so in our example, on LDAP server). Our dynamic routing script allowed us to have a pair of active-active OpenVPN servers, lack of which would have been a show stopper for me. There are also 2 key benefits of OpenVPN that I like a lot: a. passes through NAT and firewalls (since it's UDP). I can have a machine behind all sorts of firewalls and on network and I still can ssh to it from anywhere in the cloud (using its virtual IP). Works great for VMs with NAT networking type. b. you can assign static virtual IPs to hosts based on ssl key/cert pairs. This comes very handy when you start thinking about Amazon EC2 and their lack of static IP addrs at the moment. 2. Can I connect more than two data center in a pairwise configuration? Yes you can, provided all your hosts that need to connect to VcubeV have physical network connectivity to at least one OpenVPN server (either over LAN or WAN). Plus, at least one OpenVPN server needs to be accessible by the other OpenVPN server. Please see my terrible diagram within the article at . If you want more than 2 OpenVPN servers, please see my (4) below. 3. You mention the downsides are manageable by making certain architectural choices. Could you please describe these? Sure, it's pretty much what I said in the conclusion section in the article. Primarily it's "don't multisource if an app delivers better value when singlesourced." Term "better value" will vary from architect to architect. All of these solutions would require further experimentation. 1. No broadcast or multicast. Solution: look into using OpenVPN on top of `tap' devices instead of `tun'. I personally would not multisource an app that does broadcast or multicast, since it's too low level and imho is likely to have other issues with being deployed in environment which is drastically different from what its designers had in mind. 2. Latency. One depends on public Internet links, so latency can't be controlled. Solution: anticipate latency, application retry logic, adjustable timeouts. If latency is a key aspect of application (trading, for example), don't multisource or at least think twice. 3. Link flapping. Solution: retry logic, avoid long-running TCP connections, forcefully break and re-establish TCP connections regularly, application level heartbeats, use TCP tunnels instead of UDP tunnels, consider data caches (memcached). 4. No more than 2 OpenVPN servers. It's a design limitation of current version of cube-routed. Solution: rewrite cube-routed to share route information using a more advanced protocol that allows many-to-many sharing.

    What Will the Future Look Like?

    It seems clear to me we are going to need a whole new set of tools and infrastructure for managing, deploying, creating, expanding, upgrading, and monitoring applications across multiple clouds. The advantages of multi-cloud deployment are too great to ignore. We need a Data Center API so we can treat all the different clouds as peers and operate on them like one big exposed object instead of individually specialized niches. Will we see real-time markets develop where clouds bid for your network/CPU/storage business and you can dynamically allocate applications to cloud vendors in order to minimize costs?

    Click to read more ...

  • Monday

    Read This Site and Ace Your Next Interview!

    Paul Tyma published a massive and massively good 96 page insider's manual on How to Pass a Silicon Valley Software Engineering Interview. My eyes immediately latched on to one of his key example scenarios, which involves scaling Facebook:


    ● What was Facebook day 1? – A database with a PHP front-end ● In PHP, Java, C#, whatever – How long would it take you to reproduce Facebook's first incarnation? ● A single MySQL instance with some simple queries probably used to happily query the whole userbase.


    ● What is it today? ● Its not about “that stuff you learned in school” – Its about what a company with thousands of (possibly conflicting) queries per second operating on a directed-graph with 50 million nodes ● And of course a few Petabytes of data ● And 99.99% uptime ● Design decision? A Facebook user is (or recently was) currently limited to 5000 friends. If you've been reading all the wisdom contributed to and referenced by this website you might just rock this interview and put a little more money in your pocket. So this site isn't a total waste of time :-) Yet I wonder how we can have 96 pages on interviewing and still not talk about software development at all?

    Click to read more ...


    Two data streams for a happy website

    One of the most important architectural decisions that must be done early on in a scalable web site project is splitting the data flow into two streams: one that is user specific and one that is generic. If this is done properly, the system will be able to grow easily. On the other hand, if the data streams are not separated from the start, then the growth options will be severely limited. Trying to make such a web site scale will be just painting the corpse, and this change will cost a whole lot more when you need to introduce it later (and it is "when" in this case, not "if").

    Click to read more ...


    Product: System Imager - Automate Deployment and Installs

    From their website: SystemImager is software that makes the installation of Linux to masses of similar machines relatively easy. It makes software distribution, configuration, and operating system updates easy, and can also be used for content distribution. SystemImager makes it easy to do automated installs (clones), software distribution, content or data distribution, configuration changes, and operating system updates to your network of Linux machines. You can even update from one Linux release version to another! It can also be used to ensure safe production deployments. By saving your current production image before updating to your new production image, you have a highly reliable contingency mechanism. If the new production enviroment is found to be flawed, simply roll-back to the last production image with a simple update command! Some typical environments include: Internet server farms, database server farms, high performance clusters, computer labs, and corporate desktop environments.

    Related Articles

  • Cluster Admin's article Installing and updating your nodes is an excellent introduction to SystemImager. He says it's fast, scalable, simple, makes it easy to install on running nodes, allows management of different OS images and remote installation on any given group of nodes.
  • Automate Linux installation and recovery with SystemImager by Paul Virijevich

    Click to read more ...

  • Tuesday

    Architecture to Allow High Availability File Upload

    Hi, I was wondering if anyone has found any information on how to architect a system to support high availability file uploads. My scenario: I have an Apache server proxying requests to a bunch of Tomcat Java application servers. When I need to upgrade my site, I stop and upgrade each of the Tomcat servers one at a time. This seems to work well as Apache automatically routes subsequent requests for the stopped app server to the remaining app servers that are up. The problem is that if a user is uploading a file when the app server is stopped, the upload fails and the user has to upload the file again. This is problematic as uploading files is an integral feature of the site and it's frustrating for the users to have to restart their uploads every time I upgrade the site (which I want to be able to do frequently). Has anyone seen any information on how this can be done or have ideas on how this can be architected? I imagine sites like Flickr must have a solution to this problem as I have seen presentations they say that they are able to upgrade their site several times a day without the users noticing. Thanks! Tuyen

    Click to read more ...