Product: DRBD - Distributed Replicated Block Device

From their website: DRBD is a block device which is designed to build high availability clusters. This is done by mirroring a whole block device via (a dedicated) network. You could see it as a network raid-1. DRBD takes over the data, writes it to the local disk and sends it to the other host. On the other host, it takes it to the disk there. The other components needed are a cluster membership service, which is supposed to be heartbeat, and some kind of application that works on top of a block device. Examples: A filesystem & fsck. A journaling FS. A database with recovery capabilities. Each device (DRBD provides more than one of these devices) has a state, which can be 'primary' or 'secondary'. On the node with the primary device the application is supposed to run and to access the device (/dev/drbdX). Every write is sent to the local 'lower level block device' and to the node with the device in 'secondary' state. The secondary device simply writes the data to its lower level block device. Reads are always carried out locally. If the primary node fails, heartbeat is switching the secondary device into primary state and starts the application there. (If you are using it with a non-journaling FS this involves running fsck) If the failed node comes up again, it is a new secondary node and has to synchronise its content to the primary. This, of course, will happen whithout interruption of service in the background. And, of course, we only will resynchronize those parts of the device that actually have been changed. DRBD has always done intelligent resynchronization when possible. Starting with the DBRD-0.7 series, you can define an "active set" of a certain size. This makes it possible to have a total resync time of 1--3 min, regardless of device size (currently up to 4TB), even after a hard crash of an active node.

Related Articles

  • How to build a redundant, high-availability system with DRBD and Heartbeat by Pedro Pla in Linux Journal
  • Linux-HA Press Room with many excellent high availability articles.
  • Sync Data on All Servers thread.
  • MySQL clustering strategies and comparisions
  • Wikipedia on DRBD
  • Using Xen for High Availability Clusters by by Kris Buytaert and Johan Huysmans in
  • DRBD for MySQL High Availability

    Click to read more ...

  • Thursday

    Announce: First Meeting of Boston Scalability User Group

    The first meeting will take place on Wednesday March 26 at 6 p.m. in the IBM Innovation Center (Waltham, MA). The first speaker will be Patrick Peralta of Oracle! Patrick will be presenting: Orchestrating Messaging, Data Grid and Database for Scalable Performance. Important Note: There will be pizza at this meeting! The site is at:

    Click to read more ...


    Oprah is the Real Social Network

    A lot of new internet TV station startups are in the wind these days and there's a question about how they can scale their broadcasts. Today's state of the art shows you can't yet mimic the reach of broadcast TV with internet tech. But as Oprah proves, you can still capture a lot of eyeballs, if you are Oprah... Oprah drew a stunning 500,000 simultaneous viewers for an Eckhart Tolle webcast. Move Networks and Limelight Networks hosted the "broadcast" where traffic peaked at 242Gbps. A variable bitrate scheme was used so depending on their connection, a viewer could have seen 150Kbps or as high as 750Kbps. Dan Rayburn thinks The big take away from this webcast is that it shows proof that the Internet is not built to handle TV like distribution and those who think that live TV shows will be broadcast on the Internet with millions and millions of people watching, it's just not going to happen. To handle more users comments suggested capping the bitrate at 300K, using P2P streaming, or using a CDN more specialized in live streaming. I went to Oprah's website and was a bit shocked to find she didn't have full blown social network available. Can you imagine if she did? Oprah's army would seem to be a highly desirable bunch to monetize.

    Click to read more ...


    Manage Downtime Risk by Connecting Multiple Data Centers into a Secure Virtual LAN

    Update: VcubeV - an OpenVPN-based solution designed to build and operate a multisourced infrastructure. True high availability requires a presence in multiple data centers. The recent downtime of even a high quality operation like Amazon makes this need all the more clear. Typically only the big boys can afford the complexity of operating in two or more data centers. Cloud computing along with utility billing starts to change that equation, leveling the playing field. Even smaller outfits will be in a position to manage risk by spreading machines amongst EC2, 3tera, Slicehost, Mosso and other providers. The question then becomes: given we aren't Angels, how do we walk amongst the clouds? One fascinating answer is exquisitely explained by Dmitriy Samovskiy in his Linux Journal article titled Building a Multisourced Infrastructure Using OpenVPN. Dmitriy's idea is to create a secure UDP tunnel between different data centers over public internet links so your application sees a flat virtual network even though the machines run in different data centers. Your machines think they are on the same local network when in reality clusters of machines are maintained in multiple locations communicating over the internet. This impossible sounding task is well described in his article and involves setting up OpenVPN and a lot of tricky bits of configuration. Your reward? Geographical redundancy, encrypted communications, higher fault tolerance, nearest resource routing, better horizontal scalability, and greater vendor independence. Dmitriy points out there are some potential issues with this architecture:

  • Broadcasting and multicasting will not work over the tunnel.
  • Latency over the public network is higher over the public network than it is with your local Ethernet.
  • Tunnels tend to go up and down more than an Ethernet network. Having used a setup like this before it's quite possible to have very fast backbone links connecting data centers so the latency, bandwidth, and connection quality issues can be a lot less than you might think, or they could be an absolute killer. The broadcast/multicast problem did come up, but there always alternative approaches that don't require this ability.

    A Few Questions for Dmitriy

    I asked Dmitriy a few questions and he was kind enough to respond with the following answers: 1. Why would I want to create a virtual LAN rather than create a service layer and access services over http? This depends on what kind of services we are talking about. With hosts in 2 different datacenters which are operated by different hosting companies, and assuming no private connectivity (like a private T1 which you pay for and support), the only way for hosts to talk to each other is via public Internet. If the data your services will be exchanging do not need to be protected from external eyes and you don't need to restrict access directly to services from Internet, then service layer and access over http would definitely be easier. However, if you don't want public access to those services, the first thing we did was have a firewall and restrict who can access which service by IP. For example, we provision machines as needed at Server Beach, one machine at a time (as I said, our operation is currently relatively small). And we handle user auth from LDAP. Whenever we get a new machine, we adjust its firewall and adjust firewalls on all other machines which it's going to communicate with. In our case, we adjusted firewall on LDAP server so a new host could talk to LDAP. With time this peer-to-peer firewall adjusting became too error prone and time consuming as the number of hosts you have goes up. Besides, it breaks change isolation to a certain extent - when bringing up a new host, I have to adjust existing production. In our example - we set up LDAP replica and now all hosts needed to be reconfigured to failover to replica if the primary was not reachable - which meant a lot of firewall changes on multiple hosts. With more services and more hosts, I was dreading we'd end up with a pile of unmanageable firewall rules. Another aspect missing was data encryption when data pass on public Internet links. Was no big deal for us at the moment, but sooner or later everybody starts worrying about this so I took a preemptive shot. Vanilla OpenVPN helped us kill these 2 birds with one stone. We got encryption and once a server has a virtual IP, it's easier to manage firewalls - I choose to manage it on server side (so in our example, on LDAP server). Our dynamic routing script allowed us to have a pair of active-active OpenVPN servers, lack of which would have been a show stopper for me. There are also 2 key benefits of OpenVPN that I like a lot: a. passes through NAT and firewalls (since it's UDP). I can have a machine behind all sorts of firewalls and on network and I still can ssh to it from anywhere in the cloud (using its virtual IP). Works great for VMs with NAT networking type. b. you can assign static virtual IPs to hosts based on ssl key/cert pairs. This comes very handy when you start thinking about Amazon EC2 and their lack of static IP addrs at the moment. 2. Can I connect more than two data center in a pairwise configuration? Yes you can, provided all your hosts that need to connect to VcubeV have physical network connectivity to at least one OpenVPN server (either over LAN or WAN). Plus, at least one OpenVPN server needs to be accessible by the other OpenVPN server. Please see my terrible diagram within the article at . If you want more than 2 OpenVPN servers, please see my (4) below. 3. You mention the downsides are manageable by making certain architectural choices. Could you please describe these? Sure, it's pretty much what I said in the conclusion section in the article. Primarily it's "don't multisource if an app delivers better value when singlesourced." Term "better value" will vary from architect to architect. All of these solutions would require further experimentation. 1. No broadcast or multicast. Solution: look into using OpenVPN on top of `tap' devices instead of `tun'. I personally would not multisource an app that does broadcast or multicast, since it's too low level and imho is likely to have other issues with being deployed in environment which is drastically different from what its designers had in mind. 2. Latency. One depends on public Internet links, so latency can't be controlled. Solution: anticipate latency, application retry logic, adjustable timeouts. If latency is a key aspect of application (trading, for example), don't multisource or at least think twice. 3. Link flapping. Solution: retry logic, avoid long-running TCP connections, forcefully break and re-establish TCP connections regularly, application level heartbeats, use TCP tunnels instead of UDP tunnels, consider data caches (memcached). 4. No more than 2 OpenVPN servers. It's a design limitation of current version of cube-routed. Solution: rewrite cube-routed to share route information using a more advanced protocol that allows many-to-many sharing.

    What Will the Future Look Like?

    It seems clear to me we are going to need a whole new set of tools and infrastructure for managing, deploying, creating, expanding, upgrading, and monitoring applications across multiple clouds. The advantages of multi-cloud deployment are too great to ignore. We need a Data Center API so we can treat all the different clouds as peers and operate on them like one big exposed object instead of individually specialized niches. Will we see real-time markets develop where clouds bid for your network/CPU/storage business and you can dynamically allocate applications to cloud vendors in order to minimize costs?

    Click to read more ...

  • Monday

    Read This Site and Ace Your Next Interview!

    Paul Tyma published a massive and massively good 96 page insider's manual on How to Pass a Silicon Valley Software Engineering Interview. My eyes immediately latched on to one of his key example scenarios, which involves scaling Facebook:


    ● What was Facebook day 1? – A database with a PHP front-end ● In PHP, Java, C#, whatever – How long would it take you to reproduce Facebook's first incarnation? ● A single MySQL instance with some simple queries probably used to happily query the whole userbase.


    ● What is it today? ● Its not about “that stuff you learned in school” – Its about what a company with thousands of (possibly conflicting) queries per second operating on a directed-graph with 50 million nodes ● And of course a few Petabytes of data ● And 99.99% uptime ● Design decision? A Facebook user is (or recently was) currently limited to 5000 friends. If you've been reading all the wisdom contributed to and referenced by this website you might just rock this interview and put a little more money in your pocket. So this site isn't a total waste of time :-) Yet I wonder how we can have 96 pages on interviewing and still not talk about software development at all?

    Click to read more ...


    Two data streams for a happy website

    One of the most important architectural decisions that must be done early on in a scalable web site project is splitting the data flow into two streams: one that is user specific and one that is generic. If this is done properly, the system will be able to grow easily. On the other hand, if the data streams are not separated from the start, then the growth options will be severely limited. Trying to make such a web site scale will be just painting the corpse, and this change will cost a whole lot more when you need to introduce it later (and it is "when" in this case, not "if").

    Click to read more ...


    Product: System Imager - Automate Deployment and Installs

    From their website: SystemImager is software that makes the installation of Linux to masses of similar machines relatively easy. It makes software distribution, configuration, and operating system updates easy, and can also be used for content distribution. SystemImager makes it easy to do automated installs (clones), software distribution, content or data distribution, configuration changes, and operating system updates to your network of Linux machines. You can even update from one Linux release version to another! It can also be used to ensure safe production deployments. By saving your current production image before updating to your new production image, you have a highly reliable contingency mechanism. If the new production enviroment is found to be flawed, simply roll-back to the last production image with a simple update command! Some typical environments include: Internet server farms, database server farms, high performance clusters, computer labs, and corporate desktop environments.

    Related Articles

  • Cluster Admin's article Installing and updating your nodes is an excellent introduction to SystemImager. He says it's fast, scalable, simple, makes it easy to install on running nodes, allows management of different OS images and remote installation on any given group of nodes.
  • Automate Linux installation and recovery with SystemImager by Paul Virijevich

    Click to read more ...

  • Tuesday

    Architecture to Allow High Availability File Upload

    Hi, I was wondering if anyone has found any information on how to architect a system to support high availability file uploads. My scenario: I have an Apache server proxying requests to a bunch of Tomcat Java application servers. When I need to upgrade my site, I stop and upgrade each of the Tomcat servers one at a time. This seems to work well as Apache automatically routes subsequent requests for the stopped app server to the remaining app servers that are up. The problem is that if a user is uploading a file when the app server is stopped, the upload fails and the user has to upload the file again. This is problematic as uploading files is an integral feature of the site and it's frustrating for the users to have to restart their uploads every time I upgrade the site (which I want to be able to do frequently). Has anyone seen any information on how this can be done or have ideas on how this can be architected? I imagine sites like Flickr must have a solution to this problem as I have seen presentations they say that they are able to upgrade their site several times a day without the users noticing. Thanks! Tuyen

    Click to read more ...


    Make Your Site Run 10 Times Faster

    This is what Mike Peters says he can do: make your site run 10 times faster. His test bed is "half a dozen servers parsing 200,000 pages per hour over 40 IP addresses, 24 hours a day." Before optimization CPU spiked to 90% with 50 concurrent connections. After optimization each machine "was effectively handling 500 concurrent connections per second with CPU at 8% and no degradation in performance." Mike identifies six major bottlenecks:

  • Database write access (read is cheaper)
  • Database read access
  • PHP, ASP, JSP and any other server side scripting
  • Client side JavaScript
  • Multiple/Fat Images, scripts or css files from different domains on your page
  • Slow keep-alive client connections, clogging your available sockets Mike's solutions:
  • Switch all database writes to offline processing
  • Minimize number of database read access to the bare minimum. No more than two queries per page.
  • Denormalize your database and Optimize MySQL tables
  • Implement MemCached and change your database-access layer to fetch information from the in-memory database first.
  • Store all sessions in memory.
  • If your system has high reads, keep MySQL tables as MyISAM. If your system has high writes, switch MySQL tables to InnoDB.
  • Limit server side processing to the minimum.
  • Precompile all php scripts using eAccelerator
  • If you're using WordPress, implement WP-Cache
  • Reduce size of all images by using an image optimizer
  • Merge multiple css/js files into one, Minify your .js scripts
  • Avoid hardlinking to images or scripts residing on other domains.
  • Put .css references at the top of your page, .js scripts at the bottom.
  • Install FireFox FireBug and YSlow. YSlow analyze your web pages on the fly, giving you a performance grade and recommending the changes you need to make.
  • Optimize httpd.conf to kill connections after 5 seconds of inactivity, turn gzip compression on.
  • Configure Apache to add Expire and ETag headers, allowing client web browsers to cache images, .css and .js files
  • Consider dumping Apache and replacing it with Lighttpd or Nginx. Find more details in Mike's article.

    Click to read more ...

  • Monday

    Architecture Template Advice Needed

    Here's my template for describing the architecture of a system. The idea is to have people fill out this template and that then becomes the basis for a profile. This is how the Friends for Sale post was created and I think that turned out well. People always want more detail, but realistically you can only expect so much. The template is definitely too long, but it's more just a series of questions to jog people's memories and then they can answer whatever they think is important. What I want to ask is if you can think of any things to add/delete/change in the template? What do you want to know about the systems people are building? So if you have the time, please take a look and tell me what you think.

    Getting to Know You

    * What is the name of your system and where can we find out more about it? * What is your system is for? * Why did you decide to build this system? * How is your project financed? * What is your revenue model? * How do you market your product? * How long have you been working on it? * How big is your system? Try to give a feel for how much work your system does. * Number of unique visitors? * Number of monthly page views? * What is your in/out bandwidth usage? * How many documents, do you serve? How many images? How much data? * How fast are you growing? * What is your ratio of free to paying users? * What is your user churn? * How many accounts have been active in the past month?

    How is your system architected?

    * What is the architecture of your system? Talk about how your system works in as much detail as you feel comfortable with. * What particular design/architecture/implementation challenges does your system have? * What did you do to meet these challenges? * How does your system evolve to meet new scaling challenges? * Do you use any particularly cool technologies are algorithms? * What do you do that is unique and different that people could best learn from? * What lessons have you learned? * Why have you succeeded? * What do you wish you would have done differently? * What wouldn't you change? * How much up front design should you do? * How are you thinking of changing your architecture in the future?

    How is your team setup?

    * How many people are in your team? * Where are they located? * Who performs what roles? * Do you have a particular management philosophy? * If you have a distributed team how do you make that work? * What skillets does your team possess? * What is your development environment? * What is your development process? * Is there anything that you would do different or that you have found surprising?

    What infrastructure do you use?

    * Which languages do you use to develop your system? * How many servers do you have? * How is functionality allocated to the servers? * How are the servers provisioned? * What operating systems do you use? * Which web server do you use? * Which database do you use? * Do you use a reverse proxy? * Do you collocate, use a grid service, use a hosting service, etc? * What is your storage strategy? DAS/SAN/NAS/SCSI/SATA/etc/other? * How much capacity do you have? * How do you grow capacity? * Do you use a storage service? * Do you use storage virtualization? * How do you handle session management? * How is your database architected? Master/slave? Shard? Other? * How do you handle load balancing? * Which web framework/AJAX Library do you use? * Which real-time messaging frame works do you use? * Which distributed job management system do you use? * How do you handle ad serving? * Do you have a standard API to your website? If so, how do you implement it? * If you use a dynamic language which instruction caching product to use? * What is your object and content caching strategy? * What is your client side caching strategy? * Which third party services do you use to help build your system?

    How do you manage your system?

    * How do check global availability and simulate end-user performance? * How do you health check your server and networks? * How you do graph network and server statistics and trends? * How do you test your system? * How you analyze performance? * How do you handle security?

    How do you handle customer support?

    How do you decide what features to add/keep?

    * Do you implement web analytics? * Do you do A/B testing? ! How is your data center setup? * How many data centers do you run in? * How is your system deployed in data centers? * Are your data centers active/active, active/passive? * How do you handle syncing between data centers and fail over and load balancing? * Which firewall product do you use? * Which DNS service do you use? * Which routers do you use? * Which switches do you use? * Which email system do you use? * How do you handle spam? * How do you handle virus checking of email and uploads? * How do you backup and restore your system? * How are software and hardware upgrades rolled out? * How do you handle major changes in database schemas on upgrades? * What is your fault tolerance and business continuity plan? * Do you have a separate operations team managing your website? * Do you use a content delivery network? If so, which one and what for? * How much do you pay monthly for your setup?


    * Who do you admire? * Have you patterned your company/approach on someone else? * Are there any questions you would add/remove/change in this list?

    Click to read more ...