Storage Systems for High Scalable Systems presentation

Highly scalable systems (i.e. websites) such as Google, Facebook, and Amazon need a highly scalable storage system that can handle huge amounts of data with high availability and reliability. Building large systems on top of a traditional RDBMS data storage layer is no longer good enough. This presentation explores the landscape of new technologies available today to augment your data layer to improve performance and reliability.

Remember: all of my presentations' content is open source. Please feel free to use it, copy it, and redistribute it as you want.

Download the presentation


Cheap storage: how Backblaze takes matters in hand

Backblaze blogs about how they built their own storage infrastructure on the cheap to run their cloud backup service. This episode: the hardware.

Sorry, just a link this time.


Scaling MySQL on Amazon Web Services

I've recently started working with a large company that is looking to take one of their heavily utilized applications and move it to Amazon Web Services. I'm not looking to start a debate on the merits of EC2; the decision to move to AWS is already made (and is a much better decision than paying a vendor millions to host it).

I've done my research and I'm comfortable with creating this environment with one exception: scaling MySQL. I haven't done much work with MySQL; I'm more of an Oracle guy up to now. I'm struggling to determine a way to scale MySQL on the fly so that replication works, the server takes its proper place in line for master candidacy, and the Apache servers become aware of it.

So this is really three questions:

1. What are some proven methods of load balancing the read traffic going from Apache to MySQL?
2. How do I let the load balancing mechanism know when I add or remove a MySQL server?
3. How do I alert the master of the new server and initiate replication in an automated environment?

Personally, I don't like the idea of scaling the databases, but the traffic increases exponentially for three hours a day, and then plummets to almost nothing. So this would provide a significant cost savings.

The only approach I've found for managing this sort of scaling is described here on slides 18-25:
Has anyone tried this method and either had success or have scripts available to do this? I try not to reinvent the wheel when I don't have to. Thanks in advance.


Squarespace Architecture - A Grid Handles Hundreds of Millions of Requests a Month 

I first heard an enthusiastic endorsement of Squarespace streaming from the ubiquitous Leo Laporte on one of his many Twit Live shows. Squarespace as a fully hosted, completely managed environment for creating and maintaining a website, blog or portfolio was of interest to me because they promise scalability and this site doesn't have enough of that. But sadly, since they don't offer a link preserving Drupal import our relationship was not meant to be.

When a fine reader of High Scalability, Brian Egge, (and all my readers are thrifty, brave, and strong) asked me how Squarespace scaled I said I didn't know, but I would try and find out. I emailed Squarespace a few questions and founder Anthony Casalena and Director of Technical Operations Rolando Berrios were kind enough to reply in some detail. The questions were both from Brian and myself. Answers can be found below.

Two things struck me most about Squarespace's approach:

  • They based their system on a memory grid, in this case Oracle Coherence. I'm not aware of too many customer facing systems that have moved to a grid as the backbone of their scalability strategy. It's good to see a successful system visible out in the wild.
  • They use a sort of Private Cloud internally. Everything is highly automated and easy to expand. They scale by adding additional resources like CPUs and disks and the system just adapts without a lot of human fussing involved. Now that's scaling with gas.

    Learn more about how Squarespace has learned how to scale to tens of thousands of customers, hundreds of thousands of signups, and serve hundreds of millions of hits per month.


    The Stats

  • Tens of thousands of customers.
  • Hundreds of thousands of signups.
  • Serves hundreds of millions of hits per month.


  • Java - well supported and an advanced language to work in, and the components out there (Apache Foundation, etc.) are second to none.
  • Tomcat - the stability of the server is extremely impressive.
  • Grid - Oracle Coherence for the re-balancing and caching layers.
  • Storage - Isilon Cluster. This allows them to treat their storage like another "grid" as the storage pool is easily scaled by adding more diskspace.
  • Monetization Strategy - charge money. No free customers. Pricing starts at $8/month.
  • Uptime - 99.98%
  • Hosting - Peer1, they do not yet operate in multiple datacenters.
  • Competitors - TypePad and WordPress
  • Hardware - they don't use "commodity nodes" or low cost hardware units. These end up costing more in the long run as datacenter power is extremely expensive.
  • Cacti - a cacti instance is used to graph statistical data which helps see trends over time, predict when a hardware upgrade is necessary, and troubleshoot any problems that do show up.

    Lessons Learned

  • Cache as much as you can and load balance requests intelligently across a cluster.
  • Use an infrastructure that scales automatically merely by adding more resources (CPU, disk).
  • Build a scalable design up front. Make scaling easy by designing the application and infrastructure with scaling in mind.
  • Build a hands-off capable maintenance system. Automate processes. Make them as simple as possible. Monitor programmatically so people don't have to.
  • Release code early and often. Running on the latest code means problems can be detected quickly while they are still small.
  • Keep things simple. Apply simplicity to every part of your infrastructure, including both your software and those of your outside vendors. Examples of this are: Grid for the application infrastructure, Isilon cluster for storage, automation, creating their own tools.
  • Use as few technologies as possible by selecting or building simple, powerful and robust tools.
  • Don't be afraid to implement your own code to ensure simplicity. Build or buy is a huge balancing act.
  • Don't be afraid to spend money on technology that helps you get where you need to go. It can save you months and months of headaches that would have prevented you from working on core functionality.

    Interview Questions and Responses

    1. They say they run on a grid. I'd be interested to know if they built their own grid?

    Partially. We rely on Oracle's Coherence product for the re-balancing
    and caching layers of our system -- which we consider a real workhorse
    for the "grid" aspects of the system. Each node in our infrastructure
    can handle a hit for any single site on the system. This means that in order to increase capacity, we just increase node count. No site is handled by a single node.

    2. How much traffic can they really handle?

    We've had several customer sites on the front page of Digg on multiple
    occasions, and didn't notice any performance degradation for any of our
    sites. In fact, we didn't even realize the surge happened until we reviewed our traffic reports a few hours later. For 99% of sites out there, Squarespace is going to be sufficient. Even larger sites with millions of inbound hits per day are servable, as the bulk of the traffic serving on those sites is in the media being served.

    3. How do they scale up, and allow for certain sites to become quite busy?

    We've tried to make scaling easy, and the application and infrastructure
    have been designed with scaling in mind. Because of this, we're luckily not
    in a situation where we need to keep getting bigger and beefier hardware to handle more and more traffic -- we try to scale out by supplementing the
    grid. Since we try to cache as much as we can and every server
    participates in handling requests for every site, it's generally just a
    matter of adding another node to the environment.

    We try to apply this simplicity to every part of our infrastructure, both
    with our own software and when deciding on purchases from outside vendors. For instance, we just increased the amount of available storage another few terabytes by adding another node to our Isilon cluster.

    4. Are there any stats you can share about how many customers, how many users, how many requests served, how many servers, how much disk, how fast, how reliable?

    We, unfortunately, can't share these numbers as we're a private company
    -- but we can say we have tens of thousands of customers, hundreds
    of thousands of signups, and serve hundreds of millions of hits per
    month. The server types and disk configurations (RAID, etc) are a bit
    irrelevant, as the clustering we implement provides redundancy -- not
    anything implemented into a particular single machine. Nothing in
    hardware is too particular to our setup. I will say we don't purchase
    "commodity nodes" or other low cost hardware units, as we find these
    end up costing more in the long run as datacenter power is extremely
    expensive.

    5. What technology stack are you using and why did you make the choices you made?

    We currently use Java along with Tomcat as our web server. After
    trying a few other solutions, we really appreciated the ability to use
    as few technologies as possible, and have those always remain things
    that are understandable for us. Java is an incredibly well supported
    and advanced language to work in, and the components out there (Apache
    Foundation, etc.) are second to none. As for Tomcat, the stability of
    the server is extremely impressive. We've implemented our own
    controller mechanisms on top of Tomcat (instead of going with some
    other library) in order to ensure extreme simplicity.

    6. How are you handling...


    As mentioned above, every web node handles traffic for all sites, so a
    customer doesn't have to worry about an underpowered server unable to handle their traffic, or a node going down.


    Backups are obviously important to us, and we have several copies of user
    and server data stored in multiple locations. We gather backups with a
    combination of various home-grown scripts customized for our environment.

    Failover? Monitoring?

    Since this company originally was solely maintained by Anthony when he
    first started it, things needed to be as simple and automated as possible.
    This includes failover and monitoring. Our monitoring systems check every
    aspect of our environment we can think of several times a minute, and can
    restart obviously dead services, or alert us if it's something an
    actual person needs to handle.

    Additionally, we've set up a cacti instance to graph as much statistical
    data as we can pull out of our servers, so we can see trends over time.
    This allows us to easily predict when a hardware upgrade is necessary. It also helps us troubleshoot any problems that do show up.

    Operations? Releases? Upgrades? Add new hardware?

    With our customer base constantly growing, it's getting tough to manage our systems and still keep our workload under control. There are some projects on the road map to move to a much more hands-off maintenance of our environment, including automatic code deployments and system software upgrades. Most operations can be done without taking the grid offline.

    Multiple data centers?

    We do not have multiple data centers, but have some plans in the works to
    roll one out within the next year.


    This is a really broad question, so it's a bit hard to succinctly
    answer. One thing (amongst many) that has consistently served us very
    well is trying to ensure our development environment is always
    releasable into production. By ensuring we're always out there with
    our latest code, we can usually detect problems very rapidly, and
    as a result, those problems are generally extremely small. Everyone on our development team tends to be responsible for wide, sweeping aspects of the system -- which gives them a lot of flexibility to determine how
    their components should work as a whole. It's incredibly important
    that everything fits seamlessly together in the end, so we spend a lot
    of time iterating on things that other groups might consider finished.


    Support is something we take extremely seriously. As we've grown from
    the ground up without an external investor, most of our team members
    are versed in support, and understand how critical this component is.
    Our support staff is completely hired from our community, and is
    incredibly passionate about their jobs. We try and get every single
    customer support inquiry answered within 15 minutes or less, and have all sorts of metrics related to our goals here.

    7. What have you done that's really cool that you think other people could learn from?

    We spend a lot of time internally writing scripts and other
    applications that simply run our business. For instance, our
    persistence layer configuration files are generated by applications
    we've written that read our database model directly from the database.
    We develop a lot of these programs, and a lot of "standard naming"--this, again, means that we can move very rapidly as we have less monotonous tasks and searching to think about.

    While this sort of thing is appropriate for small tasks, for the big
    ones, we also aren't afraid to spend money on well developed
    technology. Some of our choices for load balancing and storage are
    very costly, but end up saving us months and months of time in the
    long haul, as we've avoided having to "put out fires" generated by
    untested home grown solutions. It's a huge balancing act.

    The End

    Often the best way to judge a product is to peruse the developer forums. It's these people who know what's really happening. And when I look I see an almost complete absence of threads about performance, scalability, or reliability problems. Take a look at other CMSs and you'll see a completely different tenor of questions. That says something good about the strength of their scalability strategy.

    I'd really like to thank Squarespace for taking the time and making the effort to share what they've learned with the larger community. It's an effort we all benefit from. If you would also like to share your knowledge and wisdom with the world please get in touch and let's get started!

    Related Articles

  • Implementation Focus: Squarespace
  • Are Cloud Based Memory Architectures the Next Big Thing?
  • Up and running on Squarespace by Peter Efland
  • Kevin Rose Comes to Squarespace by D. Atkinson
  • Squarespace Vs Wordpress a thread in their developer forum.
  • Friday

    Strategy: Solve Only 80 Percent of the Problem

    Solve only 80% of a problem. That's usually good enough and you'll not only get done faster, you'll actually have a chance of getting done at all.

    This strategy is given by Amix in HOW TWITTER (AND FACEBOOK) SOLVE PROBLEMS PARTIALLY. The idea is that solving 100% of a complex problem can be so hard and so expensive that you'll end up wasting all your bullets on a problem that could have been satisfactorily solved in a much simpler way.

    The example given is for Twitter's real-time search. Real-time search almost by definition is focussed on recent events. So in the design should you be able to search historically back from the beginning of time or should you just be able to search for recent time periods? A complete historical search is the 100% solution. The recent data only search is the 80% solution. Which should you choose?

    The 100% solution is dramatically more difficult to solve. It requires searching disk in real-time which is a killer. So it makes more sense to work on the 80% problem because it will satisfy most of your users and is much more doable.

    By reducing the amount of data you need to search it's possible to make some simplifying design choices, like using fixed sized buffers that reside completely in memory. With that architecture your streaming searches can be blisteringly fast while returning the most relevant data. Users are happy and you are happy.
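As a rough illustration of the fixed-size, in-memory buffer idea, here is a minimal Python sketch. It is not Twitter's implementation: the class name `RecentSearchIndex`, the capacity, and the substring matching are all assumptions made for the example. The point is only that bounding the data to recent items lets everything live in memory and keeps each search cheap.

```python
from collections import deque


class RecentSearchIndex:
    """Hypothetical index that keeps only the newest `capacity` messages in memory."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item once it is full,
        # which gives a fixed-size buffer with no extra bookkeeping.
        self._buffer = deque(maxlen=capacity)

    def add(self, message):
        self._buffer.append(message)

    def search(self, term):
        # Linear scan, newest first. This is acceptable only because the
        # buffer size is bounded; older messages are simply not searchable.
        term = term.lower()
        return [m for m in reversed(self._buffer) if term in m.lower()]


index = RecentSearchIndex(capacity=3)
for text in ["old news about cats", "scaling tips", "more cats", "cats again"]:
    index.add(text)

results = index.search("cats")
print(results)
```

The oldest message has already fallen out of the buffer by the time the search runs, which is exactly the 80% trade: history is lost, recent results are fast.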

    It's not a 100% solution, but it's a good enough solution that works. Sometimes as programmers we are blinded by the glory of the challenge of solving the 100% solution when there's a more reasonable, rational alternative that's almost as good. Something to keep in mind when you are wondering how you'll possibly get it all done. Don't even try.

    Amix has a very good discussion of Twitter and this strategy on his blog.

    Worse is Better

    A Hacker News post discussing this article brought up that this strategy is the same as Richard Gabriel's famous Worse-is-Better paradox which holds: The right thing is frequently a monolithic piece of software, but for no reason other than that the right thing is often designed monolithically. That is, this characteristic is a happenstance. The lesson to be learned from this is that it is often undesirable to go for the right thing first. It is better to get half of the right thing available so that it spreads like a virus. Once people are hooked on it, take the time to improve it to 90% of the right thing.

    Unix, C, C++, Twitter and almost every product that has experienced wide adoption has followed this philosophy.

    Worse-is-Better solutions have the following characteristics:

  • Simplicity - The design must be simple, both in implementation and interface. It is more important for the implementation to be simpler than the interface. Simplicity is the most important consideration in a design.
  • Correctness - The design must be correct in all observable aspects. It is slightly better to be simple than correct.
  • Consistency - The design must not be overly inconsistent. Consistency can be sacrificed for simplicity in some cases, but it is better to drop those parts of the design that deal with less common circumstances than to introduce either implementational complexity or inconsistency.
  • Completeness - The design must cover as many important situations as is practical. All reasonably expected cases should be covered. Completeness can be sacrificed in favor of any other quality. In fact, completeness must be sacrificed whenever implementation simplicity is jeopardized. Consistency can be sacrificed to achieve completeness if simplicity is retained; especially worthless is consistency of interface.

    In my gut I think Worse-is-Better is different than "Solve Only 80 Percent of the Problem" primarily because Worse-is-Better is more about product adoption curves and 80% is more a design heuristic. After some cogitating this seems a false distinction, so I have to conclude I'm wrong and have added Worse-is-Better to this post.

    Related Articles

  • Worse Is Better Richard P. Gabriel
  • Lisp: Good News, Bad News, How to Win Big
  • Interesting Hacker News Thread
  • In Praise of Evolvable Systems by Clay Shirky
  • Big Ball of Mud by Brian Foote and Joseph Yoder
  • Wednesday

    Hot Links for 2009-8-26

  • I'm Going To Scale My Foot Up Your Ass - Shut up about scalability, no one is using your app anyway.
  • Multi-Tenant Data Architecture - Microsoft's take on different approaches to multitenancy.
  • Cloud computing rides on spiraling Energy costs - A report by US researchers has shown the increasing cost of power and cooling in the data centre is a driver towards cloud computing.
  • Interview: Apple’s Gigantic New Data Center Hints at Cloud Computing - Companies building centers this big are getting into cloud computing. Running apps in the cloud requires massive infrastructure: Google-size infrastructure.
  • What Does Cloud Computing Actually Cost? An Analysis of the Top Vendors - Amazon is currently the lowest cost cloud computing option overall. At least for production applications that need more than 6.5 hours of CPU/day, otherwise GAE is technically cheaper because it's free until this usage level.
  • no:sql(east) - October 28–30, 2009, Atlanta, GA. Very cute page playing off of SQL syntax.

    New Products and Updates

  • Gear6 Web Cache Virtual Appliance - a feature complete virtual machine (VM) of the Gear6 Web Cache software. It includes all the functionality of the Gear6 Web Cache including simulating Gear6 high density RAM-flash architecture.
  • Seamlessly Extending the Data Center - Introducing Amazon Virtual Private Cloud (VPC) - We have developed Amazon VPC to allow our customers to seamlessly extend their IT infrastructure into the cloud while maintaining the levels of isolation required for their enterprise management tools to do their work.
  • NetApp reveals cloud computing plan, new Data OnTap OS - "Our research shows users are very interested in scale-out technology," she said. "What's nice about it is as you add processor and storage resources, you get much higher storage utilization rates and the new scale-out system grows up to 14 petabytes, but it can still be managed in a single array."
  • The Big Cheese: Powerful Version Of Google Search Appliance Can Grow Exponentially.

    Updates to Articles on High Scalability

  • Streamy Explains CAP and HBase's Approach to CAP - We plan to employ inter-cluster replication, with each cluster located in a single DC. Remote replication will introduce some eventual consistency into the system, but each cluster will continue to be strongly consistent. Updated: How Google Serves Data from Multiple Datacenters.

    The fantasy sponsor for this post are those little food kiosks outside Home Depot stores. I love their Fire Dogs. Hot and yummy. I bet most home improvement projects in America are inspired by cravings for one of these little beauties.
  • Monday

    How Google Serves Data from Multiple Datacenters

    Update: Streamy Explains CAP and HBase's Approach to CAP. We plan to employ inter-cluster replication, with each cluster located in a single DC. Remote replication will introduce some eventual consistency into the system, but each cluster will continue to be strongly consistent.

    Ryan Barrett, Google App Engine datastore lead, gave this talk Transactions Across Datacenters (and Other Weekend Projects) at the Google I/O 2009 conference.

    While the talk doesn't necessarily break new technical ground, Ryan does an excellent job explaining and evaluating the different options you have when architecting a system to work across multiple datacenters. This is called multihoming, operating from multiple datacenters simultaneously.

    As multihoming is one of the most challenging tasks in all computing, Ryan's clear and thoughtful style comfortably leads you through the various options. On the trip you learn:

  • The different multi-homing options are: Backups, Master-Slave, Multi-Master, 2PC, and Paxos. You'll also learn how they each fare on support for consistency, transactions, latency, throughput, data loss, and failover.
  • Google App Engine uses master/slave replication between datacenters. They chose this approach in order to provide:
    - lowish latency writes
    - datacenter failure survival
    - strong consistency guarantees.
  • No solution is all win, so a compromise must be made depending on what you think is important. A major Google App Engine goal was to provide a strong consistency model for programmers. They also wanted to be able to survive datacenter failures. And they wanted write performance that wasn't too far behind a typical relational database. These priorities guided their architectural choices.
  • In the future they hope to offer optional models so you can select Paxos, 2PC, etc for your particular problem requirements (Yahoo's PNUTS does something like this).

    There's still a lot more to learn. Here's my gloss on the talk:

    Consistency - What happens when you read after a write?

    Read/write data is one of the hardest kinds of data to run across datacenters. Users expect a certain level of reliability and consistency.

  • Weak - it might be there, might not. Best effort. Like memcached. It's OK to drop for some applications like VoIP, live video, and multiplayer games. You care more about where things are now, not where they were. For data this is not good.
  • Eventual - You eventually see the stuff you wrote, just not right away. Email is a good example. You send it but it doesn't arrive right away, but it gets there, eventually. DNS change propagation, SMTP, Amazon S3, SimpleDB, search engine indexing are all of this type. There's a delay after a write when a read won't see what was written, but the writes eventually push through. Still not ideal for data.
  • Strong - The ideal solution for a structured data system. You get what you put in. Simplest to program against and think about. Any read after a write will return what was written. AppEngine, file systems, Microsoft Azure, and RDBMSes work this way.
  • Once we move data across datacenters what consistency guarantees do we have? We can give up some guarantees, but we should know what we are getting.

    Transactions - Extended form of consistency across multiple operations.

  • Transaction Properties: Correctness, consistency, enforce invariants, ACID.
  • Example: bank transaction. Transfer money from A to B. Subtract money from A and add to B. These happen at different times. What happens if another transfer happens for A in-between? What happens if there's a failure? What happens if a program reads from A or B? You want guarantees. On a crash will money added to B still be added to B? Will money taken from A still be taken from A? You don't want to lose or create money.
  • When you start operating across datacenters it's even harder to enforce transactions because more things can go wrong and operations have high latency.
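The bank transfer example can be sketched on a single machine with Python's built-in `sqlite3` module. This is only an illustration of the all-or-nothing guarantee inside one database, not of cross-datacenter transactions; the table layout, the `CHECK` constraint, and the `transfer` and `balances` helpers are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER CHECK (balance >= 0))"  # invariant: no negative balances
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 20)])
conn.commit()


def transfer(conn, source, target, amount):
    """Move `amount` from source to target, or change nothing at all."""
    # Using the connection as a context manager commits if the block
    # succeeds and rolls back if any statement raises.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, target),
        )
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, source),
        )


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


transfer(conn, "A", "B", 30)            # succeeds
try:
    transfer(conn, "A", "B", 500)       # would overdraw A, so it is rolled back
except sqlite3.IntegrityError:
    pass

print(balances(conn))
```

The failed transfer credits B before the debit of A violates the constraint, and the rollback undoes that credit, so no money is created or lost.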

    Why Operate in Multiple Datacenters?

  • Sh*t happens - datacenters fail for any number of reasons.
  • Performance - geolocality allows operations to be moved closer to the user. The speed of light limits how fast data can be transferred, and that becomes significant when operating across the world. Going through multiple router hops also slows traffic. So closer is better and you can only be closer if your data is near the user which requires operating in multiple datacenters. CDNs do this for you, especially for more static data. They put data everywhere.
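To put a number on the speed-of-light limit, here is a small back-of-the-envelope calculation in Python. The distances are illustrative assumptions, as is the common rule of thumb that light in optical fiber travels at roughly two thirds of its speed in a vacuum. Router hops and queuing, which the talk also mentions, only add to these lower bounds.

```python
SPEED_OF_LIGHT_KM_S = 299_792   # speed of light in a vacuum, km/s
FIBER_FACTOR = 2 / 3            # assumption: light in fiber moves at about 2/3 c


def min_round_trip_ms(distance_km):
    """Lower bound on round-trip time over fiber, ignoring routers and queuing."""
    one_way_s = distance_km / (SPEED_OF_LIGHT_KM_S * FIBER_FACTOR)
    return 2 * one_way_s * 1000


# Illustrative distances only.
for label, km in [
    ("same metro area", 50),
    ("across a continent", 4_000),
    ("across an ocean", 10_000),
]:
    print(f"{label}: at least {min_round_trip_ms(km):.1f} ms round trip")
```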

    Why Not Operate in Multiple Datacenters?

  • Operating in a single datacenter is easy: Low cost bandwidth. Low latency. High bandwidth. Easy operations. Easier code.
  • Operating in multiple datacenters is hard: high cost, high latency, low bandwidth, difficult operations, harder code.
  • It's especially hard if you have a read/write structured data system where you accept writes from more than one location. You have consistency problems. Maintaining consistency in the face of the distances and failures is non-trivial.

    Your Different Architecture Options

  • Single Datacenter. Don't bother operating in multiple datacenters. This is the easiest option and is what most people do. But datacenters fail, you could lose data, and your site could go down.
  • Bunkerize. Create a Maginot Line for the Ultimate Datacenter. Make sure your datacenter doesn't ever go down. SimpleDB and Azure use this strategy.
  • Single Master. Pick a master datacenter that writes go to and that other sites replicate from. The replica sites offer read-only services.
    - Better, but not great.
    - Data are usually replicated asynchronously so there's a window of vulnerability for loss.
    - Data in your other datacenters may not be consistent on failure.
    - Popular with financial institutions.
    - You get geolocation to serve reads. Consistency depends on the technique. Writes are still limited to one datacenter.
  • Multi-Master. True multihoming. The Holy Grail. All datacenters are serving reads and writes. All data is consistent. Transactions just work. This is really hard.
    - So some choose to do it with just two datacenters. NASDAQ has two datacenters close together (low latency) and performs a two-phase commit on every transaction, but they have very strict latency requirements.
    - Using more than two datacenters is fundamentally harder. You pay for it with queuing delays, routing delays, and the speed of light. You have to talk between datacenters, which is fundamentally slower and goes over a smaller pipe. You may pay for it with capacity and throughput, but you'll definitely pay in latency.

    How Do You Actually Do This?

    What are the techniques and tradeoffs of different approaches? Here's the evaluation matrix:

                  Backups  M/S        MM          2PC         Paxos
    Consistency   Weak     Eventual   Eventual    Strong      Strong
    Transactions  No       Full       Local       Full        Full
    Latency       Low      Low        Low         High        High
    Throughput    High     High       High        Low         Medium
    Data loss     Lots     Some       Some        None        None
    Failover      Down     Read-only  Read/Write  Read/Write  Read/Write

    - M/S = master/slave, MM = multi-master, 2PC = 2 Phase Commit
    - What kind of consistency, transactions, latency throughput do we get for a particular approach? Will we lose data on failure? How much will we lose? When we failover for maintenance or we want to move things, say decommissioning a datacenter, how well do we do that, how well do the techniques support it?
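One way to read the matrix is as a lookup table you filter by requirements. The Python sketch below encodes the rows exactly as given above; the `candidates` helper and the key names are assumptions made for the example. It also shows why no option is all win: asking for low latency and no data loss at once returns nothing.

```python
# The evaluation matrix from the talk, keyed by approach.
MATRIX = {
    "Backups": {"consistency": "Weak", "transactions": "No", "latency": "Low",
                "throughput": "High", "data_loss": "Lots", "failover": "Down"},
    "M/S": {"consistency": "Eventual", "transactions": "Full", "latency": "Low",
            "throughput": "High", "data_loss": "Some", "failover": "Read-only"},
    "MM": {"consistency": "Eventual", "transactions": "Local", "latency": "Low",
           "throughput": "High", "data_loss": "Some", "failover": "Read/Write"},
    "2PC": {"consistency": "Strong", "transactions": "Full", "latency": "High",
            "throughput": "Low", "data_loss": "None", "failover": "Read/Write"},
    "Paxos": {"consistency": "Strong", "transactions": "Full", "latency": "High",
              "throughput": "Medium", "data_loss": "None", "failover": "Read/Write"},
}


def candidates(**requirements):
    """Return the approaches whose properties match every requirement."""
    return [
        name
        for name, props in MATRIX.items()
        if all(props[key] == wanted for key, wanted in requirements.items())
    ]


print(candidates(consistency="Strong"))
print(candidates(latency="Low", data_loss="None"))
```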

  • Backups - Make a copy of your data that's secret and safe. Generally weak consistency. Usually no transactions. Used for the first internal datastore launch. Not good enough for a production system. Lose data since last backup. You are down while restoring a backup to another datacenter.

  • Master/Slave Replication - Writes to a master are also written to one or more slaves.
    - Replication is asynchronous so good for latency and throughput.
    - Weak/eventual consistency unless you are very careful.
    - You have multiple copies in the datacenters, so you'll lose a little data on failure, but not much. Failover can go read-only until the master has been moved to another datacenter.
    - Datastore currently uses this mechanism. True multihoming adds latency because you have to add the extra hop between datacenters. App Engine is already slow on writes so this extra hit would be painful. M/S gives you most of the benefits of better forms while still offering lower latency writes.
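The window of data loss in asynchronous master/slave replication can be sketched in a few lines of Python. This is a toy model, not how the App Engine datastore is built: the `Master` and `Slave` classes and the in-memory backlog are assumptions made for the example. Writes are acknowledged before they reach the slave, so anything still in the backlog when the master fails is lost on failover.

```python
class Slave:
    def __init__(self):
        self.data = {}


class Master:
    def __init__(self, slave):
        self.data = {}
        self.slave = slave
        self.backlog = []   # writes acknowledged but not yet shipped

    def write(self, key, value):
        # The write returns immediately, which is why latency stays low.
        self.data[key] = value
        self.backlog.append((key, value))

    def replicate(self):
        # Runs some time later, in the background in a real system.
        while self.backlog:
            key, value = self.backlog.pop(0)
            self.slave.data[key] = value


slave = Slave()
master = Master(slave)

master.write("a", 1)
master.replicate()       # "a" reaches the slave
master.write("b", 2)     # the master fails before this write is shipped

# Failover: the slave is now the only copy we have.
lost = set(master.data) - set(slave.data)
print(slave.data, lost)
```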

  • Multi-Master Replication - support writes from multiple datacenters simultaneously.
    - You figure out how to merge all the writes later when there's a conflict. It's like asynchronous replication, but you are serving writes from multiple locations.
    - Best you can do is Eventual Consistency. Writes don't immediately go everywhere. This is a paradigm shift. With backups and M/S we assumed a strongly consistent system and that the technique doesn't change how the system behaves; they are just techniques to help us multihome. Here it literally changes how the system runs because the multiple writes must be merged.
    - To do the merging you must find a way to serialize, that is, impose an ordering on all your writes. There is no global clock. Things happen in parallel. You can't ever know what happens first. So you make it up using timestamps, local timestamps + skew, local version numbers, or a distributed consensus protocol. This is the magic and there are a number of ways to do it.
    - There's no way to do a global transaction. With multiple simultaneous writes you can't guarantee transactions. So you have to figure out what to do afterward.
    - AppEngine wants strong consistency to make building applications easier, so they didn't consider this option.
    - Failover is easy because each datacenter can handle writes.
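As one concrete instance of "making up" an ordering, here is a last-writer-wins merge in Python. It is only one of the options listed above (timestamps), and the record layout of `(timestamp, datacenter_id, value)` and the `merge` function are assumptions made for the example. Ties on the timestamp are broken by datacenter id so that every replica picks the same winner regardless of merge order.

```python
def merge(replica_a, replica_b):
    """Last-writer-wins merge of two replicas.

    Each replica maps key -> (timestamp, datacenter_id, value). The pair
    (timestamp, datacenter_id) is the made-up global order for writes.
    """
    merged = dict(replica_a)
    for key, candidate in replica_b.items():
        current = merged.get(key)
        if current is None or candidate[:2] > current[:2]:
            merged[key] = candidate
    return merged


# Both datacenters accepted writes to the same keys while disconnected.
east = {"profile": (10, "east", "Alice"), "color": (12, "east", "red")}
west = {"profile": (11, "west", "Alicia"), "color": (12, "west", "blue")}

merged = merge(east, west)
print(merged)
```

Note the cost: the losing writes ("Alice", "red") are silently discarded, which is why this approach cannot offer global transactions.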

  • Two Phase Commit (2PC) - protocol for setting up transactions between distributed systems.
    - Semi-distributed because there's always a master coordinator for a given 2PC transaction. Because there are so few datacenters you tend to go through the same set of master coordinators.
    - It's synchronous. All transactions are serialized through that master which kills your throughput and increases latency.
    - They never seriously considered this option because write throughput is very important to them. No single point of failure or serialization point would work for them. Latency is high because of the extra coordination. Writes can be in the 200msec area.
    - This option does work though. You write to all datacenters or nothing. You get strong consistency and transactions.
    - Need N+1 datacenters. If you take one down then you still have N to handle your load.
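
The "all datacenters or nothing" behaviour can be sketched with a toy coordinator. This is a simplified illustration that ignores timeouts, crashes, and durable logging; `Participant` and `two_phase_commit` are hypothetical names, and the `can_commit` flag is an assumption used to simulate a datacenter voting no.

```python
class Participant:
    """Hypothetical datacenter taking part in a transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # simulates a "no" vote when False
        self.staged = None
        self.data = {}

    def prepare(self, key, value):
        """Phase 1: stage the write and vote yes, or vote no."""
        if not self.can_commit:
            return False
        self.staged = (key, value)
        return True

    def commit(self):
        """Phase 2: make the staged write visible."""
        key, value = self.staged
        self.data[key] = value
        self.staged = None

    def abort(self):
        """Phase 2: discard any staged write."""
        self.staged = None


def two_phase_commit(participants, key, value):
    """Coordinator: commit everywhere only if every participant votes yes."""
    votes = [p.prepare(key, value) for p in participants]
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False


healthy = [Participant("us"), Participant("eu")]
committed = two_phase_commit(healthy, "k", "v")

flaky = [Participant("us"), Participant("asia", can_commit=False)]
aborted = two_phase_commit(flaky, "k", "v")
```

Every write passes through the single coordinator function, which is the serialization point the notes describe.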

  • Paxos - A consensus protocol where a group of independent nodes reach a majority consensus on a decision.
    - Protocol: there's a propose step and then an agree step. You only need a majority of nodes to agree that something is persisted for it to be considered persisted.
    - Unlike 2PC it is fully distributed. There's no single master coordinator.
    - Multiple transactions can be run in parallel. There's less serialization.
    - Writes are high latency because of the two extra round trips of coordination required by the protocol.
    - Wanted to do this, but they didn't want to pay the 150msec latency hit to writes, especially when competing against 5msec writes for RDBMSes.
    - They tried using physically close datacenters but the built-in multi-datacenter overhead (routers, etc) was too high. Even in the same datacenter it was too slow.
    - Paxos is still used a ton within Google. Especially for lock servers. For coordinating anything they do across datacenters. Especially when state is moved between datacenters. If your app is serving data in one datacenter and it should be moved to another that coordination is done through Paxos. It's used also in managing memcache and offline processing.
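
The propose and agree steps can be sketched as single-decree Paxos over in-memory acceptors. This is a teaching sketch under strong assumptions (synchronous calls, no message loss, no retries), not Google's implementation. The names `Acceptor`, `propose`, and the `up` flag used to simulate an unreachable node are hypothetical.

```python
class Acceptor:
    """One node voting on a single decision."""

    def __init__(self):
        self.up = True        # set to False to simulate an unreachable node
        self.promised = -1    # highest proposal number promised so far
        self.accepted = None  # (number, value) of the last accepted proposal

    def prepare(self, n):
        """Propose step: promise to ignore proposals numbered below n."""
        if self.up and n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n, value):
        """Agree step: accept unless a higher-numbered promise was made."""
        if self.up and n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors, n, value):
    """Run both steps; return the chosen value, or None without a majority."""
    majority = len(acceptors) // 2 + 1
    replies = [a.prepare(n) for a in acceptors]
    promises = [accepted for ok, accepted in replies if ok]
    if len(promises) < majority:
        return None
    # If any acceptor already accepted a value, the proposer must adopt
    # the one with the highest proposal number instead of its own.
    prior = [accepted for accepted in promises if accepted is not None]
    if prior:
        value = max(prior, key=lambda accepted: accepted[0])[1]
    votes = sum(a.accept(n, value) for a in acceptors)
    return value if votes >= majority else None


nodes = [Acceptor() for _ in range(3)]
nodes[2].up = False
first = propose(nodes, 1, "dc-east")   # two of three nodes is a majority
nodes[2].up = True
second = propose(nodes, 2, "dc-west")  # must adopt the already chosen value
stale = propose(nodes, 1, "dc-north")  # number too low, so it is rejected
```

The two passes over the acceptors in `propose` correspond to the two coordination round trips that make writes expensive.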


  • Entity Groups are the unit of consistency in AppEngine. Operations are serialized on Entity Groups. The log for each commit to an entity group is replicated. This maintains consistency and provides transactions. Entity Groups are essentially shards. Sharding enables scaling because it allows you to handle a lot of writes. Datastore shards in entity group size chunks. BuddyPoke has 40 million users, each of which has an entity group. That's 40 million different shards.
  • Eating your own dog food is a strategy used a lot at Google. Iterate and make people use new features internally. They use a ton of stuff that's very early. You can iterate many, many times, which improves a feature before you are ready to launch.
  • They see relational databases in the datacenter as their competition as much as Azure and SimpleDB. Inserts into RDBMS are in low milliseconds. Writes into AppEngine are 30-40 msecs. Reads are fast. They like this trade-off because on the web reads vastly outnumber writes.
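
The idea that operations are serialized per entity group, while separate groups proceed independently, can be sketched as one commit log per group. The optimistic position check used here is an assumption made for illustration and is not taken from the talk; `EntityGroupStore` and its methods are hypothetical names.

```python
from collections import defaultdict


class EntityGroupStore:
    """Keeps one ordered commit log per entity group."""

    def __init__(self):
        self.logs = defaultdict(list)  # group key -> ordered commit log

    def read_position(self, group):
        """Return the log position a transaction starts from."""
        return len(self.logs[group])

    def commit(self, group, expected_position, mutation):
        """Append only if no other commit hit this group since the read."""
        log = self.logs[group]
        if len(log) != expected_position:
            return False  # conflict: the caller should re-read and retry
        log.append(mutation)
        return True


store = EntityGroupStore()
position = store.read_position("user:1")

first_ok = store.commit("user:1", position, "set mood=happy")
second_ok = store.commit("user:1", position, "set mood=sad")  # stale position
other_ok = store.commit(
    "user:2", store.read_position("user:2"), "set mood=calm"
)  # a different group is unaffected
```

Because contention only exists within a group, a large number of small groups (one per user in the BuddyPoke example) spreads writes across many independent logs.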


    A few things I wondered about during the talk. Did they ever consider a distributed MVCC approach? That might be interesting and wasn't addressed as an option. Clearly at Google scale an in-memory data grid isn't yet appropriate.

    A preference for the strong consistency model was repeatedly specified as a major design goal because this makes the job of the programmer easier. A counter to this is that the programming model for Google App Engine is already very difficult. The limits and the lack of traditional relational database programming features put a lot of responsibility back on the programmer to write a scalable app. I wonder if giving up strong consistency would have been such a big deal in comparison?

    I really appreciated the evaluation matrix and the discussion of why Google App Engine made the choices they did. Since writes are already slow on Google App Engine they didn't have a lot of headroom to absorb more layers of coordination. These are the kinds of things developers talk about in design meetings, but they usually don't make it outside the cubicle or conference room walls. I can just hear Ryan, with voice raised, saying "Why can't we have it all!" But it never seems we can have everything. Many thanks to Ryan for sharing.

    Related Articles

  • Slides for the Talk
  • ZooKeeper - A Reliable, Scalable Distributed Coordination System
  • Yahoo!'s PNUTS Database: Too Hot, Too Cold or Just Right?
  • Paper: Consensus Protocols: Paxos by Henry Robinson
  • Paper: Consensus Protocols: Two-Phase Commit by Henry Robinson
  • Paper: Dynamo: Amazon’s Highly Available Key-value Store
  • Are Cloud Based Memory Architectures the Next Big Thing?

    VMware to bridge a DMZ.  

    Hey guys,

    There is a renewed push at my organization to deploy vmware...everywhere.

    I am rather excited as I know we have a lot of waste when it comes to resources.

    What has pricked my ears up however, is the notion of using this technology in our very busy public facing DMZ's.

    Today we get lots of spikes of traffic and we are coping very well. 40x HP blades, apache/php/perl/tomcat/ all in HA behind HA F5's and HA Checkpoint FW's. (20 servers in 2 datacentres).

    The idea is, we virtualise these machines, including the firewalls, onto VMware host clusters that span the public interface to our internal networks. This goes against the #1 rule I have ever lived by while working on the inet. No airgaps from the unknown to the known!

    I am interested in feedback on this scenario.

    From a resource perspective, our resource requirements in the DMZ will be lowered over time due to business change and we still have a lot of head room in our capacity.

    Do you think this is change for change's sake? All I can see is more complexity, higher risk and more skill required to manage what today is a very simple and resilient setup with no security flaws.

    VMware and some big name companies/gov agencies stand by the notion that the software dividing the host machine is more than capable of keeping the DMZ's in check. It just doesn't sit well with me, knowing we may have a public facing website on the same host machine which is running a critical safety or customer management tool.

    Apart from the ease of management to grow/shrink (something we don't need to do in any rush), what advantages justify the increased risk and complexity?

    Are any of you in the same position?

    Cost-wise, our website costs are minuscule compared to the revenue we generate through them. Would you risk what is a sound and stable environment because it sounds cool to 'virtualise', or is there something I am missing?

    Kind regards,

    ps. I don't post much on here but I love reading your articles. The website I am referring to in my post hits a peak of $250/second and is responsible for 90% of revenue to the business.


    Dependency Injection and AOP frameworks for .NET 

    We're looking to implement a framework to do Dependency Injection and AOP for a new solution we're working on. It will likely get hit pretty hard, so we'd like to choose a framework that's proven to scale well and operates well under pressure.

    Right now, we're looking closely at Spring.NET, Castle Project's Windsor framework, and Unity. Does anyone have any feedback on implementing any of these in large, high traffic environments?


    Hardware Architecture Example (geographical level mapping of servers)

    I have put down my thoughts on the architecture discussed in the blog. Although I have done substantial research to understand how things should work before deciding on this architecture, I will need a great deal of input from everyone to reach an architecture decision. The hardware entities considered in the design are:
    1. Master Web Server which will map different users to web servers placed in different geographical locations. (will prefer storing a mapping table in RAM)
    2. Web Servers
    3. Application Servers
    4. Master Database Servers (to implement entity wise look up sharding)
    5. Slave Database Servers.

    I would really appreciate some good input on using Cloud Computing, and on how to go about it instead of or in addition to the given architecture. In fact, I would like to know people's views on when to decide to use cloud computing techniques. Looking forward to input from the community.
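
For discussion, the entity-wise lookup sharding in item 4 (and the in-RAM mapping table in item 1) could be sketched roughly as below. This is a minimal sketch under assumptions, not a recommendation: `ShardDirectory`, the shard names, and the least-loaded placement rule are all hypothetical.

```python
class ShardDirectory:
    """In-memory lookup table mapping each entity to a database shard."""

    def __init__(self, shards):
        self.shards = list(shards)
        self.table = {}  # entity id -> shard name, kept in RAM
        self.load = {shard: 0 for shard in self.shards}

    def lookup(self, entity_id):
        """Return the shard for an entity, placing new entities as needed."""
        if entity_id not in self.table:
            # Assumed placement rule: put new entities on the least-loaded
            # shard; ties go to the shard listed first.
            shard = min(self.shards, key=lambda s: self.load[s])
            self.table[entity_id] = shard
            self.load[shard] += 1
        return self.table[entity_id]


directory = ShardDirectory(["db-a", "db-b"])
first = directory.lookup("user:1")
second = directory.lookup("user:2")
again = directory.lookup("user:1")  # existing entities keep their shard
```

Unlike hash-based sharding, a lookup table lets an entity be moved between shards by updating one row, at the cost of the directory itself needing to be replicated and kept available.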