NoSQL: If Only It Was That Easy

In this blog post BJ Clark overviews the available key-value stores that aren't SQL based.
On some projects he yields interesting experiences. Make sure to check out the comments as well.

The post has spread throughout social tiny text services so there is a chance you've already read it, but still worthwhile to put here.


1dbase vs. many and cloud hosting vs. dedicated server(s)?

Me and my partner are making a blueprint for an online webshop service. The purpose of this project is to make webshops available for small company's/ individuals automatically just by creating an account with us. Our webapp can be used to add products/pages/... to the store and we'll handle secure checkout by paypal.

Our app should be scalable and manageable. Because we also want to offer free webshops, the amount of webshops could be +10.000 within a few years. We are building on the Zend framework and are using mysql for database.

From the start we want to build our application for optimal and easy scalability in the future, to avoid a lot changes to our app/database in the future.

Now our questions are:

Should we use?:
* one database for all shops (or limited to X shops );
* one database for each new shop (each having products, orders... tables);

I think both approaches have PRO/CONS. What do you think ? Does anyone has experience with this kind of structure ?

one database: easier to make changes to database layout
multiple databases:more scalable, easier to backup/restore

one database: harder to code because of extra keyfields in tables , slower or more difficult to backup/restore.
multiple databases: Takes longer to push changes to database layout

We are totally not clear on what hosting we should use. Would it be a solution to use a cloud service such as mosso/amazon/gogrid? Or is it better to just start with one dedicated server and expand 'manually' later?

Thanks in advance for your help!


Yahoo!'s PNUTS Database: Too Hot, Too Cold or Just Right?

So far every massively scalable database is a bundle of compromises. For some the weak guarantees of Amazon's eventual consistency model are too cold. For many the strong guarantees of standard RDBMS distributed transactions are too hot. Google App Engine tries to get it just right with entity groups. Yahoo! is also trying to get is just right by offering per-record timeline consistency, which hopes to serve up a heaping bowl of rich database functionality and low latency at massive scale:

We describe PNUTS [Platform for Nimble Universal Table Storage], a massively parallel and geographically distributed database system for Yahoo!’s web applications. PNUTS provides data storage organized as hashed or ordered tables, low latency for large numbers of con-current requests including updates and queries, and novel per-record consistency guarantees. It is a hosted, centrally managed, and geographically distributed service, and utilizes automated load-balancing and failover to reduce operational complexity. The first version of the system is currently serving in production. We describe the motivation for PNUTS and the design and implementation of its table storage and replication layers, and then present experimental results.

Some of the cool things about PNUTS are:

  • They actually talk about the hard problem of how to scale a system to 10 different data centers (each with 1,000+ servers) while supporting secondary indexes, materialized views, the ability to create multiple tables, and hash-organized tables. Multi-datacenter operation is so difficult it's usually ignored. PNUTS is designed specifically to operate in many datacenters with a strongish consistency model, which makes it a very interesting design point.
  • You can subscribe to a reliable ordered stream of updates on a table. This is massively convenient. For many applications numerous processes are tied to data changes and this is normally a pain to implement.
  • The consistency model is a per-record timeline consistency: all replicas of a given record apply all updates to the record in the same order. This provides a consistency model that is between the two extremes of serialized transactions and eventual consistency. Conflicting records can't exist at the same time as is allowed by Dynamo.
  • Supports records, but accepts queries only on individual tables. There's no fixed schema for records and columns can be typed or be blobs. Transactions exist only at the record level. This means like with other NoSQL databases denormalization is the modeling strategy as there's no way to have transactions across tables.
  • The degree of read consistency desired can be specified. Records are versioned. You can ask for the latest record version or allow for potentially stale records.
  • Asynchronous replication is used to ensure low write latency while providing geographic replication. Reads should be fast everywhere and may return older versions, writes should be fast locally (in the same datacenter). [3]
  • A message broker that serves both as the replication mechanism and redo log of the database. The message broker guarantees no replica can receive updates out of order because it provided a reliable, totally ordered message channel. This also means you can't have transactions across tables, consistent joins, or foreign keys. They chose this approach over a gossip mechanism (like Dynamo) because it can be optimized for geographically distant replicas and because replicas do not need to know the location of other replicas.
  • PNUTS is hosted and centrally managed. It was built to reduce the overhead of creating and maintaining new applications rather than every property creating their own system. Failover, adding capacity, resharding, performance isolation and supporting applications with different usage profiles are all completely automated.
  • Predicate queries are supported using a scatter-gather mechanism which sends the query to every relevant storage tablet at once. The are gathered sent back to the client.
  • Performance is 1-10ms/request when caching layers are in place.

    From a system perspective PNUTS offers a lot of the good things: hosted, reliability, lowish latency, automation, scalability, supports many application models, and there's a lot of room to improvement that all applications will be able to take advantage of when available.

    From a programmer perspective the are also a lot of good things: it's hosted so fewer worries, notifications, flexible schemas, ordered records, secondary indexes, lowish latency, strong consistency on a single record, scalability, high write rates, reliability, and range queries over a small set of records.

    Unfortunately Goldilocks still needs to keep searching for just right, though she may be getting closer. From a system perspective Yahoo!'s ideas are good, but they don't help you as the system isn't available for you to use. From a programmer perspective the programmer's job is still way too hard. To be just right programmer's need low latency aggregate operators, complex transactions, scalable counters, automatic relationship management, and all the other features that will help them just buy instant porridge and be done with it.

    Related Articles

  • Anti-RDBMS: A list of distributed key-value stores
  • Details on Yahoo's distributed database by Greg Linden
  • Thoughts on Yahoo's PNUTS distributed database by Marton Trencseni
  • Data Challenges at Yahoo! - Ricardo Baeza-Yates & Raghu Ramakrishnan
    Yahoo! Research
  • PNUTS: Yahoo!'s Hosted Data Serving Platform by Lucian
  • Yahoo’s PNUTS by Henry Robinson. A very thoughtful and informative overview of the paper.
  • How robust are gossip-based communication protocols? by Lorenzo Alvisi et al.
  • Trading consistency for scalability in distributed architectures.
  • Asynchronous View Maintenance for VLSD Databases by Parag Agrawal et al.
  • BigTable
  • Dynamo
  • Are Cloud Based Memory Architectures the Next Big Thing?
  • The Story of Goldilocks and the Three Bears
  • PNUTS - Platform for Nimble Universal Table Storage
  • Friday

    Strategy: Break Up the Memcache Dog Pile 

    Update: Asynchronous HTTP cache validations. A proposed HTTP caching extension: if your application can afford to show slightly out of date content, then stale-while-revalidate can guarantee that the user will always be served directly from the cache, hence guaranteeing a consistent response-time user-experience.

    Caching is like aspirin for headaches. Head hurts: pop a 'sprin. Slow site: add caching. Facebook must have a lot of headaches because they popped 805 memcached servers between 10,000 web servers and 1,800 MySQL servers and they reportedly have a 99% cache hit rate. But what's the best way for you to cache for your application? It's a remarkably complex and rich topic. Alexey Kovyrin talks about one common caching problem called the Dog Pile Effect in Dog-pile Effect and How to Avoid it with Ruby on Rails. Glenn Franxman also has a Django solution in MintCache.

    Data is usually cached because it's too expensive to calculate for every hit. Maybe it's a gnarly SQL query you want to avoid and a little stale data is OK. Or maybe the amount of data you have is simply larger than physical memory on any one machine. Or maybe you have the temerity to write to your database and cause its cache to flush so database caching isn't sufficient at a certain level of scale.

    Typical examples are for caching article vote counts, comment threads, and event streams. One familiar example that bit me hard is displaying the the top N blog articles. Do you want to scan through your entire access log table for every page display? Absolutely not. Especially when the nightly backups are going on and the network is very slow. Not good :-) Yet you still want to update the results every X minutes so the stats stay fresh.

    Data freshness requires a refrigeration truck or an expiry time on your cache entry that causes stats to be periodically recalculated. Now, what happens when your cached data expires and a 1000 requests simultaneously try to recalculate the expensive to calculate data? Database load spikes and the world nearly ends. And since memcached operations are not atomic it's possible stale data could be cached and you'll serve stale data. Which kind of defeats of the purpose of taking load off the data while providing accurate data. So, how do you unpile the dogs?

    No Expire Solution

    If cache items never expire then there can never be a recalculation storm. Then how do you update the data? Use cron to periodically run the calculation and populate the cache. Take the responsibility for cache maintenance out of the application space. This approach can also be used to pre-warm the the cache so a newly brought up system doesn't peg the database.

    The problem is the solution doesn't always work. Memcached can still evict your cache item when it starts running out of memory. It uses a LRU (least recently used) policy so your cache item may not be around when a program needs it which means it will have to go without, use a local cache, or recalculate. And if we recalculate we still have the same piling on issues.

    This approach also doesn't work well for item specific caching. It works for globally calculated items like top N posts, but it doesn't really make sense to periodically cache items for user data when the user isn't even active. I suppose you could keep an active list to get around this limitation though.

    Stale Date Solution

    This solution introduces a stale date in addition to the expiration date. Glen describes it as:

    The first client to request data past the stale date is asked to refresh the data,
    while subsequent requests are given the stale but not-yet-expired data as if it
    were fresh, with the understanding that it will get refreshed in a 'reasonable'
    amount of time by that initial request

    In the memcached FAQ a one key approach is described:

  • Set the cache item expire time way out in the future.
  • Embed the "real" timeout serialized with the value. For example, set the item to timeout in 24 hours, but the embedded timeout might be five minutes in the future.
  • On a get from the cache determine if the stale timeout expired and on expiry immediately set a time in the future and re-store the data as is. This closes down the window of risk.
  • Fetch data from the DB and update the cache with the latest value.

    Alexey describes a different two key approach:
  • Create two keys in memcached: MAIN key with expiration time a bit higher than normal + a STALE key which expires earlier.
  • On a get read STALE key too. If the stale has expired, re-calculate and set the stale key again.

    I dislike embedding meta data with data so I like Alexey's approach a bit better, even though it doubles the key space.

    None of these options prevent the problem for ever happening, but they do greatly reduce the failure window for relatively little cost.

    Related Articles

  • Memcached Tag at High Scalability
  • Caching Makes Your Brain Explode by Craig Ambrose.
  • The Secret to Memcached by Tobias Lütke.
  • Memcached FAQ.
  • Dog-pile Effect and How to Avoid it with Ruby on Rails memcache-client Patch by Alexey Kovyrin.
  • MintCache by Glenn Franxman.
  • Advanced Rails Caching.. on the Edge by Aaron Batalion.
  • Friday

    The Canonical Cloud Architecture 

    Update 2: Elastic Load Balancer and EC2 instance bandwidth. It turns out we are limited by bandwidth and not by CPU. Solution: use DNS Round Robin for two to three HighCPU medium instances.
    Update: The Skinny Straw: Cloud Computing's Bottleneck and How to Address It. For cloud computing, bandwidth to and from the cloud provider is a bottleneck. Solution: Evaluate application architecture and consider application partitioning.

    I'm writing this post as a sort of penance. My sin was getting involved in another mutli-threaded mess of a program that was rife with strange pauses and unexpected errors. I really should have known better. But when APIs choose to make callbacks from some mystery thread pool it's hard to keep things straight. I eventually sobered up and posted all events to a queue so I could make sure the program would work correctly. Doh. I may never know why the .Net console output stopped working, but I'll live with it.

    And that reminded me that I've been meaning to write a post on the standard Cloud Architecture. I've tried to hit all the common architectures at one time or another, but there have been some excellent sources lately on structuring programs in a cloud that people may "know" in the same way I knew what not to do, but when the code hits the editor those thoughts may have hidden like a kid next to a broken cookie jar.

    The easiest way to create a scalable service is to compose the service from other scalable services. This is how Google AppEngine works and is largely how AWS works as well (EC2, S3, SQS, SimpleDB, etc), though AWS also functions as a blank canvas on which you can draw your own designs.

    The canonical cloud architecture that has evolved revolves around dynamically scalable CPUs consuming asynchronous, persistently queued events. We talked about this idea already in Flickr - Do the Essential Work Up-front and Queue the Rest. The cloud is just another way of implementing the same idea.

    Amazon suggests a few applications of the Cloud Architecture as:

  • Processing Pipelines
    - Document processing pipelines – convert hundreds of thousands of documents from Microsoft Word to PDF, OCR millions of pages/images into raw searchable text
    - Image processing pipelines – create thumbnails or low resolution variants of an image, resize millions of images
    - Video transcoding pipelines – transcode AVI to MPEG movies
    - Indexing – create an index of web crawl data
    - Data mining – perform search over millions of records
  • Batch Processing Systems
    - Back-office applications (in financial, insurance or retail sectors)
    - Log analysis – analyze and generate daily/weekly reports
    - Nightly builds – perform nightly automated builds of source code repository every night in parallel
    - Automated Unit Testing and Deployment Testing – Test and deploy and perform automated unit testing (functional, load, quality) on different deployment configurations every night
  • Websites
    - Websites that “sleep” at night and auto-scale during the day
    - Instant Websites – websites for conferences or events (Super Bowl, sports tournaments)
    - Promotion websites
    - “Seasonal Websites” - websites that only run during the tax season or the holiday season (“Black Friday” or Christmas)

    A good list, but after having worked on a seasonal website for taxes AWS is a horrible match. AWS only works on the instance level, so you need a whole instance turned on all the time even when there's no demand. This is a complete waste of money. An AWS model truly based on use combined with an SLA driven dashboard would be very convenient. But on to cases where AWS is a good fit.

    SmugMug's Cloud Architecture

    AWS pioneer Don MacAskill of SmugMug details how they process high-resolution photos and high-definition video use a cloud hosted queuing architecture in SkyNet Lives! (aka EC2 @ SmugMug).

    SkyNet, as you might expect, operates completely without human minders and automatically scales up and down in relation to the work load. Their system has several components:
  • Work Initiators - Work comes in from your website and/or other software subsystems and is queued up for processing in the Queue Service. Work doesn't have to be large requests either. Work can be small independent parts of an overall pipeline. Don't keep state in the Workers. Bundle what you need done into a work request in shoot back into the Queuing Service for processing.
  • Provisioning Service - This is Amazon's infrastructure that allows instances to be automatically scaled up and down in relation to the work load. This will be the major difference between your VPS or typical datacenter setup. There's an API for starting and stopping AMIs and
    mechanisms for automatically configuring and running VMs.
  • Workers - These are the guys that continually pull work off queues and do something interesting with it. For SmugMug the results are stored on S3 but the results could be put in your own database, SimpleDB or whatever.
  • Queuing Service - This is where work is queued for consumption by the workers. SmugMug built their own queuing service, but you could just as easily use Amazon's own SQS. Creating a scalable, distributed, performant, highly available queue service is not easy, so you may want to take a look at a number of different queue product suggestions in Flickr - Do the Essential Work Up-front and Queue the Rest.
  • Controller - This component monitors many variables related to the work flow and decides how many instances of EC2 are necessary based on optimizing a small set of goals. Instances are add and removed as needed.

    Don shares a lot of practical detailia on how to efficiently use AWS, how their queue service works, and how their controller manages to balances minimizing cost while still being responsive to users. Achieving fairness and balance in a queue system can be difficult, but SmugMug appears to have done a good job of that.

    What rocks about queuing architectures is that they are just so damn robust. Work is safe in the queues. A random reboot won't cause a loss. If one component is producing events too fast the queue will buffer up events until they can be processed. New components can be cleanly added and removed from the system at any time. Timing isn't critical. Work is processed when someone gets around to it. Timeouts and retries are unnecessary. Programs are simple loops that block on the queue, do something, persist results, and feed back more parallelizable work requests back into the queue. Very hard to screw up. Compare and contrast to complex multi-threaded system with shared-state.

    Building GrepTheWeb in the Cloud

    Amazon has published a great couple of articles on building a canonical Cloud Architecture: Building GrepTheWeb in the Cloud, Part 1: Cloud Architectures and Building GrepTheWeb in the Cloud, Part 2: Best Practices.

    These are really tight and well written articles so I'll just hit certain high points. The example used is an application called GrepTheWeb. GrepTheWeb searches using a regular expression across millions of web documents. So it's a grep for the web, ah got it now. The idea is to take an unpredictable but possibly large number of search requests, apply the search expression to hundreds of terabytes of documents, and return the results in a reasonable period of time.

    How exactly would you do such a thing? Here's how you do it in the cloud:
  • Amazon S3 for retrieving input datasets and for storing the output dataset
  • Amazon SQS for durably buffering requests acting as a “glue” between controllers
  • Amazon SimpleDB for storing intermediate status, log, and for user data about tasks
  • Amazon EC2 for running a large distributed processing Hadoop cluster on-demand
  • Hadoop for distributed processing, automatic parallelization, and job scheduling

    Clearly these are all (except for Hadoop) built on Amazon services, but the general ideas apply anywhere. For storing large amounts of data and accessing it efficiently in parallel you need a distributed file system like S3. To coordinate and dispatch work you need a queuing service like SQS. For keeping intermediate state you need a scalable database store like SimpleDB, though you could also imagine using S3. For dynamically scaling processing nodes something like EC2 is necessary. And for actually carrying out the document search a framework like Hadoop provides a lot of features, though you can imagine using other compute grid products.

    Here's their fabulous picture of what the system looks like:

    All the parts and linkages are described in the paper. What's important to note is that even though there are a lot of independently moving parts all the boundaries are clear and well described. In your typical program few will have any idea how it works. Using Cloud Architecture principles it's possible to create a system which both scales and easy to understand and explain.

    The paper makes several key architectural recommendations:
  • Use Scalable Ingredients - Ensure that your application is scalable by designing each component to be scalable on its own. If every component implements a service interface, responsible for its own scalability in all appropriate dimensions, then the overall system will have a scalable base.
  • Have Loosely Coupled Systems - For better manageability and high-availability, make sure that your components are loosely coupled. The key is to build components without having tight dependencies between each other, so that if one component were to die (fail), sleep (not respond) or remain busy (slow to respond) for some reason, the other components in the system are built so as to continue to work as if no failure is happening.
  • Think Parallel - Implement parallelization for better use of the infrastructure and for performance. Distributing the tasks on multiple machines, multithreading your requests and effective aggregation of results obtained in parallel are some of the techniques that help exploit the infrastructure.
  • Utilize On-Demand Requisition and Relinquishment - After designing the basic functionality, ask the question “What if this fails?” Use techniques and approaches that will ensure resilience. If any component fails (and failures happen all the time), the system should automatically alert, failover, and re-sync back to the “last known state” as if nothing had failed.
  • Use Designs that Are Resilient to Reboot and Re-Launch - Don’t forget the cost factor. The key to building a cost-effective application is using on-demand resources in your design. It’s wasteful to pay for infrastructure that is sitting idle.

    All good stuff which is why I like this paper so much. There's a big conceptual shift here, especially of you are used to relatively simple client-server and N-tier systems. It's like simulating in your mind how to keep an army of ants all working independently while still communicating, coordinating, and making progress on a goal. We implemented similar architecture in datacenters long before the cloud, it was just a lot harder as everything was roll your own. The cloud makes all the necessary components standard, featureful, and relatively inexpensive. This opens any application to completley different ways of structuring their backends than they did in the past.

    Related Articles

  • SkyNet Lives! (aka EC2 @ SmugMug).
  • Flickr - Do the Essential Work Up-front and Queue the Rest
  • Hadoop
  • GridGain: One Compute Grid, Many Data Grids
  • Building GrepTheWeb in the Cloud, Part 1: Cloud Architectures
  • Building GrepTheWeb in the Cloud, Part 2: Best Practices.

  • Thursday

    An Unorthodox Approach to Database Design : The Coming of the Shard

    Update 4: Why you don’t want to shard. by Morgon on the MySQL Performance Blog. Optimize everything else first, and then if performance still isn’t good enough, it’s time to take a very bitter medicine.
    Update 3: Building Scalable Databases: Pros and Cons of Various Database Sharding Schemes by Dare Obasanjo. Excellent discussion of why and when you would choose a sharding architecture, how to shard, and problems with sharding.
    Update 2: Mr. Moore gets to punt on sharding by Alan Rimm-Kaufman of 37signals. Insightful article on design tradeoffs and the evils of premature optimization. With more memory, more CPU, and new tech like SSD, problems can be avoided before more exotic architectures like sharding are needed. Add features not infrastructure. Jeremy Zawodny says he's wrong wrong wrong. we're running multi-core CPUs at slower clock speeds. Moore won't save you.
    Update: Dan Pritchett shares some excellent Sharding Lessons: Size Your Shards, Use Math on Shard Counts, Carefully Consider the Spread, Plan for Exceeding Your Shards

    Once upon a time we scaled databases by buying ever bigger, faster, and more expensive machines. While this arrangement is great for big iron profit margins, it doesn't work so well for the bank accounts of our heroic system builders who need to scale well past what they can afford to spend on giant database servers. In a extraordinary two article series, Dathan Pattishall, explains his motivation for a revolutionary new database architecture--sharding--that he began thinking about even before he worked at Friendster, and fully implemented at Flickr. Flickr now handles more than 1 billion transactions per day, responding in less then a few seconds and can scale linearly at a low cost.

    What is sharding and how has it come to be the answer to large website scaling problems?

    Information Sources

    What is sharding?

    While working at Auction Watch, Dathan got the idea to solve their scaling problems by creating a database server for a group of users and running those servers on cheap Linux boxes. In this scheme the data for User A is stored on one server and the data for User B is stored on another server. It's a federated model. Groups of 500K users are stored together in what are called shards.

    The advantages are:

  • High availability. If one box goes down the others still operate.
  • Faster queries. Smaller amounts of data in each user group mean faster querying.
  • More write bandwidth. With no master database serializing writes you can write in parallel which increases your write throughput. Writing is major bottleneck for many websites.
  • You can do more work. A parallel backend means you can do more work simultaneously. You can handle higher user loads, especially when writing data, because there are parallel paths through your system. You can load balance web servers, which access shards over different network paths, which are processed by separate CPUs, which use separate caches of RAM and separate disk IO paths to process work. Very few bottlenecks limit your work.

    How is sharding different than traditional architectures?

    Sharding is different than traditional database architecture in several important ways:

  • Data are denormalized. Traditionally we normalize data. Data are splayed out into anomaly-less tables and then joined back together again when they need to be used. In sharding the data are denormalized. You store together data that are used together.

    This doesn't mean you don't also segregate data by type. You can keep a user's profile data separate from their comments, blogs, email, media, etc, but the user profile data would be stored and retrieved as a whole. This is a very fast approach. You just get a blob and store a blob. No joins are needed and it can be written with one disk write.

  • Data are parallelized across many physical instances. Historically database servers are scaled up. You buy bigger machines to get more power. With sharding the data are parallelized and you scale by scaling out. Using this approach you can get massively more work done because it can be done in parallel.

  • Data are kept small. The larger a set of data a server handles the harder it is to cash intelligently because you have such a wide diversity of data being accessed. You need huge gobs of RAM that may not even be enough to cache the data when you need it. By isolating data into smaller shards the data you are accessing is more likely to stay in cache.

    Smaller sets of data are also easier to backup, restore, and manage.

  • Data are more highly available. Since the shards are independent a failure in one doesn't cause a failure in another. And if you make each shard operate at 50% capacity it's much easier to upgrade a shard in place. Keeping multiple data copies within a shard also helps with redundancy and making the data more parallelized so more work can be done on the data. You can also setup a shard to have a master-slave or dual master relationship within the shard to avoid a single point of failure within the shard. If one server goes down the other can take over.

  • It doesn't use replication. Replicating data from a master server to slave servers is a traditional approach to scaling. Data is written to a master server and then replicated to one or more slave servers. At that point read operations can be handled by the slaves, but all writes happen on the master.

    Obviously the master becomes the write bottleneck and a single point of failure. And as load increases the cost of replication increases. Replication costs in CPU, network bandwidth, and disk IO. The slaves fall behind and have stale data. The folks at YouTube had a big problem with replication overhead as they scaled.

    Sharding cleanly and elegantly solves the problems with replication.

    Some Problems With Sharding

    Sharding isn't perfect. It does have a few problems.

  • Rebalancing data. What happens when a shard outgrows your storage and needs to be split? Let's say some user has a particularly large friends list that blows your storage capacity for the shard. You need to move the user to a different shard.

    On some platforms I've worked on this is a killer problem. You had to build out the data center correctly from the start because moving data from shard to shard required a lot of downtime.

    Rebalancing has to be built in from the start. Google's shards automatically rebalance. For this to work data references must go through some sort of naming service so they can be relocated. This is what Flickr does. And your references must be invalidateable so the underlying data can be moved while you are using it.

  • Joining data from multiple shards. To create a complex friends page, or a user profile page, or a thread discussion page, you usually must pull together lots of different data from many different sources. With sharding you can't just issue a query and get back all the data. You have to make individual requests to your data sources, get all the responses, and the build the page. Thankfully, because of caching and fast networks this process is usually fast enough that your page load times can be excellent.

  • How do you partition your data in shards? What data do you put in which shard? Where do comments go? Should all user data really go together, or just their profile data? Should a user's media, IMs, friends lists, etc go somewhere else? Unfortunately there are no easy answer to these questions.

  • Less leverage. People have experience with traditional RDBMS tools so there is a lot of help out there. You have books, experts, tool chains, and discussion forums when something goes wrong or you are wondering how to implement a new feature. Eclipse won't have a shard view and you won't find any automated backup and restore programs for your shard. With sharding you are on your own.

  • Implementing shards is not well supported. Sharding is currently mostly a roll your own approach. LiveJournal makes their tool chain available. Hibernate has a library under development. MySQL has added support for partioning. But in general it's still something you must implement yourself.

    See Also

  • The Flickr Architecture for more interesting ideas on how to implement sharding.
  • The Google Arhitecture.
  • The LiveJournal Architecture. They talk quite a bit about their sharding approach and give a lot of helpful details.
  • The Shard category.
  • Wednesday

    Stack Overflow Architecture

    Update 2: Stack Overflow Architecture Update - Now At 95 Million Page Views A Month
    Update: Startup – ASP.NET MVC, Cloud Scale & Deployment shows an interesting alternative approach for a Windows stack using ServerPath/GoGrid for a dedicated database machine, elastic VMs for the front end, and a free load balancer.

    Stack Overflow is a much loved programmer question and answer site written by two guys nobody has ever heard of before. Well, not exactly. The site was created by top programmer and blog stars Jeff Atwood and Joel Spolsky. In that sense Stack Overflow is like a celebrity owned restaurant, only it should be around for a while. Joel estimates 1/3 of all the programmers in the world have used the site so they must be serving up something good.

    I fell in deep like with Stack Overflow for purely selfish reasons, it helped me solve a few difficult problems that were jabbing my eyes out with pain. I also appreciate their no-apologies anthropologically based design philosophy. Use design to engineer in the behaviours you want to encourage and minimize the responses you want to discourage. It's the conscious awareness of the mechanisms that creates such a satisfying synergy.

    What is key about the Stack Overflow story for me is the strong case they make for scale up as a viable solution for a certain potentially large class of problems. The publicity these days is all going scale out using NoSQL databases.

    If you need to Google scale then you really have no choice but to go the NoSQL direction. But Stack Overflow is not Google and neither are most sites. When thinking about your design options keep Stack Overflow in mind. In this era of multi-core, large RAM machines and advances in parallel programming techniques, scale up is still a viable strategy and shouldn't be tossed aside just because it's not cool anymore. Maybe someday we'll have the best of both worlds, but for now there's a big painful choice to be made and that choice decides your fate.

    Joel boasts that for 1/10 the hardware they have performance comparable to similarly size sites. He wonders if these other sites have good programmers. Let's see how they did it and you be the judge.


    The Stats

  • 16 million page views a month
  • 3 million unique visitors a month (Facebook reaches 77 million unique visitors a month)
  • 6 million visits a month
  • 86% of traffic comes from Google
  • 9 million active programmers in the world and 30% have used Stack Overflow.
  • Cheaper licensing was attained through Microsoft's BizSpark program. My impression is they pay about $11K for OS and SQL licensing.
  • Monitization strategy: unobtrusive adds, job placement ads, DevDays conferences, extend the software to target other related niches (Server Fault, Super User), develop StackExchange as a white label and self hosted version of Stack Overflow, and perhaps develop some sort of programmer rating system.


  • Microsoft ASP.NET MVC
  • SQL Server 2008
  • C#
  • Visual Studio 2008 Team Suite
  • JQuery
  • LINQ to SQL
  • Subversion
  • Beyond Compare 3
  • VisualSVN 1.5
  • Web Tier
    - 2 x Lenovo ThinkServer RS110 1U
    - 4 cores, 2.83 Ghz, 12 MB L2 cache
    - 500 GB datacenter hard drives, mirrored
    - 8 GB RAM
    - 500 GB RAID 1 mirror array
  • Database Tier
    - 1 x Lenovo ThinkServer RD120 2U
    - 8 cores, 2.5 Ghz, 24 MB L2 cache
    - 48 GB RAM
  • A fourth server was added to run All together the servers also run Stack Overflow, Server Fault, and Super User.
  • QNAP TS-409U NAS for backups. Decided not to use a cloud solution because the bandwidth costs of transferring 5 GB of data per day becomes prohibitive.
  • Hosting at Impressed with their detailed technical responses and reasonable hosting rates.
  • SQL Server's full text search is used extensively for the site search and detecting if a question has already been asked. is considered an attractive alternative.

    Lessons Learned

    This is a mix of lessons taken from Jeff and Joel and comments from their posts.

  • If you’re comfortable managing servers then buy them. The two biggest problems with renting costs were: 1) the insane cost of memory and disk upgrades 2) the fact that they [hosting providers] really couldn’t manage anything.
  • Make larger one time up front investments to avoid recurring monthly costs which are more expensive in the long term.
  • Update all network drivers. Performance went from 2x slower to 2x faster.
  • Upgrading to 48GB RAM required upgrading MS Enterprise edition.
  • Memory is incredibly cheap. Max it out for almost free performance. At Dell, for example, upgrading from 4G memory to 128G is $4378.
  • Stack Overflow copied a key part of the Wikipedia database design. This turned out to be a mistake which will need massive and painful database refactoring to fix. The refactorings will be to avoid excessive joins in a lot of key queries. This is the key lesson from giant multi-terabyte table schemas (like Google’s BigTable) which are completely join-free. This is significant because Stack Overflow's database is almost completely in RAM and the joins still exact too high a cost.
  • CPU speed is surprisingly important to the database server. Going from 1.86 GHz, to 2.5 GHz, to 3.5 GHz CPUs causes an almost linear improvement in typical query times. The exception is queries which don’t fit in memory.
  • When renting hardware nobody pays list price for RAM upgrades unless you are on a month-to-month contract.
  • The bottleneck is the database 90% of the time.
  • At low server volume, the key cost driver is not rackspace, power, bandwidth, servers, or software; it is NETWORKING EQUIPMENT. You need a gigabit network between your DB and Web tiers. Between the cloud and your web server, you need firewall, routing, and VPN devices. The moment you add a second web server, you also need a load balancing appliance. The upfront cost of these devices can easily be 2x the cost of a handful of servers.
  • EC2 is for scaling horizontally, that is you can split up your work across many machines (a good idea if you want to be able to scale). It makes even more sense if you need to be able to scale on demand (add and remove machines as load increases / decreases).
  • Scaling out is only frictionless when you use open source software. Otherwise scaling up means paying less for licenses and a lot more for hardware, while scaling out means paying less for the hardware, and a whole lot more for licenses.
  • RAID-10 is awesome in a heavy read/write database workload.
  • Separate application and database duties so each can scale independently of the other. Databases scale up and the applications scale out.
  • Applications should keep state in the database so they scale horizontally by adding more servers.
  • The problem with a scale up strategy is a lack of redundancy. A cluster ads more reliability, but is very expensive when the individual machines are expensive.
  • Few applications can scale linearly with the number of processors. Locks will be taken which serializes processing and ends up reducing the effectiveness of your Big Iron.
  • With larger form factors like 7U power and cooling become critical issues. Using something between 1U and 7U might be easier to make work in your data center.
  • As you add more and more database servers the SQL Server license costs can be outrageous. So by starting scale up and gradually going scale out with non-open source software you can be in a world of financial hurt.

    It's true there's not much about their architecture here. We know about their machines, their tool chain, and that they use a two-tier architecture where they access the database directly from the web server code. We don't know how they implement tags, etc. If interested you'll be able to glean some of this information from an explanation of their schema.


    As an architecture profile candidate Stack Overflow has earned two important HighScalability badges: the Microsoft Stack Badge and and the Scale Up Badge. Both controversial and interesting topics of discussion.

    Microsoft Stack Badge

    The Microsoft Stack Badge was earned because Stack Overflow uses the entire Microsoft Stack: OS, database, C#, Visual Studio, and ASP .NET. People are always interested in how MS compares to LAMP, but I don't have many case studies to show them.

    Markus Frind of Plenty of Fish fame is often used as a Microsoft stack poster child, but since he explicitly uses as little of the stack as possible he's not really a good example. Stack Overflow on the other hand is brash in proclaiming their love for MS, even when that love is occasionally spurned.

    It's hard to separate out the Microsoft stack and the scale up approach because for licensing reasons they tend to go together. If you find yourself in the position of transitioning from scale up to scale out by adding dozens of cores, MS licensing will bite you.

    Licensing aside I personally find C#, Visual Studio, and .Net a very productive environment. C#/.Net is at least as good as Java/JVM. ASP .NET has always been a confusing mess to me. The knock against SQL Server is you have to pay for it and if that doesn't bother you then it's a solid choice. The Windows OS may not be as solid as other alternatives but it works well enough.

    So for a scale up solution a Microsoft stack works, especially if you are already Windows centric.

    Scale Up Badge

    This won't be a reenactment of the scale out vs scale up vs rent vs buy wars. For a thorough discussion of these issues please take a look at Scaling Up vs. Scaling Out and Server Hosting — Rent vs. Buy?. If you aren't confused and if your head doesn't hurt after reading all that then you haven't properly understood the material :-)

    The Scale Up Badge was awarded because Stack Overflow uses a scale up strategy to meet their scaling requirements. When they reach a limit they scale vertically by buying a bigger machine and adding more memory.

    Stack Overflow is in the sweet spot for scale up. It's not too large, but with an Alexa ranking of 1,666 and 16 million page views a month it's still a substantial site. Not Google scale, and probably will never have to be, but those are numbers many sites would be thrilled to have. Yet they aren't uploading large amounts of media. They aren't dealing with billions of tweets across complex social networks with millions of users. Their number of users is self limiting. And there are still directions they can take if they need to scale (caching, more web servers, faster disks, more denormalization, more memory, some partitioning, etc). All-in-all it's a well done and very useful two-tier CRUD application.

    NoSQL is Hard

    So should Stack Overflow have scaled out instead of up, just in case?

    What some don't realize is NoSQL is hard. Relational databases have many many faults, but they make a lot of common tasks simple while hiding both the cost and complexity. If you want to know how many black Prius cars are in inventory, for example, then that's pretty easy to do.

    Not so with most NoSQL databases (I'll speak generally here, some NoSQL databases have more features than others). You would have program a counter of black Prius cars yourself, up front, in code. There are no aggregate operators. You must maintain secondary indexes. There's no searching. There are no distributed queries across partitions. There's no Group By or Order By. There are no cursors for easy paging through result sets. Returning even 100 large records at time may timeout. There may be quotas that are very restrictive because they must limit the amount of IO for any one operation. Query languages may lack expressive power.

    The biggest problem of all is that transactions can not span arbitrary boundaries. There are no ACID guarantees beyond a single record or small entity group. Once you wrap your head around what this means for the programmer it's not a pleasant prospect at all. References must be manually maintained. Relationships must be manually maintained. There are no cascading deletes that act correctly during a failure. Every copy of denormalized data must be manually tracked and updated taking into account the possibility of partial failures and externally visible inconsistency.

    All this functionality must be written manually by you in your code. While flexibility to write your own code is great in an OLAP/map-reduce situation, declarative approaches still cover a lot of ground and make for much less brittle code.

    What you gain is the ability to write huge quantities of data. What you lose is complacency. The programmer must be very aware at all times that they are dealing with a system where it costs a lot to perform distribute operations and failure can occur at anytime.

    All this may be the price of building a truly scalable and distributed system, but is this really the price you want to pay?

    The Multitenancy Problem

    With StackExchange Stack Overflow has gone into the multi-tenancy business. They are offering StackExchange either self-hosted or as a hosted white label application.

    It will be interesting to see if their architecture can scale to handle a large number of sites. Salesorce is the king of multitenancy and although it's true they use Oracle as their database, they basically use very little of Oracle and have written their own table structure, indexing and query processor on top of Oracle. All in order to support multitenancy.

    Salesforce went extreme because supporting a lot of different customers is way more difficult than it seems, especially once you allow customization and support versioning.

    Clearly all customers can't run in one server for security, customization, and scaling reasons.

    You may think just create a database for each customer, share a server for a certain number of customers, and then add more servers as needed. As long as a customer doesn't need more than one server you are golden.

    This doesn't seem to work well in practice. Oddly database managers aren't optimized for adding or updating databases. Creating databases is a heavyweight operation and can degrade performance for existing customers as system locks are taken. Upgrade issues are also problematic. Adding columns locks tables which causes problems in high traffic situations. Adding new indexes can also take a very long time and degrade performance. Plus each customer will likely have specializations that makes upgrading even more complicated.

    To get around these problems Salesforce's Craig Weissman, Chief Architect, created an innovative approach where tables are not created for each customer. All data from all customers is mapped into the same data table, including indexes. The schema for that table looks something like orgid, oid, value0, value1...value500. "orgid" is the organization ID and is how data is never mixed up. It's a very wide and sparse table, which Oracle seems to handle well. Hundreds and hundreds of "tables" and custom fields are mapped into the data table.

    With this approach Salesforce has no option other than to build their own infrastructure to interpret what's in that table. Oracle is left to handle transactions, concurrency, and deadlock detection. The advatange is because there's an interpreted layer handling versions and upgrades is relatively simple because the handling logic can be baked in. Strange but true.

    Related Articles

    This list includes a number of posts by Jeff as he chronicles their journey with Stack Overflow. Jeff is wonderful about being open about what they are doing and why. The comment threads are often tremendous. There's a lot to learn.

  • Learning from by Joel Spolsky
  • Scaling Up vs. Scaling Out: Hidden Costs by Jeff Atwood
  • What Was Stack Overflow Built With?
  • New Stack Overflow Server Glamour Shots
  • New Stack Overflow Servers Ready
  • Server Hosting — Rent vs. Buy? - this is a very informative discussion the pros and cons of renting vs buying.
  • Rent vs. Buy (or EC2 vs. building your own iron) by Michael Friis
  • Oh, You Wanted "Awesome" Edition - We recently upgraded our database server to 48 GB of memory -- because hardware is cheap, and programmers are expensive.
  • Our Backup Strategy - Inexpensive NAS
  • The Economics of Bandwidth
  • Understanding the StackOverflow Database Schema by Brent Ozar
  • Server Speed Tests - new hardware 2x slower - it was the network.
  • ASP.NET MVC: A New Framework for Building Web Applications
  • Three key things to know about moving MySQL into the cloud by morgan
  • NoSQL Conference
  • Decline of the Enterprise Data Warehouse by Bradford Stephens
  • Webinar: Multitenant Magic - Under the Covers of the Data Architecture by Craig Weissman, Chief Architect,
  • Wednesday

    Anti-RDBMS: A list of distributed key-value stores

    Update 8: Introducing MongoDB by Eliot Horowit

    Update 7: The Future of Scalable Databases by Robin Mathew.

    Update 6: NoSQL : If Only it Was that Easy. BJ Clark lays down the law on which databases are scalable: Tokyo - NO, Redis - NO, Voldemort - YES, MongoDB - Not Yet, Cassandra - Probably, Amazon S3 - YES * 2, MySQL - NO. The real thing to point out is that if you are being held back from making something super awesome because you can’t choose a database, you are doing it wrong.
    Update 5: Exciting stuff happening in Japan at this Key-Value Storage meeting in Tokyo. Presentations on Groonga, Senna, Lux IO, Tokyo-Cabinet, Tx, repcached, Kai, Cagra, kumofs, ROMA, and Flare.

    Update 4: NoSQL and the Relational Model: don’t throw the baby out with the bathwater by Matthew Willson. So my key point is, this kind of modelling is WORTH DOING, regardless of which database tool you end up using for physical storage.
    Update 3: Choosing a non-relational database; why we migrated from MySQL to MongoDB. An illuminating article explaining why Boxed Ice move to MongoDB over MySQL and other NoSQL options: easy install, PHP support, replication and master-master support, good doc, auto sharding on the road map. They still use MySQL for billing.
    Update 2: They are now called NoSQL databases. So keep up! Eric Lai wrote a good article in Computerworld No to SQL? Anti-database movement gains steam about the phenomena. There was even a NoSQL conference. It was unfortunately full by the time I wanted to sign up, but there are presentations by all the major players. Nice Hacker News thread too.
    Update: Some Notes on Distributed Key Stores by Leonard Lin. What's the best way to handle a fast growing system with 100M items that requires low latency and lots of inserts? Leanord takes a trip through several competing systems. The winner was: Tokyo Cabinet.

    Richard Jones has put together a very nice list of various key-value stores around the internets. The list includes: Project Voldemort, Ringo, Scalaris, Kai, Dynomite, MemcacheDB, ThruDB, CouchDB, Cassandra, HBase, and Hypertable. Richard also includes some commentary and their basic components (language, fault tolerance, persistence, client protocol, data model, docs, community).

    There's an excellent discussion in the comments of Paxos vs Vector Clock techniques for synchronizing writes in the face of network failures.


    Building a Data Intensive Web Application with Cloudera, Hadoop, Hive, Pig, and EC2

    This tutorial will show you how to use Amazon EC2 and Cloudera's Distribution for Hadoop to run batch jobs for a data intensive web application.

    During the tutorial, we will perform the following data processing steps.... read more on Cloudera website


    15 Scalability and Performance Best Practices

    These are from Laura Thomson of OmniTi:

    1. Profile early, profile often. Pick a profiling tool and learn it in and out. 
    2. Dev-ops cooperation is essential. The most critical difference in organizations that handles crises well.
    3. Test on production data. Code behavior (especially performance) is often data driven.
    4. Track and trend. Understanding your historical performance characteristics is essential for spotting emerging problems.
    5. Assumptions will burn you. Systems are complex and often break in unexpected ways.
    6. Decouple. Isolate performance failures.
    7. Cache. Caching is the core of most optimizations.
    8. Federate. Data federation is taking a single data set and spreading it across multiple database/application servers.
    9. Replicate. Replication is making synchronized copies of data available in more than one place.
    10. Avoid straining hard-to-scale resources. Some resources are inherently hard to scale: Uncacheable’ data, Data with a very high read+write rate, Non-federatable data, Data in a black-box
    11. Use a compiler cache. A compiler cache sits inside the engine and caches the parsed optrees.
    12. Be mindful of using external data sources. External data (RDBMS, App Server, 3rd Party data feeds) are the number one cause of application bottlenecks.
    13. Avoid recursive or heavy looping code. Deeply recursive code is expensive in PHP.
    14. Don’t Outsmart Yourself . Don’t try to work around perceived inefficiencies in PHP (at least not in userspace code!)
    15. Build with caching in mind. Caching is the most important tool in your tool box.