How can I learn to scale my project?

Todd Hoff's picture

This question was asked on the ycombinator list, and it drew some good responses. I gave a quick response, but I particularly like neilk's insightful, knock-it-out-of-the-park answer:

  • Read Cal Henderson's book. (I'd add in Theo's book and Release It! too)
  • The center of your design should be the data store, not a process. You transition the data store from state to state, securely and reliably, in small increments.
  • Avoid globals and session state. The more "pure" your function is, the easier it will be to cache or partition.
  • Don't make your data store too smart. Calculations and renderings should happen in a separate, asynchronous process.
  • The data store should be able to handle lots of concurrent connections. Minimize locking. (Read about optimistic locking).
  • Protect your algorithm from the implementation of the data store, with a helper class or module or whatever. But don't (DO NOT) try to build a framework for any conceivable query. Just the ones your algorithm needs.
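As a rough illustration of several of these points together (small state transitions, optimistic locking, and a narrow helper in front of the data store), here is a minimal sketch using Python's built-in sqlite3 module. The `OrderStore` class, the `orders` table, and the state names are all hypothetical, invented for this example; they are not from neilk's answer.

```python
import sqlite3


class OrderStore:
    """Hypothetical helper that hides the data store from the algorithm.

    It exposes only the queries this example needs, not a general framework.
    """

    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS orders ("
            "id INTEGER PRIMARY KEY, state TEXT NOT NULL, version INTEGER NOT NULL)"
        )

    def create(self, order_id, state="new"):
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT INTO orders (id, state, version) VALUES (?, ?, 0)",
                (order_id, state),
            )

    def read(self, order_id):
        """Return (state, version), or None if the order does not exist."""
        return self.conn.execute(
            "SELECT state, version FROM orders WHERE id = ?", (order_id,)
        ).fetchone()

    def transition(self, order_id, new_state, expected_version):
        """Apply one small state change without holding a lock.

        The UPDATE only matches if nobody changed the row since we read it.
        Returns False when the caller's version is stale.
        """
        with self.conn:
            cur = self.conn.execute(
                "UPDATE orders SET state = ?, version = version + 1 "
                "WHERE id = ? AND version = ?",
                (new_state, order_id, expected_version),
            )
        return cur.rowcount == 1


if __name__ == "__main__":
    store = OrderStore(sqlite3.connect(":memory:"))
    store.create(1)
    state, version = store.read(1)
    print(store.transition(1, "paid", version))       # first writer wins
    print(store.transition(1, "cancelled", version))  # stale version is rejected
```

A caller whose transition is rejected would typically re-read the row and decide whether the change still makes sense, rather than blocking other writers with a lock.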

    Viewing an application as a series of state transitions instead of a blizzard of actions and events is a vastly underappreciated design perspective. It is one of the key design approaches for making robust embedded systems. A great paper on this sort of thing is Mission Planning and Execution Within the Mission Data System, an effort to make engineering flight software more straightforward and less prone to error through the explicit modeling of spacecraft state. Another interesting paper is CLEaR: Closed Loop Execution and Recovery High-Level Onboard Autonomy for Rover Operations.

    In general I call these Fact Based Architectures. I'm really glad neilk brought it up.

  • Comments

    Re: How can I learn to scale my project?

    Curious: what is Cal Henderson's book? Title? URL?

    Scale is definitely about the data.
    Many developers, however, focus on the code at far too low a level, worrying about whether to use single or double quotes and other trivial minutiae.

    If you trawl through the presentations by the architects of sites like YouTube you find that the challenges of scale are architectural and impact the data store the most - with requirements like shards and partitioning.

    Re: How can I learn to scale my project?

    Cal Henderson's book is:
    Building Scalable Web Sites: Building, scaling, and optimizing the next generation of web applications

    http://www.amazon.com/Building-Scalable-Web-Sites-applications/dp/059610...

    Re: How can I learn to scale my project?

    The suggestions you make are all very good. I'd add another dimension, especially for web sites that cater to large numbers of users:

    * Prepare to scale your app horizontally by splitting your data up by user and mapping groups of users to clusters. This way you can expand indefinitely: for every N users you add another cluster of web+app+db servers.
    * To prepare for this, keep data for each user well separated from other users (don't share gratuitously) and add a user_id or account_id field to each table to make it easy to grab a user's data and move it to a different database.
    * If you have "friend" type of links that connect users with each other, keep that as separate as possible since you'll have to special-case that when you move to multiple clusters.
    * Also, keep the user table as simple as possible as you'll have to replicate that so you can direct each user to the right cluster when they log in.
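The user-to-cluster mapping described in the list above could be sketched roughly as follows. This is only an in-memory illustration: `ClusterDirectory`, the `cluster-N` names, and the fill-the-newest-cluster policy are assumptions made for the example, and a real directory would live in the small replicated user table.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterDirectory:
    """Hypothetical global lookup: which cluster holds each user's data."""

    users_per_cluster: int
    clusters: list = field(default_factory=list)        # cluster names, oldest first
    user_to_cluster: dict = field(default_factory=dict)  # user_id -> cluster name

    def _load(self, cluster):
        """Count the users currently mapped to a cluster."""
        return sum(1 for c in self.user_to_cluster.values() if c == cluster)

    def assign(self, user_id):
        """Place a new user on the newest cluster, adding a cluster when it is full."""
        if user_id in self.user_to_cluster:
            return self.user_to_cluster[user_id]
        if not self.clusters or self._load(self.clusters[-1]) >= self.users_per_cluster:
            self.clusters.append(f"cluster-{len(self.clusters) + 1}")
        cluster = self.clusters[-1]
        self.user_to_cluster[user_id] = cluster
        return cluster

    def move(self, user_id, cluster):
        """Repoint a user once their rows (selected by user_id) have been copied."""
        if cluster not in self.clusters:
            raise ValueError(f"unknown cluster: {cluster}")
        self.user_to_cluster[user_id] = cluster

    def lookup(self, user_id):
        """At login, find the cluster that should serve this user."""
        return self.user_to_cluster[user_id]


if __name__ == "__main__":
    directory = ClusterDirectory(users_per_cluster=2)
    for user_id in range(1, 6):
        print(user_id, directory.assign(user_id))
```

Because every table carries the user_id, moving a user amounts to copying the rows with that id and then calling something like `move`; cross-user data such as friend links is the part this sketch deliberately leaves out.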

    Re: How can I learn to scale my project?

    I recently wrote about the lessons from eBay, Amazon, and LinkedIn in one of my posts: Architecture You Always Wondered About: Lessons Learned at Qcon.

    Below is a summary of the main bullets:

    Scalability -- How to Do It Right

    • Asynchronous event-driven design: Avoid synchronous interaction with the data or business logic tier as much as possible. Instead, use an event-driven approach and workflow.
    • Partitioning/Shards: You need to design your data model so that it will fit the partitioning model.
    • Parallel execution: Parallel execution should be used to get the most out of the available resources. A good place to use parallel execution is for processing user requests. In this case multiple instances of each service can take the requests from the messaging system and execute them in parallel. Another place for parallel processing is using MapReduce for performing aggregated requests on partitioned data.
    • Replication (read-mostly): In read-mostly scenarios (LinkedIn seems to fall into this category well), database replication can help load-balance the read load by splitting the read requests among the replicated database nodes.
    • Consistency without distributed transactions: This was one of the hot topics of the conference, which also sparked some discussion during one of the panels I participated in. An argument was made that to reach scalability you have to sacrifice consistency and handle consistency in your applications using things such as optimistic locking and asynchronous error-handling. It also assumes that you will need to handle idempotency in your code. My argument was that while this pattern addresses scalability, it creates complexity and is therefore error-prone. During another panel, Dan Pritchett argued that there are ways to avoid this level of complexity and still achieve the same goal, as I outlined in this blog post.
    • Move the database to the background: There was violent agreement that the database bottleneck can only be solved if database interactions happen in the background.

    Quoting Werner Vogels again: "To scale: No direct access to the database anymore. Instead data access is encapsulated in services (code and data together), with a stable, public interface."
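To make the ideas of background data access, idempotent handling, and a service that owns its data a little more concrete, here is a small sketch using only the Python standard library. `AccountService`, its method names, and the request-id scheme are hypothetical; an in-process queue and a dictionary stand in for a real messaging system and database.

```python
import queue
import threading


class AccountService:
    """Hypothetical service that owns its data.

    Callers see only deposit() and balance(); they never touch the store.
    """

    def __init__(self):
        self._balances = {}             # stands in for the database
        self._seen = set()              # request ids that were already applied
        self._requests = queue.Queue()  # stands in for the messaging system
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def deposit(self, request_id, account, amount):
        """Enqueue the write and return at once; it is applied in the background."""
        self._requests.put((request_id, account, amount))

    def balance(self, account):
        """Read a balance. For this demo only, wait until pending writes are applied."""
        self._requests.join()
        return self._balances.get(account, 0)

    def _run(self):
        # Single background worker: the only code that mutates the store.
        while True:
            request_id, account, amount = self._requests.get()
            try:
                if request_id not in self._seen:  # idempotency: skip redelivered requests
                    self._seen.add(request_id)
                    self._balances[account] = self._balances.get(account, 0) + amount
            finally:
                self._requests.task_done()


if __name__ == "__main__":
    service = AccountService()
    service.deposit("req-1", "alice", 10)
    service.deposit("req-1", "alice", 10)  # duplicate delivery, applied once
    service.deposit("req-2", "alice", 5)
    print(service.balance("alice"))
```

The duplicate `req-1` shows why idempotency matters once writes are asynchronous: a retried or redelivered message must not change the state twice.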

    You should also check out the white paper The Scalability Revolution: From Dead End to Open Road.

    At GigaSpaces we're delivering a platform that does all that for you, so you can try it out.

    HTH
    Nati S.
