Paper

Todd Hoff's picture

Federation at Flickr: Doing Billions of Queries Per Day

Flickr's lone database guy Dathan Pattishall made his excellent presentation available on how on how Flickr scales its backend to handle tremendous loads. Some of this information is available in Flickr Architecture, but the paper is so good it's worth another read. If you want to see sharding done right, at scale, take a look.

Todd Hoff's picture

Paper: MapReduce: Simplified Data Processing on Large Clusters

With Google entering the cloud space with Google AppEngine and a maturing Hadoop product, the MapReduce scaling approach might finally become a standard programmer practice. This is the best paper on the subject and is an excellent primer on a content-addressable memory future.

Some interesting stats from the paper: Google executes 100k MapReduce jobs each day; more than 20 petabytes of data are processed per day; more than 10k MapReduce programs have been implemented; machines are dual processor with gigabit ethernet and 4-8 GB of memory.

One common criticism ex-Googlers have is that it takes months to get up and be productive in the Google environment. Hopefully a way will be found to lower the learning curve and make programmers more productive faster.

From the abstract:

Todd Hoff's picture

Paper: On Designing and Deploying Internet-Scale Services

Greg Linden links to a heavily lesson ladened LISA 2007 paper titled On Designing and Deploying Internet-Scale Services by James Hamilton of the Windows Live Services Platform group. I know people crave nitty-gritty details, but this isn't a how to configure a web server article. It hitches you to a rocket and zooms you up to 50,000 feet so you can take a look at best web operations practices from a broad, yet practical perspective. The author and his team of contributors obviously have a lot of in the trenches experience. Many non-obvious topics are covered. And there's a lot to learn from.

The paper has too many details to cover here, but the big sections are:

  • Recommendations
  • Automatic Management and Provisioning
  • Dependency Management
  • Release Cycle and Testing
  • Operations and Capacity Planning
  • Graceful Degradation and Admission Control
  • Customer Self-Provisioning and Self-Help
  • Customer and Press Communication Plan

    In the recommendations we see some of our old favorites:

  • Paper: Asynchronous HTTP and Comet architectures

    Comet has popularized asynchronous non-blocking HTTP programming, making it practically indistinguishable from reverse Ajax, also known as server push. This JavaWorld article takes a wider view of asynchronous HTTP, explaining its role in developing high-performance HTTP proxies and non-blocking HTTP clients, as well as the long-lived HTTP connections associated with Comet.

    Todd Hoff's picture

    Paper: Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web

    Consistent hashing is one of those ideas that really puts the science in computer science and reminds us why all those really smart people spend years slaving over algorithms. Consistent hashing is "a scheme that provides hash table functionality in a way that the addition or removal of one slot does not significantly change the mapping of keys to slots" and was originally a way of distributing requests among a changing population of web servers. My first reaction to the idea was "wow, that's really smart" and I sadly realized I would never come up with something so elegant. I then immediately saw applications for it everywhere. And consistent hashing is used everywhere: distributed hash tables, overlay networks, P2P, IM, caching, and CDNs. Here's the abstract from the original paper and after the abstract are some links to a few very good articles with accessible explanations of consistent hashing and its applications in the real world.

    Todd Hoff's picture

    Paper: Container-based Operating System Virtualization: A Scalable, High-performance Alternative to Hypervisors

    One stumbling block of the the great march towards virtualization is the relatively poor performance of resource hungry applications like databases. We are told to develop and test using VMs, but deploy without them. Which kind of sucks IMHO. Maybe better virtualization technology can remove this split. This paper talks about a different approach to virtualization called "container-based" virtualization that can reportedly double the performance of traditional hypervisor systems like Xen. It does this by trading isolation for efficiency. Rather than maintaining complete isolation between VMs the container approach shares resources between VMs and thus gives higher performance while still guaranteeing strong fault, resource, and security isolation. It's yet another battle in computing's endless war of creating and destroying abstraction layers. I learned a lot from from this paper because of how it compared and contrasted traditional hypervisor and container based virtualization strategies. Good job.

    Todd Hoff's picture

    Paper: Dynamo: Amazon’s Highly Available Key-value Store

    Update 2: Read/WriteWeb has a good article talking about the scalability issues of relational databases and how Dynamo solves them: Amazon Dynamo: The Next Generation Of Virtual Distributed Storage. But since Dynamo is just another frustrating walled garden protected by barbed wire and guard dogs, its relevance is somewhat overstated.

    Update: Greg Linden has a take on the paper where he questions some of Amazon's design choices: emphasizing write availability over fast reads, a lack of indexing support, use of random distribution for load balancing, and punting on some scalability issues.

    Werner Vogels, Amazon's avuncular CTO, just announced a new paper on the internal database technology Amazon uses to handle tens of millions customers. I'll dive into more details later, but I thought you'd want to read it hot off the blog. The bad news is it won't be a service. They are keeping this tech not so secret, but very safe. Happily, it's another real-life example to learn from. As many top websites use a highly tuned key-value database at their core instead of a RDBMS, it's an important technology to understand.

    From the abstract you can get a feel for what the paper is about:

    Todd Hoff's picture

    Paper: Wikipedia's Site Internals, Configuration, Code Examples and Management Issues

    Wikipedia and Wikimedia have some of the best, most complete real-world documentation on how to build highly scalable systems. This paper by Domas Mituzas covers a lot of details about how Wikipedia works, including: an overview of the different packages used (Linux, PowerDNS, LVS, Squid, lighttpd, Apache, PHP5, Lucene, Mono, Memcached), how they use their CDN, how caching works, how they profile their code, how they store their media, how they structure their database access, how they handle search, how they handle load balancing and administration. All with real code examples and examples of configuration files. This is a really useful resource.

    Todd Hoff's picture

    Paper: Standardizing Storage Clusters (with pNFS)

    pNFS (parallel NFS) is the next generation of NFS and its main claim to fame is that it's clustered, which "enables clients to directly access file data spread over multiple storage servers in parallel. As a result, each client can leverage the full aggregate bandwidth of a clustered storage service at the granularity of an individual file." About pNFS StorageMojo says: pNFS is going to commoditize parallel data access. In 5 years we won’t know how we got along without it. Something to watch.

    Todd Hoff's picture

    Paper: Understanding and Building High Availability/Load Balanced Clusters

    A superb explanation by Theo Schlossnagle of how to deploy a high availability load balanced system using mod backhand and Wackamole. The idea is you don't need to buy expensive redundant hardware load balancers, you can make use of the hosts you already have to the same effect. The discussion of using peer-based HA solutions versus a single front-end HA device is well worth the read. Another interesting perspective in the document is to view load balancing as a resource allocation problem. There's also a nice discussion of the negative of effect of keep-alives on performance.

    Paper: Architecture of a Highly Scalable NIO-Based Server

    The article describes the basic architecture of a connection-oriented NIO-based java server. It takes a look at a preferred threading model, Java Non-blocking I/O and discusses the basic components of such a server.

    Todd Hoff's picture

    Paper: Brewer's Conjecture and the Feasibility of Consistent Available Partition-Tolerant Web Services

    Abstract: When designing distributed web services, there are three properties that are commonly desired: consistency, availability, and partition tolerance. It is impossible to achieve all three. In this note, we prove this conjecture in the asynchronous network model, and then discuss solutions to this dilemma in the partially synchronous model.

    Todd Hoff's picture

    Paper: Designing Disaster Tolerant High Availability Clusters

    A very detailed (339 pages) paper on how to use HP products to create a highly available cluster. It's somewhat dated and obviously concentrates on HP products, but it is still good information.

    Table of contents:

    1. Disaster Tolerance and Recovery in a Serviceguard Cluster
    2. Building an Extended Distance Cluster Using ServiceGuard
    3. Designing a Metropolitan Cluster
    4. Designing a Continental Cluster
    5. Building Disaster-Tolerant Serviceguard Solutions Using Metrocluster with Continuous Access XP
    6. Building Disaster Tolerant Serviceguard Solutions Using Metrocluster with EMC SRDF
    7. Cascading Failover in a Continental Cluster

    Evaluating the Need for Disaster Tolerance
    What is a Disaster Tolerant Architecture?
    Types of Disaster Tolerant Clusters

    Extended Distance Clusters
    Metropolitan Cluster
    Continental Cluster
    Continental Cluster With Cascading Failover

    Disaster Tolerant Architecture Guidelines

    Protecting Nodes through Geographic Dispersion
    Protecting Data through Replication
    Using Alternative Power Sources
    Creating Highly Available Networking
    Disaster Tolerant Cluster Limitations

    Managing a Disaster Tolerant Environment
    Using this Guide with Your Disaster Tolerant Cluster Products

    2. Building an Extended Distance Cluster Using ServiceGuard

    Types of Data Link for Storage and Networking
    Two Data Center Architecture

    Two Data Center FibreChannel Implementations
    Advantages and Disadvantages of a Two-Data-Center Architecture

    Three Data Center Architectures
    Rules for Separate Network and Data Links
    Guidelines on DWDM Links for Network and Data

    3. Designing a Metropolitan Cluster

    Designing a Disaster Tolerant Architecture for use with Metrocluster Products

    Single Data Center
    Two Data Centers and Third Location with Arbitrator(s)

    Additional EMC SRDF Configurations

    Setting up Hardware for 1 by 1 Configurations
    Setting up Hardware for M by N Configurations

    Worksheets

    Disaster Tolerant Checklist
    Cluster Configuration Worksheet
    Package Configuration Worksheet

    Next Steps

    4. Designing a Continental Cluster

    Understanding Continental Cluster Concepts

    Mutual Recovery Configuration
    Application Recovery in a Continental Cluster
    Monitoring over a Wide Area Network
    Cluster Events
    Interpreting the Significance of Cluster Events
    How Notifications Work
    Alerts
    Alarms
    Creating Notifications for Failure Events
    Creating Notifications for Events that Indicate a Return of Service
    Performing Cluster Recovery
    Notes on Packages in a Continental Cluster
    How Serviceguard commands work in a Continentalcluster

    Designing a Disaster Tolerant Architecture for use with Continentalclusters

    Mutual Recovery
    Serviceguard Clusters
    Data Replication
    Highly Available Wide Area Networking
    Data Center Processes
    Continentalclusters Worksheets

    Preparing the Clusters

    Setting up and Testing Data Replication
    Configuring a Cluster without Recovery Packages
    Configuring a Cluster with Recovery Packages

    Building the Continentalclusters Configuration

    Preparing Security Files
    Creating the Monitor Package
    Editing the Continentalclusters Configuration File
    Checking and Applying the Continentalclusters Configuration
    Starting the Continentalclusters Monitor Package
    Validating the Configuration
    Documenting the Recovery Procedure
    Reviewing the Recovery Procedure

    Testing the Continental Cluster

    Testing Individual Packages
    Testing Continentalclusters Operations

    Switching to the Recovery Packages in Case of Disaster

    Receiving Notification
    Verifying that Recovery is Needed
    Using the Recovery Command to Switch All Packages
    How the cmrecovercl Command Works

    Forcing a Package to Start
    Restoring Disaster Tolerance

    Restore Clusters to their Original Roles
    Primary Packages Remain on the Surviving Cluster
    Primary Packages Remain on the Surviving Cluster using cmswitchconcl
    Newly Created Cluster Will Run Primary Packages
    Newly Created Cluster Will Function as Recovery Cluster for All Recovery Groups

    Maintaining a Continental Cluster

    Adding a Node to a Cluster or Removing a Node from a Cluster
    Adding a Package to the Continental Cluster
    Removing a Package from the Continental Cluster
    Changing Monitoring Definitions
    Checking the Status of Clusters, Nodes, and Packages
    Reviewing Messages and Log Files
    Deleting a Continental Cluster Configuration
    Renaming a Continental Cluster
    Checking Java File Versions
    Next Steps

    Support for Oracle RAC Instances in a Continentalclusters Environment

    Configuring the Environment for Continentalclusters to Support Oracle RAC
    Initial Startup of Oracle RAC Instance in a Continentalclusters Environment
    Failover of Oracle RAC Instances to the Recovery Site
    Failback of Oracle RAC Instances After a Failover

    5. Building Disaster-Tolerant Serviceguard Solutions Using Metrocluster with Continuous Access XP

    Files for Integrating XP Disk Arrays with Serviceguard Clusters
    Overview of Continuous Access XP Concepts

    PVOLs and SVOLs
    Device Groups and Fence Levels

    Creating the Cluster
    Preparing the Cluster for Data Replication

    Creating the RAID Manager Configuration
    Defining Storage Units

    Configuring Packages for Disaster Recovery
    Completing and Running a Metrocluster Solution with Continuous Access XP

    Maintaining a Cluster that uses Metrocluster/CA
    XP/CA Device Group Monitor

    Completing and Running a Continental Cluster Solution with Continuous Access XP

    Setting up a Primary Package on the Primary Cluster
    Setting up a Recovery Package on the Recovery Cluster
    Setting up the Continental Cluster Configuration
    Switching to the Recovery Cluster in Case of Disaster
    Failback Scenarios
    Maintaining the Continuous Access XP Data Replication Environment

    6. Building Disaster Tolerant Serviceguard Solutions Using Metrocluster with EMC SRDF

    Files for Integrating ServiceGuard with EMC SRDF
    Overview of EMC and SRDF Concepts
    Preparing the Cluster for Data Replication

    Installing the Necessary Software
    Building the Symmetrix CLI Database
    Determining Symmetrix Device Names on Each Node

    Building a Metrocluster Solution with EMC SRDF

    Setting up 1 by 1 Configurations
    Grouping the Symmetrix Devices at Each Data Center
    Setting up M by N Configurations
    Configuring Serviceguard Packages for Automatic Disaster Recovery
    Maintaining a Cluster that Uses Metrocluster/SRDF
    Managing Business Continuity Volumes
    R1/R2 Swapping

    Building a Continental Cluster Solution with EMC SRDF

    Setting up a Primary Package on the Primary Cluster
    Setting up a Recovery Package on the Recovery Cluster
    Setting up the Continental Cluster Configuration
    Switching to the Recovery Cluster in Case of Disaster
    Failback Scenarios
    Maintaining the EMC SRDF Data Replication Environment
    R1/R2 Swapping

    7. Cascading Failover in a Continental Cluster

    Overview

    Symmetrix Configuration
    Using Template Files

    Data Storage Setup

    Setting Up Symmetrix Device Groups
    Setting up Volume Groups
    Testing the Volume Groups

    Primary Cluster Package Setup
    Recovery Cluster Package Setup
    Continental Cluster Configuration
    Data Replication Procedures

    Data Initialization Procedures
    Data Refresh Procedures in the Steady State
    Data Replication in Failover and Failback Scenarios

    Todd Hoff's picture

    Paper: Lightweight Web servers

    This paper is a great overview of different lightweight web servers. A lot of websites use lightweight web servers to serve images and static content. YouTube is one example: http://highscalability.com/youtube-architecture.

    So if you need to improve performance consider changing over a different web server for some types of content.

    Todd Hoff's picture

    Paper: Guide to Cost-effective Database Scale-Out using MySQL

    This paper is behind a registration-wall, you can't do anything on the MySQL site without filling out a form of some kind, but it's a short, decent introduction to using MySQL for a good sized website.

    A Quick Hit of What's Inside

    Scale-out vs. Scale Up, Customers using MySQL, Scale-Out Reference Architecture

    Todd Hoff's picture

    Paper: MySQL Scale-Out by application partitioning

    Eventually every database system hit its limits. Especially
    on the Internet, where you have millions of users
    which theoretically access your database simultaneously,
    eventually your IO system will be a bottleneck. [A] promising but more complex solution with nearly no scale-out limits is application partitioning. If
    and when you get into the top-1000 rank on alexa [1], you have to think about such solutions.

    A Quick Hit of What's Inside

    Horizontal application partitioning, Vertical application partitioning, Disk IO calculations, How to partition an entity

    Todd Hoff's picture

    Paper: The Clustered Storage Revolution

    If the clustered file system, clustered storage system, storage virtualization movement is new to you then this is a good intro paper. I's a both vendor puff piece and informative, so it might be worth your time.

    A Quick Hit of What's Inside

    Clustered storage architectures have the ability to pull together two or more storage devices to behave as a single entity. Clustered storage can be broken down into three types:
    * 2-way simple failover clustering
    * Namespace aggregation
    * Clustered storage with a distributed file systems (DFS)

    Todd Hoff's picture

    Paper: Replication Under Scalable Hashing

    Replication Under Scalable Hashing:
    A Family of Algorithms for Scalable Decentralized Data Distribution

    Typical algorithms for decentralized data distribution
    work best in a system that is fully built before it first used;
    adding or removing components results in either extensive
    reorganization of data or load imbalance in the system.
    We have developed a family of decentralized algorithms,
    RUSH (Replication Under Scalable Hashing), that
    maps replicated objects to a scalable collection of storage
    servers or disks. RUSH algorithms distribute objects to
    servers according to user-specified server weighting. While
    all RUSH variants support addition of servers to the system,
    different variants have different characteristics with
    respect to lookup time in petabyte-scale systems, performance
    with mirroring (as opposed to redundancy codes),
    and storage server removal. All RUSH variants redistribute
    as few objects as possible when new servers are
    added or existing servers are removed, and all variants
    guarantee that no two replicas of a particular object are
    ever placed on the same server. Because there is no central
    directory, clients can compute data locations in parallel,
    allowing thousands of clients to access objects on thousands
    of servers simultaneously.

    Syndicate content