Weblinks: Paper

Categories

Links in this category and its subcategories

Todd Hoff's picture

A High Performance Memory Database for Web Application Caches

Abstract—This paper presents the architecture and
characteristics of a memory database intended to be used as a
cache engine for web applications. Primary goals of this database
are speed and efficiency while running on SMP systems with
several CPU cores (four and more). A secondary goal is the
support for simple metadata structures associated with cached
data that can aid in efficient use of the cache. Due to these goals,
some data structures and algorithms normally associated with
this field of computing needed to be adapted to the new
environment.

Todd Hoff's picture

A Scalable, Commodity Data Center Network Architecture

Looks interesting...

Abstract:
Today’s data centers may contain tens of thousands of computers with significant aggregate bandwidth requirements. The network architecture typically consists of a tree of routing and switching elements with progressively more specialized and expensive equipment moving up the network hierarchy. Unfortunately, even when deploying the highest-end IP switches/routers, resulting topologies may only support 50% of the aggregate bandwidth available at the edge of the network, while still incurring tremendous cost. Nonuniform bandwidth among data center nodes complicates application design and limits overall system performance.
In this paper, we show how to leverage largely commodity Ethernet switches to support the full aggregate bandwidth of clusters consisting of tens of thousands of elements. Similar to how clusters of commodity computers have largely replaced more specialized SMPs and MPPs, we argue that appropriately architected and interconnected commodity switches may deliver more performance at less cost than available from today’s higher-end solutions. Our approach requires no modifications to the end host network interface, operating system, or applications; critically, it is fully backward compatible with Ethernet, IP, and TCP.

Todd Hoff's picture

FastBit: An Efficient Compressed Bitmap Index Technology

Data mining and fast queries are always in that bin of hard to do things where doing something smarter can yield big results. Bloom Filters are one such do it smarter strategy, compressed bitmap indexes are another. In one application "FastBit outruns other search indexes by a factor of 10 to 100 and doesn’t require much more room than the original data size." The data size is an interesting metric. Our old standard b-trees can be two to four times larger than the original data. In a test searching an Enron email database FastBit outran MySQL by 10 to 1,000 times.


FastBit is a software tool for searching large read-only datasets. It organizes user data in a column-oriented structure which is efficient for on-line analytical processing (OLAP), and utilizes compressed bitmap indices to further speed up query processing. Analyses have proven the compressed bitmap index used in FastBit to be theoretically optimal for one-dimensional queries. Compared with other optimal indexing methods, bitmap indices are superior because they can be efficiently combined to answer multi-dimensional queries whereas other optimal methods can not.

It's not all just map-reduce and add more servers until your attic is full.

Related Articles


  • FastBit: Digging through databases faster. An excellent description of how FastBit works, especially compared to b-trees.

  • Todd Hoff's picture

    Federation at Flickr: Doing Billions of Queries Per Day

    Flickr's lone database guy Dathan Pattishall made his excellent presentation available on how on how Flickr scales its backend to handle tremendous loads. Some of this information is available in Flickr Architecture, but the paper is so good it's worth another read. If you want to see sharding done right, at scale, take a look.

    Todd Hoff's picture

    Google's Paxos Made Live – An Engineering Perspective

    This is an unusually well written and useful paper. It talks in detail about experiences implementing a complex project, something we don't see very often. They shockingly even admit that creating a working implementation of Paxos was more difficult than just translating the pseudo code. Imagine that, programmers aren't merely typists! I particularly like the explanation of the Paxos algorithm and why anyone would care about it, working with disk corruption, using leases to support simultaneous reads, using epoch numbers to indicate a new master election, using snapshots to prevent unbounded logs, using MultiOp to implement database transactions, how they tested the system, and their openness with the various problems they had. A lot to learn here.

    From the paper:
    We describe our experience building a fault-tolerant data-base using the Paxos consensus algorithm.
    Despite the existing literature in the field, building such a database proved to be non-trivial. We describe
    selected algorithmic and engineering problems encountered, and the solutions we found for them. Our
    measurements indicate that we have built a competitive system.

    Todd Hoff's picture

    Lessons Learned at 208K: Towards Debugging Millions of Cores

    How do we debug and profile a cloud full of processors and threads? It's a problem more will be seeing as we code big scary programs that run on even bigger scarier clouds. Logging gets you far, but sometimes finding the root cause of problem requires delving deep into a program's execution. I don't know about you, but setting up 200,000+ gdb instances doesn't sound all that appealing. Tools like STAT (Stack Trace Analysis Tool) are being developed to help with this huge task. STAT "gathers and merges stack traces from a parallel application’s processes." So STAT isn't a low level debugger, but it will help you find the needle in a million haystacks.

    Abstract:


    Petascale systems will present several new challenges to performance and correctness tools. Such machines may contain millions of cores, requiring that tools use scalable data structures and analysis algorithms to collect and to process application data. In addition, at such scales, each tool itself will become a large parallel application – already, debugging the full BlueGene/L (BG/L) installation at the Lawrence Livermore National Laboratory requires employing 1664 tool daemons. To reach such sizes and beyond, tools must use a scalable communication infrastructure and manage their own tool processes efficiently. Some system resources, such as the file system, may also become tool bottlenecks.

    In this paper, we present challenges to petascale tool development, using the Stack Trace Analysis Tool (STAT) as a case study. STAT is a lightweight tool that gathers and merges stack traces from a parallel application to identify process equivalence classes. We use results gathered at thousands of tasks on an Infiniband cluster and results up to 208K processes on BG/L to identify current scalability issues as well as challenges that will be faced at the petascale. We then present implemented solutions to these challenges and show the resulting performance improvements. We also discuss future plans to meet the debugging demands of petascale machines.

    Paper: Architecture of a Highly Scalable NIO-Based Server

    The article describes the basic architecture of a connection-oriented NIO-based java server. It takes a look at a preferred threading model, Java Non-blocking I/O and discusses the basic components of such a server.

    Paper: Asynchronous HTTP and Comet architectures

    Comet has popularized asynchronous non-blocking HTTP programming, making it practically indistinguishable from reverse Ajax, also known as server push. This JavaWorld article takes a wider view of asynchronous HTTP, explaining its role in developing high-performance HTTP proxies and non-blocking HTTP clients, as well as the long-lived HTTP connections associated with Comet.

    Todd Hoff's picture

    Paper: Brewer's Conjecture and the Feasibility of Consistent Available Partition-Tolerant Web Services

    Abstract: When designing distributed web services, there are three properties that are commonly desired: consistency, availability, and partition tolerance. It is impossible to achieve all three. In this note, we prove this conjecture in the asynchronous network model, and then discuss solutions to this dilemma in the partially synchronous model.

    Todd Hoff's picture

    Paper: Consensus Protocols: Paxos

    Update:Barbara Liskov’s Turing Award, and Byzantine Fault Tolerance.

    Henry Robinson has created an excellent series of articles on consensus protocols. We already covered his 2 Phase Commit article and he also has a 3 Phase Commit article showing how to handle 2PC under single node failures.

    But that is not enough! 3PC works well under node failures, but fails for network failures. So another consensus mechanism is needed that handles both network and node failures. And that's Paxos.

    Paxos correctly handles both types of failures, but it does this by becoming inaccessible if too many components fail. This is the "liveness" property of protocols. Paxos waits until the faults are fixed. Read queries can be handled,
    but updates will be blocked until the protocol thinks it can make forward progress.

    The liveness of Paxos is primarily dependent on network stability. In a distributed heterogeneous environment you are at risk of losing the ability to make updates. Users hate that.

    So when companies like Amazon do the seemingly insane thing of creating eventually consistent databases, it should be a little easier to understand now. Partitioning is required for scalability. Partitioning brings up these nasty consensus issues. Not being able to write under partition failures is unacceptable. Therefor create a system that can always write and work on consistency when all the downed partitions/networks are repaired.

    Todd Hoff's picture

    Paper: Consensus Protocols: Two-Phase Commit

    Henry Robinson has created an excellent series of articles on consensus protocols. Henry starts with a very useful discussion of what all this talk about consensus really means: The consensus problem is the problem of getting a set of nodes in a distributed system to agree on something - it might be a value, a course of action or a decision. Achieving consensus allows a distributed system to act as a single entity, with every individual node aware of and in agreement with the actions of the whole of the network.

    In this article Henry tackles Two-Phase Commit, the protocol most databases use to arrive at a consensus for database writes. The article is very well written with lots of pretty and informative pictures. He did a really good job.

    In conclusion we learn 2PC is very efficient, a minimal number of messages are exchanged and latency is low. The problem is when a co-ordinator fails availability is dramatically reduced. This is why 2PC isn't generally used on highly distributed systems. To solve that problem we have to move on to different algorithms and that is the subject of other articles.

    Todd Hoff's picture

    Paper: Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web

    Consistent hashing is one of those ideas that really puts the science in computer science and reminds us why all those really smart people spend years slaving over algorithms. Consistent hashing is "a scheme that provides hash table functionality in a way that the addition or removal of one slot does not significantly change the mapping of keys to slots" and was originally a way of distributing requests among a changing population of web servers. My first reaction to the idea was "wow, that's really smart" and I sadly realized I would never come up with something so elegant. I then immediately saw applications for it everywhere. And consistent hashing is used everywhere: distributed hash tables, overlay networks, P2P, IM, caching, and CDNs. Here's the abstract from the original paper and after the abstract are some links to a few very good articles with accessible explanations of consistent hashing and its applications in the real world.

    Todd Hoff's picture

    Paper: Container-based Operating System Virtualization: A Scalable, High-performance Alternative to Hypervisors

    One stumbling block of the the great march towards virtualization is the relatively poor performance of resource hungry applications like databases. We are told to develop and test using VMs, but deploy without them. Which kind of sucks IMHO. Maybe better virtualization technology can remove this split. This paper talks about a different approach to virtualization called "container-based" virtualization that can reportedly double the performance of traditional hypervisor systems like Xen. It does this by trading isolation for efficiency. Rather than maintaining complete isolation between VMs the container approach shares resources between VMs and thus gives higher performance while still guaranteeing strong fault, resource, and security isolation. It's yet another battle in computing's endless war of creating and destroying abstraction layers. I learned a lot from from this paper because of how it compared and contrasted traditional hypervisor and container based virtualization strategies. Good job.

    Todd Hoff's picture

    Paper: Designing Disaster Tolerant High Availability Clusters

    A very detailed (339 pages) paper on how to use HP products to create a highly available cluster. It's somewhat dated and obviously concentrates on HP products, but it is still good information.

    Table of contents:

    1. Disaster Tolerance and Recovery in a Serviceguard Cluster
    2. Building an Extended Distance Cluster Using ServiceGuard
    3. Designing a Metropolitan Cluster
    4. Designing a Continental Cluster
    5. Building Disaster-Tolerant Serviceguard Solutions Using Metrocluster with Continuous Access XP
    6. Building Disaster Tolerant Serviceguard Solutions Using Metrocluster with EMC SRDF
    7. Cascading Failover in a Continental Cluster

    Evaluating the Need for Disaster Tolerance
    What is a Disaster Tolerant Architecture?
    Types of Disaster Tolerant Clusters

    Extended Distance Clusters
    Metropolitan Cluster
    Continental Cluster
    Continental Cluster With Cascading Failover

    Disaster Tolerant Architecture Guidelines

    Protecting Nodes through Geographic Dispersion
    Protecting Data through Replication
    Using Alternative Power Sources
    Creating Highly Available Networking
    Disaster Tolerant Cluster Limitations

    Managing a Disaster Tolerant Environment
    Using this Guide with Your Disaster Tolerant Cluster Products

    2. Building an Extended Distance Cluster Using ServiceGuard

    Types of Data Link for Storage and Networking
    Two Data Center Architecture

    Two Data Center FibreChannel Implementations
    Advantages and Disadvantages of a Two-Data-Center Architecture

    Three Data Center Architectures
    Rules for Separate Network and Data Links
    Guidelines on DWDM Links for Network and Data

    3. Designing a Metropolitan Cluster

    Designing a Disaster Tolerant Architecture for use with Metrocluster Products

    Single Data Center
    Two Data Centers and Third Location with Arbitrator(s)

    Additional EMC SRDF Configurations

    Setting up Hardware for 1 by 1 Configurations
    Setting up Hardware for M by N Configurations

    Worksheets

    Disaster Tolerant Checklist
    Cluster Configuration Worksheet
    Package Configuration Worksheet

    Next Steps

    4. Designing a Continental Cluster

    Understanding Continental Cluster Concepts

    Mutual Recovery Configuration
    Application Recovery in a Continental Cluster
    Monitoring over a Wide Area Network
    Cluster Events
    Interpreting the Significance of Cluster Events
    How Notifications Work
    Alerts
    Alarms
    Creating Notifications for Failure Events
    Creating Notifications for Events that Indicate a Return of Service
    Performing Cluster Recovery
    Notes on Packages in a Continental Cluster
    How Serviceguard commands work in a Continentalcluster

    Designing a Disaster Tolerant Architecture for use with Continentalclusters

    Mutual Recovery
    Serviceguard Clusters
    Data Replication
    Highly Available Wide Area Networking
    Data Center Processes
    Continentalclusters Worksheets

    Preparing the Clusters

    Setting up and Testing Data Replication
    Configuring a Cluster without Recovery Packages
    Configuring a Cluster with Recovery Packages

    Building the Continentalclusters Configuration

    Preparing Security Files
    Creating the Monitor Package
    Editing the Continentalclusters Configuration File
    Checking and Applying the Continentalclusters Configuration
    Starting the Continentalclusters Monitor Package
    Validating the Configuration
    Documenting the Recovery Procedure
    Reviewing the Recovery Procedure

    Testing the Continental Cluster

    Testing Individual Packages
    Testing Continentalclusters Operations

    Switching to the Recovery Packages in Case of Disaster

    Receiving Notification
    Verifying that Recovery is Needed
    Using the Recovery Command to Switch All Packages
    How the cmrecovercl Command Works

    Forcing a Package to Start
    Restoring Disaster Tolerance

    Restore Clusters to their Original Roles
    Primary Packages Remain on the Surviving Cluster
    Primary Packages Remain on the Surviving Cluster using cmswitchconcl
    Newly Created Cluster Will Run Primary Packages
    Newly Created Cluster Will Function as Recovery Cluster for All Recovery Groups

    Maintaining a Continental Cluster

    Adding a Node to a Cluster or Removing a Node from a Cluster
    Adding a Package to the Continental Cluster
    Removing a Package from the Continental Cluster
    Changing Monitoring Definitions
    Checking the Status of Clusters, Nodes, and Packages
    Reviewing Messages and Log Files
    Deleting a Continental Cluster Configuration
    Renaming a Continental Cluster
    Checking Java File Versions
    Next Steps

    Support for Oracle RAC Instances in a Continentalclusters Environment

    Configuring the Environment for Continentalclusters to Support Oracle RAC
    Initial Startup of Oracle RAC Instance in a Continentalclusters Environment
    Failover of Oracle RAC Instances to the Recovery Site
    Failback of Oracle RAC Instances After a Failover

    5. Building Disaster-Tolerant Serviceguard Solutions Using Metrocluster with Continuous Access XP

    Files for Integrating XP Disk Arrays with Serviceguard Clusters
    Overview of Continuous Access XP Concepts

    PVOLs and SVOLs
    Device Groups and Fence Levels

    Creating the Cluster
    Preparing the Cluster for Data Replication

    Creating the RAID Manager Configuration
    Defining Storage Units

    Configuring Packages for Disaster Recovery
    Completing and Running a Metrocluster Solution with Continuous Access XP

    Maintaining a Cluster that uses Metrocluster/CA
    XP/CA Device Group Monitor

    Completing and Running a Continental Cluster Solution with Continuous Access XP

    Setting up a Primary Package on the Primary Cluster
    Setting up a Recovery Package on the Recovery Cluster
    Setting up the Continental Cluster Configuration
    Switching to the Recovery Cluster in Case of Disaster
    Failback Scenarios
    Maintaining the Continuous Access XP Data Replication Environment

    6. Building Disaster Tolerant Serviceguard Solutions Using Metrocluster with EMC SRDF

    Files for Integrating ServiceGuard with EMC SRDF
    Overview of EMC and SRDF Concepts
    Preparing the Cluster for Data Replication

    Installing the Necessary Software
    Building the Symmetrix CLI Database
    Determining Symmetrix Device Names on Each Node

    Building a Metrocluster Solution with EMC SRDF

    Setting up 1 by 1 Configurations
    Grouping the Symmetrix Devices at Each Data Center
    Setting up M by N Configurations
    Configuring Serviceguard Packages for Automatic Disaster Recovery
    Maintaining a Cluster that Uses Metrocluster/SRDF
    Managing Business Continuity Volumes
    R1/R2 Swapping

    Building a Continental Cluster Solution with EMC SRDF

    Setting up a Primary Package on the Primary Cluster
    Setting up a Recovery Package on the Recovery Cluster
    Setting up the Continental Cluster Configuration
    Switching to the Recovery Cluster in Case of Disaster
    Failback Scenarios
    Maintaining the EMC SRDF Data Replication Environment
    R1/R2 Swapping

    7. Cascading Failover in a Continental Cluster

    Overview

    Symmetrix Configuration
    Using Template Files

    Data Storage Setup

    Setting Up Symmetrix Device Groups
    Setting up Volume Groups
    Testing the Volume Groups

    Primary Cluster Package Setup
    Recovery Cluster Package Setup
    Continental Cluster Configuration
    Data Replication Procedures

    Data Initialization Procedures
    Data Refresh Procedures in the Steady State
    Data Replication in Failover and Failback Scenarios

    Todd Hoff's picture

    Paper: Dynamo: Amazon’s Highly Available Key-value Store

    Update 2: Read/WriteWeb has a good article talking about the scalability issues of relational databases and how Dynamo solves them: Amazon Dynamo: The Next Generation Of Virtual Distributed Storage. But since Dynamo is just another frustrating walled garden protected by barbed wire and guard dogs, its relevance is somewhat overstated.

    Update: Greg Linden has a take on the paper where he questions some of Amazon's design choices: emphasizing write availability over fast reads, a lack of indexing support, use of random distribution for load balancing, and punting on some scalability issues.

    Werner Vogels, Amazon's avuncular CTO, just announced a new paper on the internal database technology Amazon uses to handle tens of millions customers. I'll dive into more details later, but I thought you'd want to read it hot off the blog. The bad news is it won't be a service. They are keeping this tech not so secret, but very safe. Happily, it's another real-life example to learn from. As many top websites use a highly tuned key-value database at their core instead of a RDBMS, it's an important technology to understand.

    From the abstract you can get a feel for what the paper is about:

    Todd Hoff's picture

    Paper: Flux: An Adaptive Partitioning Operator for Continuous Query Systems

    At the core of the new real-time web, which is really really old, are continuous queries. I like how this paper proposed to handle dynamic demand and dynamic resource availability by making the underlying system adaptable, which seems like a very cloudy kind of thing to do.

    Abstract:


    The long-running nature of continuous queries poses new scalability challenges for dataflow processing. CQ systems execute pipelined dataflows that may be shared across multiple
    queries. The scalability of these dataflows is limited by their constituent, stateful operators – e.g. windowed joins or grouping operators. To scale such operators, a natural solution is to partition them across a shared-nothing platform. But in the CQ context, traditional, static techniques
    for partitioned parallelism can exhibit detrimental imbalances as workload and runtime conditions evolve. Longrunning CQ dataflows must continue to function robustly in
    the face of these imbalances. To address this challenge, we introduce a dataflow operator
    called Flux that encapsulates adaptive state partitioning and dataflow routing. Flux is placed between producerconsumer stages in a dataflow pipeline to repartition stateful operators while the pipeline is still executing. We present the Flux architecture, along with repartitioning policies that can be used for CQ operators under shifting processing and memory loads. We show that the Flux mechanism and
    these policies can provide several factors improvement in throughput and orders of magnitude improvement in average latency over the static case

    Todd Hoff's picture

    Paper: GargantuanComputing—GRIDs and P2P

    I found the discussion of the available bandwidth of tree vs higher dimensional virtual networks topologies quite, to quote Spock, fascinating:

    A mathematical analysis by Ritter (2002) (one of the original developers
    of Napster) presented a detailed numerical argument demonstrating that the
    Gnutella network could not scale to the capacity of its competitor, the
    Napster network. Essentially, that model showed that the Gnutella network is
    severely bandwidth-limited long before the P2P population reaches a million
    peers. In each of these previous studies, the conclusions have overlooked the
    intrinsic bandwidth limits of the underlying topology in the Gnutella network:
    a Cayley tree (Rains and Sloane 1999) (see Sect. 9.4 for the definition).

    Trees are known to have lower aggregate bandwidth than higher dimensional
    topologies, e.g., hypercubes and hypertori. Studies of interconnection
    topologies in the literature have tended to focus on hardware implementations
    (see, e.g., Culler et al. 1996; Buyya 1999), which are generally limited
    by the cost of the chips and wires to a few thousand nodes. P2P networks,
    on the other hand, are intended to support from hundreds of thousands to
    millions of simultaneous peers, and since they are implemented in software,
    hyper-topologies are relatively unfettered by the economics of hardware.

    In this chapter, we analyze the scalability of several alternative topologies
    and compare their throughput up to 2–3 million peers. The virtual hypercube
    and the virtual hypertorus offer near-linear scalable bandwidth subject to
    the number of peer TCP/IP connections that can be simultaneously kept
    open.

    Todd Hoff's picture

    Paper: Graph Databases and the Future of Large-Scale Knowledge Management

    Relational databases, document databases, and distributed hash tables get most of the hype these days, but there's another option: graph databases. Back to the future it seems. Here's a really interesting paper by Marko A. Rodriguez introducing the graph model and it's extension to representing the world wide web of data.


    Modern day open source and commercial graph databases can store on the order of 1 billion relationships with some databases reaching the 10 billion mark. These developments are making the graph database practical for applications that require large-scale knowledge structures. Moreover, with
    the Web of Data standards set forth by the Linked Data community, it is possible to interlink graph databases across the web into a giant global knowledge structure. This talk will discuss graph databases, their underlying data model, their querying mechanisms, and the benefits of the graph data structure for modeling and analysis.

    Todd Hoff's picture

    Paper: Guide to Cost-effective Database Scale-Out using MySQL

    This paper is behind a registration-wall, you can't do anything on the MySQL site without filling out a form of some kind, but it's a short, decent introduction to using MySQL for a good sized website.

    A Quick Hit of What's Inside

    Scale-out vs. Scale Up, Customers using MySQL, Scale-Out Reference Architecture

    Todd Hoff's picture

    Paper: Lightweight Web servers

    This paper is a great overview of different lightweight web servers. A lot of websites use lightweight web servers to serve images and static content. YouTube is one example: http://highscalability.com/youtube-architecture.

    So if you need to improve performance consider changing over a different web server for some types of content.

    Todd Hoff's picture

    Paper: MapReduce: Simplified Data Processing on Large Clusters

    Update: MapReduce and PageRank Notes from Remzi Arpaci-Dusseau's Fall 2008 class . Collects interesting facts about MapReduce and PageRank. For example, the history of the solution to searching for the term "flu" is traced through multiple generations of technology.

    With Google entering the cloud space with Google AppEngine and a maturing Hadoop product, the MapReduce scaling approach might finally become a standard programmer practice. This is the best paper on the subject and is an excellent primer on a content-addressable memory future.

    Some interesting stats from the paper: Google executes 100k MapReduce jobs each day; more than 20 petabytes of data are processed per day; more than 10k MapReduce programs have been implemented; machines are dual processor with gigabit ethernet and 4-8 GB of memory.

    One common criticism ex-Googlers have is that it takes months to get up and be productive in the Google environment. Hopefully a way will be found to lower the learning curve and make programmers more productive faster.

    From the abstract:

    Todd Hoff's picture

    Paper: MySQL Scale-Out by application partitioning

    Eventually every database system hit its limits. Especially
    on the Internet, where you have millions of users
    which theoretically access your database simultaneously,
    eventually your IO system will be a bottleneck. [A] promising but more complex solution with nearly no scale-out limits is application partitioning. If
    and when you get into the top-1000 rank on alexa [1], you have to think about such solutions.

    A Quick Hit of What's Inside

    Horizontal application partitioning, Vertical application partitioning, Disk IO calculations, How to partition an entity

    Todd Hoff's picture

    Paper: On Delivering Embarrassingly Distributed Cloud Services

    How do we scale datacenters? Should we build a few mammoth million machine datacenters or many smaller micro datacenters? Intuitively we usually go with a bigger is better economies of scale type argument, but it may not be so. What works for Walmart may not work for White Box World. Mega datacenters may actually exhibit diseconomies of scale. It may be better to run applications over many distributed micro datacenters instead of one large one.

    This paper by Ken Church, Albert Greenberg, and James Hamilton, all from Microsoft, takes a look at the different issues and concludes:


    Putting it all together, the micro model offers a design point with attractive performance, reliability, scale and cost. Given how much the industry is currently investing in the mega model, the industry would do well to consider the micro alternative.

    Todd Hoff's picture

    Paper: On Designing and Deploying Internet-Scale Services

    Greg Linden links to a heavily lesson ladened LISA 2007 paper titled On Designing and Deploying Internet-Scale Services by James Hamilton of the Windows Live Services Platform group. I know people crave nitty-gritty details, but this isn't a how to configure a web server article. It hitches you to a rocket and zooms you up to 50,000 feet so you can take a look at best web operations practices from a broad, yet practical perspective. The author and his team of contributors obviously have a lot of in the trenches experience. Many non-obvious topics are covered. And there's a lot to learn from.

    The paper has too many details to cover here, but the big sections are:

  • Recommendations
  • Automatic Management and Provisioning
  • Dependency Management
  • Release Cycle and Testing
  • Operations and Capacity Planning
  • Graceful Degradation and Admission Control
  • Customer Self-Provisioning and Self-Help
  • Customer and Press Communication Plan

    In the recommendations we see some of our old favorites:

  • Todd Hoff's picture

    Paper: Optimistic Replication

    To scale in the large you have to partition. Data has to be spread around, replicated, and kept consistent (keeping replicas sufficiently similar to one another despite operations being submitted
    independently at different sites). The result is a highly available, well performing, and scalable system.

    Partitioning is required, but it's a pain to do efficiently and correctly. Until Quantum teleportation becomes a reality how data is kept consistent across a bewildering number of failure scenarios is a key design decision.

    This excellent paper by Yasushi Saito and Marc Shapiro takes us on a wild ride (OK, maybe not so wild) of different approaches to achieving consistency.

    What's cool about this paper is they go over some real systems that we are familiar with and cover how they work: DNS (single-master, state-transfer), Usenet (multi-master), PDAs (multi-master, state-transfer, manual or application-specific conflict resolution), Bayou (multi-master, operation-transfer, epidemic propagation, application conflict resolution), CVS (multi-master operation-transfer, centralized, manual conflict resolution).

    The paper then goes on to explain in detail the different approaches to achieving consistency. Most of us will never have to write the central nervous system of an application like this, but knowing about the different approaches and tradesoffs is priceless.

    The abstract:

    Paper: Pig Latin: A Not-So-Foreign Language for Data Processing

    Yahoo has developed a new language called Pig Latin that fit in a sweet spot between high-level declarative querying in the spirit of SQL, and low-level, procedural programming `a la map-reduce and combines best of both worlds.

    The accompanying system, Pig, is fully implemented, and compiles Pig Latin into physical plans that are executed over Hadoop, an open-source, map-reduce implementation. Pig has just graduated from the Apache Incubator and joined Hadoop as a subproject.

    The paper has a few examples of how engineers at Yahoo! are using Pig to dramatically reduce the time required for the development and execution of their data analysis tasks, compared to
    using Hadoop directly.

    References: Apache Pig Wiki

    Todd Hoff's picture

    Paper: Real-world Concurrency

    An excellent article by Bryan Cantrill and Jeff Bonwick on how to write multi-threaded code. With more processors and no magic bullet solution for how to use them, knowing how to write multiprocessor code that doesn't screw up your system is still a valuable skill. Some topics:

  • Know your cold paths from your hot paths.
  • Intuition is frequently wrong—be data intensive.
  • Know when—and when not—to break up a lock.
  • Be wary of readers/writer locks.
  • Consider per-CPU locking.
  • Know when to broadcast—and when to signal.
  • Learn to debug postmortem.
  • Design your systems to be composable.
  • Don't use a semaphore where a mutex would suffice.
  • Consider memory retiring to implement per-chain hash-table locks.
  • Be aware of false sharing.
  • Consider using nonblocking synchronization routines to monitor contention.
  • When reacquiring locks, consider using generation counts to detect state change.
  • Use wait- and lock-free structures only if you absolutely must.
  • Prepare for the thrill of victory—and the agony of defeat.

    While I don't agree that code using locks can be made composable, this articles covers a lot of very useful nitty-gritty details that will up your expert rating a couple points.

  • Todd Hoff's picture

    Paper: Replication Under Scalable Hashing

    Replication Under Scalable Hashing:
    A Family of Algorithms for Scalable Decentralized Data Distribution

    Typical algorithms for decentralized data distribution
    work best in a system that is fully built before it first used;
    adding or removing components results in either extensive
    reorganization of data or load imbalance in the system.
    We have developed a family of decentralized algorithms,
    RUSH (Replication Under Scalable Hashing), that
    maps replicated objects to a scalable collection of storage
    servers or disks. RUSH algorithms distribute objects to
    servers according to user-specified server weighting. While
    all RUSH variants support addition of servers to the system,
    different variants have different characteristics with
    respect to lookup time in petabyte-scale systems, performance
    with mirroring (as opposed to redundancy codes),
    and storage server removal. All RUSH variants redistribute
    as few objects as possible when new servers are
    added or existing servers are removed, and all variants
    guarantee that no two replicas of a particular object are
    ever placed on the same server. Because there is no central
    directory, clients can compute data locations in parallel,
    allowing thousands of clients to access objects on thousands
    of servers simultaneously.

    Todd Hoff's picture

    Paper: Scalability by Design - Coding for Systems With Large CPU Counts

    The multi-cores are coming and software designed for fewer cores usually doesn't work on more cores without substantial redesign. For a taste of the issues take a look at No new global mutexes! (and how to make the thread/connection pool work), which shows some of the difficulties of making MySQL perform on SMP servers.

    In this paper, Richard Smith, a –Staff Engineer at Sun, goes into some nice detail on multi-core issues. His take home lessons are:

  • Use fine-grained locking or lock-free strategy
  • Document the strategy, including correctness
    criteria (invariants)
  • Keep critical sections short
  • Profile the code at both light and heavy load
  • Collect HW performance counter data
  • Identify bottleneck resource (there's always at least one!)
  • Eliminate or ameliorate it

  • Paper: Scalability Design Patterns

    I have introduced pattern languages in my earlier post on The Pattern Bible for Distributed Computing.

    Achieving highest possible scalability is a complex combination of many factors. This PLoP 2007 paper presents a pattern language that can be used to make a system highly scalable.

    The Scalability Pattern Language introduced by Kanwardeep Singh Ahluwalia includes patterns to:

    • Introduce Scalability
    • Optimize Algorithm
    • Add Hardware
    • Add Parallelism
      • Add Intra-Process Parallelism
      • Add Inter-Porcess Parallelism
      • Add Hybrid Parallelism
    • Optimize Decentralization
    • Control Shared Resources
    • Automate Scalability

    Paper: Scaling Genome Sequencing - Complete Genomics Technology Overview

    Although the problem of scaling human genome sequencing is not exactly about building bigger, faster and more reliable websites it is most interesting in terms of scalability. The paper describes a new technology by the startup company Complete Genomics to sequence the full human genome for the fraction of the cost of earlier possibilities.

    Complete Genomics is building the world’s largest commercial human genome sequencing center to provide turnkey, outsourced complete human genome sequencing to customers worldwide.
    By 2010, their data center will contain approximately 60,000 processors with 30 petabytes of storage running their sequencing software on Linux clusters.

    Do you find this interesting and relevant to HighScalability.com?

    Todd Hoff's picture

    Paper: Sharding with Oracle Database

    The upshot of the paper is Oracle rules and MySQL sucks for sharding. Which is technically probable, if you don't throw in minor points like cost and ease of use. The points where they think Oracle wins: online schema changes, more robust replication, higher availability, better corruption handling, better use of large RAM and multiple cores, better and better tested partitioning features, better monitoring, and better gas mileage.

    Todd Hoff's picture

    Paper: Spamalytics: An Empirical Analysisof Spam Marketing Conversion

    Under the philosophy that the best method to analyse spam is to become a spammer, this absolutely fascinating paper recounts how a team of UC Berkely researchers went under cover to infiltrate a spam network. Part CSI, part Mission Impossible, and part MacGyver, the team hijacked the botnet so that their code was actually part of the dark network itself. Once inside they figured out the architecture and protocols of the botnet and how many sales they were able to tally. Truly elegant work.

    Two different spam campaigns were run on a Storm botnet network of 75,800 zombie computers. Storm is a peer-to-peer botnet that uses spam to creep its tentacles through the world wide computer network. One of the campains distributed viruses in order to recruit new bots into the network. This is normally accomplished by enticing people to download email attachments. An astonishing one in ten people downloaded the executable and ran it, which means we won't run out of zombies soon. The downloaded components include: Backdoor/downloader, SMTP relay, E-mail address stealer, E-mail virus spreader, Distributed denial of service (DDos) attack tool, pdated copy of Storm Worm dropper. The second campaign sent pharmacuticle spam ("libido boosting herbal remedy”) over the network.

    Haven't you always wondered who clicks on spam and how much could spammers possibly make? In the study only 28 sales resulted from 350 million spam e-mail messages sent over 26 days. A conversion rate of well under 0.00001% (typical advertising campaign might have a conversion of 2-3%). The average purchase price was about $100 for $2,731.88 in total revenue. The reserchers estimate total daily revenue attributable to Storm’s pharmacy campaign is about $7000 and that they pick up between 3500 and 8500 new bots per day through their Trojan distribution system. And this is with only 1.5% of the entire network in use.

    So, the spammers would take in total revenue about $3.5 million a year from one product from one network. Imagine the take with multiple products and multiple networks? That's why we still have spam. And since the conversion rate is already so low, it seems spam will always be with us.

    As fascinating as all the spamonomics are, the explanation of the botnet architecture is just as fascinating. Storm uses a three-level self-organizing hierarchy pictured here:

    Todd Hoff's picture

    Paper: Standardizing Storage Clusters (with pNFS)

    pNFS (parallel NFS) is the next generation of NFS and its main claim to fame is that it's clustered, which "enables clients to directly access file data spread over multiple storage servers in parallel. As a result, each client can leverage the full aggregate bandwidth of a clustered storage service at the granularity of an individual file." About pNFS StorageMojo says: pNFS is going to commoditize parallel data access. In 5 years we won’t know how we got along without it. Something to watch.

    Todd Hoff's picture

    Paper: The Clustered Storage Revolution

    If the clustered file system, clustered storage system, storage virtualization movement is new to you then this is a good intro paper. I's a both vendor puff piece and informative, so it might be worth your time.

    A Quick Hit of What's Inside

    Clustered storage architectures have the ability to pull together two or more storage devices to behave as a single entity. Clustered storage can be broken down into three types:
    * 2-way simple failover clustering
    * Namespace aggregation
    * Clustered storage with a distributed file systems (DFS)

    Todd Hoff's picture

    Paper: The End of an Architectural Era (It’s Time for a Complete Rewrite)

    Update 3: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks. Although the process to load data into and tune the execution of parallel DBMSs took much longer than the MR system, the observed performance of these DBMSs was strikingly better.
    Update 2: H-Store: A Next Generation OLTP DBMS is the project implementing the ideas in this paper: The goal of the H-Store project is to investigate how these architectural and application shifts affect the performance of OLTP databases, and to study what performance benefits would be possible with a complete redesign of OLTP systems in light of these trends. Our early results show that a simple prototype built from scratch using modern assumptions can outperform current commercial DBMS offerings by around a factor of 80 on OLTP workloads.
    Update: interesting related thread on Lamda the Ultimate.

    A really fascinating paper bolstering many of the anti-RDBMS threads the have popped up on the intertube lately. The spirit of the paper is found in the following excerpt:

    In summary, the current RDBMSs were architected for the business data processing market in a time of different user interfaces and different hardware characteristics. Hence, they all include the following System R architectural features:
    * Disk oriented storage and indexing structures
    * Multithreading to hide latency
    * Locking-based concurrency control mechanisms
    * Log-based recovery

    Todd Hoff's picture

    Paper: Understanding and Building High Availability/Load Balanced Clusters

    A superb explanation by Theo Schlossnagle of how to deploy a high availability load balanced system using mod backhand and Wackamole. The idea is you don't need to buy expensive redundant hardware load balancers, you can make use of the hosts you already have to the same effect. The discussion of using peer-based HA solutions versus a single front-end HA device is well worth the read. Another interesting perspective in the document is to view load balancing as a resource allocation problem. There's also a nice discussion of the negative of effect of keep-alives on performance.

    Paper: Understanding and Designing New Server Architectures for Emerging Warehouse-Computing Environments

    Authors:
    Kevin Lim
    Parthasarathy Ranganathan
    Jichuan Chang
    Chandrakant Patel
    Trevor Mudge
    Steven Reinhardt

    This International Symposium on Computer Architecture paper seeks to understand and design next-generation servers for emerging "warehouse-computing" environments. We make two key contributions. First, we put together a detailed evaluation infrastructure including a new benchmark suite for warehouse-computing workloads, and detailed performance, cost, and power models, to quantitatively characterize bottlenecks. Second, we study a new solution that incorporates volume non-server-class components in novel packaging solutions, with memory sharing and flash-based disk caching. Our results show that this approach has promise, with a 2X improvement on average in performance-per-dollar for our benchmark suite.

    Todd Hoff's picture

    Paper: Wikipedia's Site Internals, Configuration, Code Examples and Management Issues

    Wikipedia and Wikimedia have some of the best, most complete real-world documentation on how to build highly scalable systems. This paper by Domas Mituzas covers a lot of details about how Wikipedia works, including: an overview of the different packages used (Linux, PowerDNS, LVS, Squid, lighttpd, Apache, PHP5, Lucene, Mono, Memcached), how they use their CDN, how caching works, how they profile their code, how they store their media, how they structure their database access, how they handle search, how they handle load balancing and administration. All with real code examples and examples of configuration files. This is a really useful resource.