| Paper |
Abstract—This paper presents the architecture and
characteristics of a memory database intended to be used as a
cache engine for web applications. Primary goals of this database
are speed and efficiency while running on SMP systems with
several CPU cores (four and more). A secondary goal is the
support for simple metadata structures associated with cached
data that can aid in efficient use of the cache. Due to these goals,
some data structures and algorithms normally associated with
this field of computing needed to be adapted to the new
environment.
Looks interesting...
Abstract:
Today’s data centers may contain tens of thousands of computers with significant aggregate bandwidth requirements. The network architecture typically consists of a tree of routing and switching elements with progressively more specialized and expensive equipment moving up the network hierarchy. Unfortunately, even when deploying the highest-end IP switches/routers, resulting topologies may only support 50% of the aggregate bandwidth available at the edge of the network, while still incurring tremendous cost. Nonuniform bandwidth among data center nodes complicates application design and limits overall system performance.
In this paper, we show how to leverage largely commodity Ethernet switches to support the full aggregate bandwidth of clusters consisting of tens of thousands of elements. Similar to how clusters of commodity computers have largely replaced more specialized SMPs and MPPs, we argue that appropriately architected and interconnected commodity switches may deliver more performance at less cost than available from today’s higher-end solutions. Our approach requires no modifications to the end host network interface, operating system, or applications; critically, it is fully backward compatible with Ethernet, IP, and TCP.
Data mining and fast queries are always in that bin of hard to do things where doing something smarter can yield big results. Bloom Filters are one such do it smarter strategy, compressed bitmap indexes are another. In one application "FastBit outruns other search indexes by a factor of 10 to 100 and doesn’t require much more room than the original data size." The data size is an interesting metric. Our old standard b-trees can be two to four times larger than the original data. In a test searching an Enron email database FastBit outran MySQL by 10 to 1,000 times.
FastBit is a software tool for searching large read-only datasets. It organizes user data in a column-oriented structure which is efficient for on-line analytical processing (OLAP), and utilizes compressed bitmap indices to further speed up query processing. Analyses have proven the compressed bitmap index used in FastBit to be theoretically optimal for one-dimensional queries. Compared with other optimal indexing methods, bitmap indices are superior because they can be efficiently combined to answer multi-dimensional queries whereas other optimal methods can not.
It's not all just map-reduce and add more servers until your attic is full.
Flickr's lone database guy Dathan Pattishall made his excellent presentation available on how on how Flickr scales its backend to handle tremendous loads. Some of this information is available in Flickr Architecture, but the paper is so good it's worth another read. If you want to see sharding done right, at scale, take a look.
This is an unusually well written and useful paper. It talks in detail about experiences implementing a complex project, something we don't see very often. They shockingly even admit that creating a working implementation of Paxos was more difficult than just translating the pseudo code. Imagine that, programmers aren't merely typists! I particularly like the explanation of the Paxos algorithm and why anyone would care about it, working with disk corruption, using leases to support simultaneous reads, using epoch numbers to indicate a new master election, using snapshots to prevent unbounded logs, using MultiOp to implement database transactions, how they tested the system, and their openness with the various problems they had. A lot to learn here.
From the paper:
We describe our experience building a fault-tolerant data-base using the Paxos consensus algorithm.
Despite the existing literature in the field, building such a database proved to be non-trivial. We describe
selected algorithmic and engineering problems encountered, and the solutions we found for them. Our
measurements indicate that we have built a competitive system.
How do we debug and profile a cloud full of processors and threads? It's a problem more will be seeing as we code big scary programs that run on even bigger scarier clouds. Logging gets you far, but sometimes finding the root cause of problem requires delving deep into a program's execution. I don't know about you, but setting up 200,000+ gdb instances doesn't sound all that appealing. Tools like STAT (Stack Trace Analysis Tool) are being developed to help with this huge task. STAT "gathers and merges stack traces from a parallel application’s processes." So STAT isn't a low level debugger, but it will help you find the needle in a million haystacks.
Abstract:
Petascale systems will present several new challenges to performance and correctness tools. Such machines may contain millions of cores, requiring that tools use scalable data structures and analysis algorithms to collect and to process application data. In addition, at such scales, each tool itself will become a large parallel application – already, debugging the full BlueGene/L (BG/L) installation at the Lawrence Livermore National Laboratory requires employing 1664 tool daemons. To reach such sizes and beyond, tools must use a scalable communication infrastructure and manage their own tool processes efficiently. Some system resources, such as the file system, may also become tool bottlenecks.In this paper, we present challenges to petascale tool development, using the Stack Trace Analysis Tool (STAT) as a case study. STAT is a lightweight tool that gathers and merges stack traces from a parallel application to identify process equivalence classes. We use results gathered at thousands of tasks on an Infiniband cluster and results up to 208K processes on BG/L to identify current scalability issues as well as challenges that will be faced at the petascale. We then present implemented solutions to these challenges and show the resulting performance improvements. We also discuss future plans to meet the debugging demands of petascale machines.
The article describes the basic architecture of a connection-oriented NIO-based java server. It takes a look at a preferred threading model, Java Non-blocking I/O and discusses the basic components of such a server.
Comet has popularized asynchronous non-blocking HTTP programming, making it practically indistinguishable from reverse Ajax, also known as server push. This JavaWorld article takes a wider view of asynchronous HTTP, explaining its role in developing high-performance HTTP proxies and non-blocking HTTP clients, as well as the long-lived HTTP connections associated with Comet.
Abstract: When designing distributed web services, there are three properties that are commonly desired: consistency, availability, and partition tolerance. It is impossible to achieve all three. In this note, we prove this conjecture in the asynchronous network model, and then discuss solutions to this dilemma in the partially synchronous model.
Update:Barbara Liskov’s Turing Award, and Byzantine Fault Tolerance.
Henry Robinson has created an excellent series of articles on consensus protocols. We already covered his 2 Phase Commit article and he also has a 3 Phase Commit article showing how to handle 2PC under single node failures.
But that is not enough! 3PC works well under node failures, but fails for network failures. So another consensus mechanism is needed that handles both network and node failures. And that's Paxos.
Paxos correctly handles both types of failures, but it does this by becoming inaccessible if too many components fail. This is the "liveness" property of protocols. Paxos waits until the faults are fixed. Read queries can be handled,
but updates will be blocked until the protocol thinks it can make forward progress.
The liveness of Paxos is primarily dependent on network stability. In a distributed heterogeneous environment you are at risk of losing the ability to make updates. Users hate that.
So when companies like Amazon do the seemingly insane thing of creating eventually consistent databases, it should be a little easier to understand now. Partitioning is required for scalability. Partitioning brings up these nasty consensus issues. Not being able to write under partition failures is unacceptable. Therefor create a system that can always write and work on consistency when all the downed partitions/networks are repaired.
Henry Robinson has created an excellent series of articles on consensus protocols. Henry starts with a very useful discussion of what all this talk about consensus really means: The consensus problem is the problem of getting a set of nodes in a distributed system to agree on something - it might be a value, a course of action or a decision. Achieving consensus allows a distributed system to act as a single entity, with every individual node aware of and in agreement with the actions of the whole of the network.
In this article Henry tackles Two-Phase Commit, the protocol most databases use to arrive at a consensus for database writes. The article is very well written with lots of pretty and informative pictures. He did a really good job.
In conclusion we learn 2PC is very efficient, a minimal number of messages are exchanged and latency is low. The problem is when a co-ordinator fails availability is dramatically reduced. This is why 2PC isn't generally used on highly distributed systems. To solve that problem we have to move on to different algorithms and that is the subject of other articles.
Consistent hashing is one of those ideas that really puts the science in computer science and reminds us why all those really smart people spend years slaving over algorithms. Consistent hashing is "a scheme that provides hash table functionality in a way that the addition or removal of one slot does not significantly change the mapping of keys to slots" and was originally a way of distributing requests among a changing population of web servers. My first reaction to the idea was "wow, that's really smart" and I sadly realized I would never come up with something so elegant. I then immediately saw applications for it everywhere. And consistent hashing is used everywhere: distributed hash tables, overlay networks, P2P, IM, caching, and CDNs. Here's the abstract from the original paper and after the abstract are some links to a few very good articles with accessible explanations of consistent hashing and its applications in the real world.
One stumbling block of the the great march towards virtualization is the relatively poor performance of resource hungry applications like databases. We are told to develop and test using VMs, but deploy without them. Which kind of sucks IMHO. Maybe better virtualization technology can remove this split. This paper talks about a different approach to virtualization called "container-based" virtualization that can reportedly double the performance of traditional hypervisor systems like Xen. It does this by trading isolation for efficiency. Rather than maintaining complete isolation between VMs the container approach shares resources between VMs and thus gives higher performance while still guaranteeing strong fault, resource, and security isolation. It's yet another battle in computing's endless war of creating and destroying abstraction layers. I learned a lot from from this paper because of how it compared and contrasted traditional hypervisor and container based virtualization strategies. Good job.
A very detailed (339 pages) paper on how to use HP products to create a highly available cluster. It's somewhat dated and obviously concentrates on HP products, but it is still good information.
Table of contents:
1. Disaster Tolerance and Recovery in a Serviceguard Cluster
2. Building an Extended Distance Cluster Using ServiceGuard
3. Designing a Metropolitan Cluster
4. Designing a Continental Cluster
5. Building Disaster-Tolerant Serviceguard Solutions Using Metrocluster with Continuous Access XP
6. Building Disaster Tolerant Serviceguard Solutions Using Metrocluster with EMC SRDF
7. Cascading Failover in a Continental Cluster
Evaluating the Need for Disaster Tolerance
What is a Disaster Tolerant Architecture?
Types of Disaster Tolerant Clusters
Extended Distance Clusters
Metropolitan Cluster
Continental Cluster
Continental Cluster With Cascading Failover
Disaster Tolerant Architecture Guidelines
Protecting Nodes through Geographic Dispersion
Protecting Data through Replication
Using Alternative Power Sources
Creating Highly Available Networking
Disaster Tolerant Cluster Limitations
Managing a Disaster Tolerant Environment
Using this Guide with Your Disaster Tolerant Cluster Products
2. Building an Extended Distance Cluster Using ServiceGuard
Types of Data Link for Storage and Networking
Two Data Center Architecture
Two Data Center FibreChannel Implementations
Advantages and Disadvantages of a Two-Data-Center Architecture
Three Data Center Architectures
Rules for Separate Network and Data Links
Guidelines on DWDM Links for Network and Data
3. Designing a Metropolitan Cluster
Designing a Disaster Tolerant Architecture for use with Metrocluster Products
Single Data Center
Two Data Centers and Third Location with Arbitrator(s)
Additional EMC SRDF Configurations
Setting up Hardware for 1 by 1 Configurations
Setting up Hardware for M by N Configurations
Worksheets
Disaster Tolerant Checklist
Cluster Configuration Worksheet
Package Configuration Worksheet
Next Steps
4. Designing a Continental Cluster
Understanding Continental Cluster Concepts
Mutual Recovery Configuration
Application Recovery in a Continental Cluster
Monitoring over a Wide Area Network
Cluster Events
Interpreting the Significance of Cluster Events
How Notifications Work
Alerts
Alarms
Creating Notifications for Failure Events
Creating Notifications for Events that Indicate a Return of Service
Performing Cluster Recovery
Notes on Packages in a Continental Cluster
How Serviceguard commands work in a Continentalcluster
Designing a Disaster Tolerant Architecture for use with Continentalclusters
Mutual Recovery
Serviceguard Clusters
Data Replication
Highly Available Wide Area Networking
Data Center Processes
Continentalclusters Worksheets
Preparing the Clusters
Setting up and Testing Data Replication
Configuring a Cluster without Recovery Packages
Configuring a Cluster with Recovery Packages
Building the Continentalclusters Configuration
Preparing Security Files
Creating the Monitor Package
Editing the Continentalclusters Configuration File
Checking and Applying the Continentalclusters Configuration
Starting the Continentalclusters Monitor Package
Validating the Configuration
Documenting the Recovery Procedure
Reviewing the Recovery Procedure
Testing the Continental Cluster
Testing Individual Packages
Testing Continentalclusters Operations
Switching to the Recovery Packages in Case of Disaster
Receiving Notification
Verifying that Recovery is Needed
Using the Recovery Command to Switch All Packages
How the cmrecovercl Command Works
Forcing a Package to Start
Restoring Disaster Tolerance
Restore Clusters to their Original Roles
Primary Packages Remain on the Surviving Cluster
Primary Packages Remain on the Surviving Cluster using cmswitchconcl
Newly Created Cluster Will Run Primary Packages
Newly Created Cluster Will Function as Recovery Cluster for All Recovery Groups
Maintaining a Continental Cluster
Adding a Node to a Cluster or Removing a Node from a Cluster
Adding a Package to the Continental Cluster
Removing a Package from the Continental Cluster
Changing Monitoring Definitions
Checking the Status of Clusters, Nodes, and Packages
Reviewing Messages and Log Files
Deleting a Continental Cluster Configuration
Renaming a Continental Cluster
Checking Java File Versions
Next Steps
Support for Oracle RAC Instances in a Continentalclusters Environment
Configuring the Environment for Continentalclusters to Support Oracle RAC
Initial Startup of Oracle RAC Instance in a Continentalclusters Environment
Failover of Oracle RAC Instances to the Recovery Site
Failback of Oracle RAC Instances After a Failover
5. Building Disaster-Tolerant Serviceguard Solutions Using Metrocluster with Continuous Access XP
Files for Integrating XP Disk Arrays with Serviceguard Clusters
Overview of Continuous Access XP Concepts
PVOLs and SVOLs
Device Groups and Fence Levels
Creating the Cluster
Preparing the Cluster for Data Replication
Creating the RAID Manager Configuration
Defining Storage Units
Configuring Packages for Disaster Recovery
Completing and Running a Metrocluster Solution with Continuous Access XP
Maintaining a Cluster that uses Metrocluster/CA
XP/CA Device Group Monitor
Completing and Running a Continental Cluster Solution with Continuous Access XP
Setting up a Primary Package on the Primary Cluster
Setting up a Recovery Package on the Recovery Cluster
Setting up the Continental Cluster Configuration
Switching to the Recovery Cluster in Case of Disaster
Failback Scenarios
Maintaining the Continuous Access XP Data Replication Environment
6. Building Disaster Tolerant Serviceguard Solutions Using Metrocluster with EMC SRDF
Files for Integrating ServiceGuard with EMC SRDF
Overview of EMC and SRDF Concepts
Preparing the Cluster for Data Replication
Installing the Necessary Software
Building the Symmetrix CLI Database
Determining Symmetrix Device Names on Each Node
Building a Metrocluster Solution with EMC SRDF
Setting up 1 by 1 Configurations
Grouping the Symmetrix Devices at Each Data Center
Setting up M by N Configurations
Configuring Serviceguard Packages for Automatic Disaster Recovery
Maintaining a Cluster that Uses Metrocluster/SRDF
Managing Business Continuity Volumes
R1/R2 Swapping
Building a Continental Cluster Solution with EMC SRDF
Setting up a Primary Package on the Primary Cluster
Setting up a Recovery Package on the Recovery Cluster
Setting up the Continental Cluster Configuration
Switching to the Recovery Cluster in Case of Disaster
Failback Scenarios
Maintaining the EMC SRDF Data Replication Environment
R1/R2 Swapping
7. Cascading Failover in a Continental Cluster
Overview
Symmetrix Configuration
Using Template Files
Data Storage Setup
Setting Up Symmetrix Device Groups
Setting up Volume Groups
Testing the Volume Groups
Primary Cluster Package Setup
Recovery Cluster Package Setup
Continental Cluster Configuration
Data Replication Procedures
Data Initialization Procedures
Data Refresh Procedures in the Steady State
Data Replication in Failover and Failback Scenarios
Update 2: Read/WriteWeb has a good article talking about the scalability issues of relational databases and how Dynamo solves them: Amazon Dynamo: The Next Generation Of Virtual Distributed Storage. But since Dynamo is just another frustrating walled garden protected by barbed wire and guard dogs, its relevance is somewhat overstated.
Update: Greg Linden has a take on the paper where he questions some of Amazon's design choices: emphasizing write availability over fast reads, a lack of indexing support, use of random distribution for load balancing, and punting on some scalability issues.
Werner Vogels, Amazon's avuncular CTO, just announced a new paper on the internal database technology Amazon uses to handle tens of millions customers. I'll dive into more details later, but I thought you'd want to read it hot off the blog. The bad news is it won't be a service. They are keeping this tech not so secret, but very safe. Happily, it's another real-life example to learn from. As many top websites use a highly tuned key-value database at their core instead of a RDBMS, it's an important technology to understand.
From the abstract you can get a feel for what the paper is about:
At the core of the new real-time web, which is really really old, are continuous queries. I like how this paper proposed to handle dynamic demand and dynamic resource availability by making the underlying system adaptable, which seems like a very cloudy kind of thing to do.
Abstract:
The long-running nature of continuous queries poses new scalability challenges for dataflow processing. CQ systems execute pipelined dataflows that may be shared across multiple
queries. The scalability of these dataflows is limited by their constituent, stateful operators – e.g. windowed joins or grouping operators. To scale such operators, a natural solution is to partition them across a shared-nothing platform. But in the CQ context, traditional, static techniques
for partitioned parallelism can exhibit detrimental imbalances as workload and runtime conditions evolve. Longrunning CQ dataflows must continue to function robustly in
the face of these imbalances. To address this challenge, we introduce a dataflow operator
called Flux that encapsulates adaptive state partitioning and dataflow routing. Flux is placed between producerconsumer stages in a dataflow pipeline to repartition stateful operators while the pipeline is still executing. We present the Flux architecture, along with repartitioning policies that can be used for CQ operators under shifting processing and memory loads. We show that the Flux mechanism and
these policies can provide several factors improvement in throughput and orders of magnitude improvement in average latency over the static case
I found the discussion of the available bandwidth of tree vs higher dimensional virtual networks topologies quite, to quote Spock, fascinating:
A mathematical analysis by Ritter (2002) (one of the original developers
of Napster) presented a detailed numerical argument demonstrating that the
Gnutella network could not scale to the capacity of its competitor, the
Napster network. Essentially, that model showed that the Gnutella network is
severely bandwidth-limited long before the P2P population reaches a million
peers. In each of these previous studies, the conclusions have overlooked the
intrinsic bandwidth limits of the underlying topology in the Gnutella network:
a Cayley tree (Rains and Sloane 1999) (see Sect. 9.4 for the definition).
Trees are known to have lower aggregate bandwidth than higher dimensional
topologies, e.g., hypercubes and hypertori. Studies of interconnection
topologies in the literature have tended to focus on hardware implementations
(see, e.g., Culler et al. 1996; Buyya 1999), which are generally limited
by the cost of the chips and wires to a few thousand nodes. P2P networks,
on the other hand, are intended to support from hundreds of thousands to
millions of simultaneous peers, and since they are implemented in software,
hyper-topologies are relatively unfettered by the economics of hardware.
In this chapter, we analyze the scalability of several alternative topologies
and compare their throughput up to 2–3 million peers. The virtual hypercube
and the virtual hypertorus offer near-linear scalable bandwidth subject to
the number of peer TCP/IP connections that can be simultaneously kept
open.
Relational databases, document databases, and distributed hash tables get most of the hype these days, but there's another option: graph databases. Back to the future it seems. Here's a really interesting paper by Marko A. Rodriguez introducing the graph model and it's extension to representing the world wide web of data.
Modern day open source and commercial graph databases can store on the order of 1 billion relationships with some databases reaching the 10 billion mark. These developments are making the graph database practical for applications that require large-scale knowledge structures. Moreover, with
the Web of Data standards set forth by the Linked Data community, it is possible to interlink graph databases across the web into a giant global knowledge structure. This talk will discuss graph databases, their underlying data model, their querying mechanisms, and the benefits of the graph data structure for modeling and analysis.
This paper is behind a registration-wall, you can't do anything on the MySQL site without filling out a form of some kind, but it's a short, decent introduction to using MySQL for a good sized website.
Scale-out vs. Scale Up, Customers using MySQL, Scale-Out Reference Architecture
This paper is a great overview of different lightweight web servers. A lot of websites use lightweight web servers to serve images and static content. YouTube is one example: http://highscalability.com/youtube-architecture.
So if you need to improve performance consider changing over a different web server for some types of content.
Update: MapReduce and PageRank Notes from Remzi Arpaci-Dusseau's Fall 2008 class . Collects interesting facts about MapReduce and PageRank. For example, the history of the solution to searching for the term "flu" is traced through multiple generations of technology.
With Google entering the cloud space with Google AppEngine and a maturing Hadoop product, the MapReduce scaling approach might finally become a standard programmer practice. This is the best paper on the subject and is an excellent primer on a content-addressable memory future.
Some interesting stats from the paper: Google executes 100k MapReduce jobs each day; more than 20 petabytes of data are processed per day; more than 10k MapReduce programs have been implemented; machines are dual processor with gigabit ethernet and 4-8 GB of memory.
One common criticism ex-Googlers have is that it takes months to get up and be productive in the Google environment. Hopefully a way will be found to lower the learning curve and make programmers more productive faster.
From the abstract:
Eventually every database system hit its limits. Especially
on the Internet, where you have millions of users
which theoretically access your database simultaneously,
eventually your IO system will be a bottleneck. [A] promising but more complex solution with nearly no scale-out limits is application partitioning. If
and when you get into the top-1000 rank on alexa [1], you have to think about such solutions.
Horizontal application partitioning, Vertical application partitioning, Disk IO calculations, How to partition an entity
How do we scale datacenters? Should we build a few mammoth million machine datacenters or many smaller micro datacenters? Intuitively we usually go with a bigger is better economies of scale type argument, but it may not be so. What works for Walmart may not work for White Box World. Mega datacenters may actually exhibit diseconomies of scale. It may be better to run applications over many distributed micro datacenters instead of one large one.
This paper by Ken Church, Albert Greenberg, and James Hamilton, all from Microsoft, takes a look at the different issues and concludes:
Putting it all together, the micro model offers a design point with attractive performance, reliability, scale and cost. Given how much the industry is currently investing in the mega model, the industry would do well to consider the micro alternative.
Greg Linden links to a heavily lesson ladened LISA 2007 paper titled On Designing and Deploying Internet-Scale Services by James Hamilton of the Windows Live Services Platform group. I know people crave nitty-gritty details, but this isn't a how to configure a web server article. It hitches you to a rocket and zooms you up to 50,000 feet so you can take a look at best web operations practices from a broad, yet practical perspective. The author and his team of contributors obviously have a lot of in the trenches experience. Many non-obvious topics are covered. And there's a lot to learn from.
The paper has too many details to cover here, but the big sections are:
In the recommendations we see some of our old favorites:
To scale in the large you have to partition. Data has to be spread around, replicated, and kept consistent (keeping replicas sufficiently similar to one another despite operations being submitted
independently at different sites). The result is a highly available, well performing, and scalable system.
Partitioning is required, but it's a pain to do efficiently and correctly. Until Quantum teleportation becomes a reality how data is kept consistent across a bewildering number of failure scenarios is a key design decision.
This excellent paper by Yasushi Saito and Marc Shapiro takes us on a wild ride (OK, maybe not so wild) of different approaches to achieving consistency.
What's cool about this paper is they go over some real systems that we are familiar with and cover how they work: DNS (single-master, state-transfer), Usenet (multi-master), PDAs (multi-master, state-transfer, manual or application-specific conflict resolution), Bayou (multi-master, operation-transfer, epidemic propagation, application conflict resolution), CVS (multi-master operation-transfer, centralized, manual conflict resolution).
The paper then goes on to explain in detail the different approaches to achieving consistency. Most of us will never have to write the central nervous system of an application like this, but knowing about the different approaches and tradesoffs is priceless.
The abstract:
Yahoo has developed a new language called Pig Latin that fit in a sweet spot between high-level declarative querying in the spirit of SQL, and low-level, procedural programming `a la map-reduce and combines best of both worlds.
The accompanying system, Pig, is fully implemented, and compiles Pig Latin into physical plans that are executed over Hadoop, an open-source, map-reduce implementation. Pig has just graduated from the Apache Incubator and joined Hadoop as a subproject.
The paper has a few examples of how engineers at Yahoo! are using Pig to dramatically reduce the time required for the development and execution of their data analysis tasks, compared to
using Hadoop directly.
References: Apache Pig Wiki
An excellent article by Bryan Cantrill and Jeff Bonwick on how to write multi-threaded code. With more processors and no magic bullet solution for how to use them, knowing how to write multiprocessor code that doesn't screw up your system is still a valuable skill. Some topics:
While I don't agree that code using locks can be made composable, this articles covers a lot of very useful nitty-gritty details that will up your expert rating a couple points.
Replication Under Scalable Hashing:
A Family of Algorithms for Scalable Decentralized Data Distribution
Typical algorithms for decentralized data distribution
work best in a system that is fully built before it first used;
adding or removing components results in either extensive
reorganization of data or load imbalance in the system.
We have developed a family of decentralized algorithms,
RUSH (Replication Under Scalable Hashing), that
maps replicated objects to a scalable collection of storage
servers or disks. RUSH algorithms distribute objects to
servers according to user-specified server weighting. While
all RUSH variants support addition of servers to the system,
different variants have different characteristics with
respect to lookup time in petabyte-scale systems, performance
with mirroring (as opposed to redundancy codes),
and storage server removal. All RUSH variants redistribute
as few objects as possible when new servers are
added or existing servers are removed, and all variants
guarantee that no two replicas of a particular object are
ever placed on the same server. Because there is no central
directory, clients can compute data locations in parallel,
allowing thousands of clients to access objects on thousands
of servers simultaneously.
The multi-cores are coming and software designed for fewer cores usually doesn't work on more cores without substantial redesign. For a taste of the issues take a look at No new global mutexes! (and how to make the thread/connection pool work), which shows some of the difficulties of making MySQL perform on SMP servers.
In this paper, Richard Smith, a Staff Engineer at Sun, goes into some nice detail on multi-core issues. His take home lessons are:
I have introduced pattern languages in my earlier post on The Pattern Bible for Distributed Computing.
Achieving highest possible scalability is a complex combination of many factors. This PLoP 2007 paper presents a pattern language that can be used to make a system highly scalable.
The Scalability Pattern Language introduced by Kanwardeep Singh Ahluwalia includes patterns to:
Although the problem of scaling human genome sequencing is not exactly about building bigger, faster and more reliable websites it is most interesting in terms of scalability. The paper describes a new technology by the startup company Complete Genomics to sequence the full human genome for the fraction of the cost of earlier possibilities.
Complete Genomics is building the world’s largest commercial human genome sequencing center to provide turnkey, outsourced complete human genome sequencing to customers worldwide.
By 2010, their data center will contain approximately 60,000 processors with 30 petabytes of storage running their sequencing software on Linux clusters.
Do you find this interesting and relevant to HighScalability.com?
The upshot of the paper is Oracle rules and MySQL sucks for sharding. Which is technically probable, if you don't throw in minor points like cost and ease of use. The points where they think Oracle wins: online schema changes, more robust replication, higher availability, better corruption handling, better use of large RAM and multiple cores, better and better tested partitioning features, better monitoring, and better gas mileage.
Under the philosophy that the best method to analyse spam is to become a spammer, this absolutely fascinating paper recounts how a team of UC Berkely researchers went under cover to infiltrate a spam network. Part CSI, part Mission Impossible, and part MacGyver, the team hijacked the botnet so that their code was actually part of the dark network itself. Once inside they figured out the architecture and protocols of the botnet and how many sales they were able to tally. Truly elegant work.
Two different spam campaigns were run on a Storm botnet network of 75,800 zombie computers. Storm is a peer-to-peer botnet that uses spam to creep its tentacles through the world wide computer network. One of the campains distributed viruses in order to recruit new bots into the network. This is normally accomplished by enticing people to download email attachments. An astonishing one in ten people downloaded the executable and ran it, which means we won't run out of zombies soon. The downloaded components include: Backdoor/downloader, SMTP relay, E-mail address stealer, E-mail virus spreader, Distributed denial of service (DDos) attack tool, pdated copy of Storm Worm dropper. The second campaign sent pharmacuticle spam ("libido boosting herbal remedy”) over the network.
Haven't you always wondered who clicks on spam and how much could spammers possibly make? In the study only 28 sales resulted from 350 million spam e-mail messages sent over 26 days. A conversion rate of well under 0.00001% (typical advertising campaign might have a conversion of 2-3%). The average purchase price was about $100 for $2,731.88 in total revenue. The reserchers estimate total daily revenue attributable to Storm’s pharmacy campaign is about $7000 and that they pick up between 3500 and 8500 new bots per day through their Trojan distribution system. And this is with only 1.5% of the entire network in use.
So, the spammers would take in total revenue about $3.5 million a year from one product from one network. Imagine the take with multiple products and multiple networks? That's why we still have spam. And since the conversion rate is already so low, it seems spam will always be with us.
As fascinating as all the spamonomics are, the explanation of the botnet architecture is just as fascinating. Storm uses a three-level self-organizing hierarchy pictured here:

pNFS (parallel NFS) is the next generation of NFS and its main claim to fame is that it's clustered, which "enables clients to directly access file data spread over multiple storage servers in parallel. As a result, each client can leverage the full aggregate bandwidth of a clustered storage service at the granularity of an individual file." About pNFS StorageMojo says: pNFS is going to commoditize parallel data access. In 5 years we won’t know how we got along without it. Something to watch.
If the clustered file system, clustered storage system, storage virtualization movement is new to you then this is a good intro paper. I's a both vendor puff piece and informative, so it might be worth your time.
Clustered storage architectures have the ability to pull together two or more storage devices to behave as a single entity. Clustered storage can be broken down into three types:
* 2-way simple failover clustering
* Namespace aggregation
* Clustered storage with a distributed file systems (DFS)
Update 3: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks. Although the process to load data into and tune the execution of parallel DBMSs took much longer than the MR system, the observed performance of these DBMSs was strikingly better.
Update 2: H-Store: A Next Generation OLTP DBMS is the project implementing the ideas in this paper: The goal of the H-Store project is to investigate how these architectural and application shifts affect the performance of OLTP databases, and to study what performance benefits would be possible with a complete redesign of OLTP systems in light of these trends. Our early results show that a simple prototype built from scratch using modern assumptions can outperform current commercial DBMS offerings by around a factor of 80 on OLTP workloads.
Update: interesting related thread on Lamda the Ultimate.
A really fascinating paper bolstering many of the anti-RDBMS threads the have popped up on the intertube lately. The spirit of the paper is found in the following excerpt:
In summary, the current RDBMSs were architected for the business data processing market in a time of different user interfaces and different hardware characteristics. Hence, they all include the following System R architectural features:
* Disk oriented storage and indexing structures
* Multithreading to hide latency
* Locking-based concurrency control mechanisms
* Log-based recovery
A superb explanation by Theo Schlossnagle of how to deploy a high availability load balanced system using mod backhand and Wackamole. The idea is you don't need to buy expensive redundant hardware load balancers, you can make use of the hosts you already have to the same effect. The discussion of using peer-based HA solutions versus a single front-end HA device is well worth the read. Another interesting perspective in the document is to view load balancing as a resource allocation problem. There's also a nice discussion of the negative of effect of keep-alives on performance.
Authors:
Kevin Lim
Parthasarathy Ranganathan
Jichuan Chang
Chandrakant Patel
Trevor Mudge
Steven Reinhardt
This International Symposium on Computer Architecture paper seeks to understand and design next-generation servers for emerging "warehouse-computing" environments. We make two key contributions. First, we put together a detailed evaluation infrastructure including a new benchmark suite for warehouse-computing workloads, and detailed performance, cost, and power models, to quantitatively characterize bottlenecks. Second, we study a new solution that incorporates volume non-server-class components in novel packaging solutions, with memory sharing and flash-based disk caching. Our results show that this approach has promise, with a 2X improvement on average in performance-per-dollar for our benchmark suite.
Wikipedia and Wikimedia have some of the best, most complete real-world documentation on how to build highly scalable systems. This paper by Domas Mituzas covers a lot of details about how Wikipedia works, including: an overview of the different packages used (Linux, PowerDNS, LVS, Squid, lighttpd, Apache, PHP5, Lucene, Mono, Memcached), how they use their CDN, how caching works, how they profile their code, how they store their media, how they structure their database access, how they handle search, how they handle load balancing and administration. All with real code examples and examples of configuration files. This is a really useful resource.
Recent comments
5 hours 15 min ago
11 hours 42 min ago
11 hours 52 min ago
21 hours 59 min ago
22 hours 11 min ago
1 day 1 hour ago
2 days 9 hours ago
2 days 10 hours ago
2 days 12 hours ago
2 days 14 hours ago