Geo-distributed Clusters

Todd Hoff's picture

Manage Downtime Risk by Connecting Multiple Data Centers into a Secure Virtual LAN

True high availability requires a presence in multiple data centers. The recent downtime of even a high-quality operation like Amazon makes this need all the more clear. Typically only the big boys can afford the complexity of operating in two or more data centers. Cloud computing, along with utility billing, starts to change that equation and levels the playing field. Even smaller outfits will be in a position to manage risk by spreading machines among EC2, 3tera, Slicehost, Mosso and other providers.

The question then becomes: given we aren't angels, how do we walk amongst the clouds? One fascinating answer is exquisitely explained by Dmitriy Samovskiy in his Linux Journal article, Building a Multisourced Infrastructure Using OpenVPN.

Google Architecture

Update: Greg Linden points to a new Google article, MapReduce: Simplified Data Processing on Large Clusters. Some interesting stats: 100k MapReduce jobs are executed each day; more than 20 petabytes of data are processed per day; more than 10k MapReduce programs have been implemented; machines are dual-processor with Gigabit Ethernet and 4-8 GB of memory.
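To put those figures in perspective, here is a rough back-of-the-envelope calculation in Python. It is only a sketch: it treats the quoted numbers as exact (the article gives them as round figures or lower bounds) and assumes decimal units (1 PB = 10^15 bytes), so the results are order-of-magnitude estimates at best. The function names are made up for this illustration.

```python
# Back-of-the-envelope estimates from the stats quoted above.
# Assumptions: figures treated as exact; decimal units (1 PB = 10**15 bytes).
JOBS_PER_DAY = 100_000
BYTES_PER_DAY = 20 * 10**15
SECONDS_PER_DAY = 24 * 60 * 60


def avg_bytes_per_job(bytes_per_day, jobs_per_day):
    """Average data volume handled by a single job."""
    return bytes_per_day / jobs_per_day


def avg_bytes_per_second(bytes_per_day):
    """Average aggregate throughput across the whole day."""
    return bytes_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    per_job_gb = avg_bytes_per_job(BYTES_PER_DAY, JOBS_PER_DAY) / 1e9
    per_sec_gb = avg_bytes_per_second(BYTES_PER_DAY) / 1e9
    print(f"about {per_job_gb:.0f} GB per job on average")
    print(f"about {per_sec_gb:.0f} GB per second, sustained")
```

Under those assumptions the average job touches roughly 200 GB, and the fleet as a whole sustains a bit over 230 GB per second around the clock.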

Google is the king of scalability. Everyone knows Google for their large, sophisticated, and fast searching, but they don't just shine in search. Their platform approach to building scalable applications allows them to roll out internet-scale applications at an alarmingly high, competition-crushing rate. Their goal is always to build a higher-performing, higher-scaling infrastructure to support their products. How do they do that?

Wikimedia architecture

Wikimedia is the platform on which Wikipedia, Wiktionary, and the other seven wiki dwarfs are built. This document is just excellent for the student trying to scale the heights of giant websites. It is full of details and innovative ideas that have been proven on some of the most used websites on the internet.

Product: 3PAR Remote Copy

3PAR Remote Copy is a uniquely simple and efficient replication technology that allows customers to protect and share any application data affordably. Built upon 3PAR Thin Copy technology, Remote Copy lowers the total cost of storage by addressing the cost and complexity of remote replication.

Common Uses of 3PAR Remote Copy:

Affordable Disaster Recovery: Mirror data cost-effectively across town or across the world.

Centralized Archive: Replicate data from multiple 3PAR InServs located in multiple data centers to a centralized data archive location.

Resilient Pod Architecture: Mutually replicate tier 1 or 2 data to tier 3 capacity between two InServs (application pods).

Remote Data Access: Replicate data to a remote location for sharing of data with remote users.

Product: NetApp MetroCluster Software

NetApp MetroCluster Software is a cost-effective, integrated high-availability storage cluster with site failover capability.

NetApp MetroCluster is an integrated high-availability and disaster recovery solution that can reduce system complexity and simplify management while ensuring greater return on investment. MetroCluster uses clustered server technology to replicate data synchronously between sites located miles apart, eliminating data loss in case of a disruption. A simple and powerful recovery process minimizes downtime, with little or no user action required.
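The phrase "replicate data synchronously" is what backs the no-data-loss claim: a write is acknowledged only once both sites hold it. The toy Python model below sketches that idea under simplifying assumptions. It is not NetApp's implementation or API; Site, SyncMirror and SiteUnavailable are hypothetical names invented for this illustration, and a real product handles link failures, ordering and resynchronization in far more detail.

```python
"""Toy model of synchronous replication between two sites.

Illustrative only: Site, SyncMirror and SiteUnavailable are hypothetical
names, not part of any NetApp product or API.
"""


class SiteUnavailable(Exception):
    """Raised when a write cannot reach every site."""


class Site:
    def __init__(self, name):
        self.name = name
        self.online = True
        self.blocks = {}  # block_id -> bytes


class SyncMirror:
    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def write(self, block_id, data):
        # Refuse the write unless both sites can take it, so the two
        # copies never diverge.
        for site in (self.primary, self.secondary):
            if not site.online:
                raise SiteUnavailable(site.name)
        self.primary.blocks[block_id] = data
        self.secondary.blocks[block_id] = data
        return True  # acknowledged only after both copies exist

    def read_after_failover(self, block_id):
        # If the primary is lost, serve the block from the surviving site.
        site = self.primary if self.primary.online else self.secondary
        return site.blocks[block_id]


if __name__ == "__main__":
    mirror = SyncMirror(Site("site-a"), Site("site-b"))
    mirror.write("block-1", b"payload")
    mirror.primary.online = False  # simulate losing the primary site
    print(mirror.read_after_failover("block-1"))
```

The trade-off in this design is latency: every write waits for the remote site, which is why synchronous mirroring is generally used over metropolitan rather than continental distances.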

At one company where I worked, they used the NetApp SnapMirror feature to replicate data across long distances to multiple data centers. They had a very fast backbone and it worked well. The issue with NetApp is always one of cost, but if you can afford it, it's a good option.

Paper: Designing Disaster Tolerant High Availability Clusters

A very detailed (339 pages) paper on how to use HP products to create a highly available cluster. It's somewhat dated and obviously concentrates on HP products, but it is still good information.

Table of contents:

1. Disaster Tolerance and Recovery in a Serviceguard Cluster
2. Building an Extended Distance Cluster Using Serviceguard
3. Designing a Metropolitan Cluster
4. Designing a Continental Cluster
5. Building Disaster-Tolerant Serviceguard Solutions Using Metrocluster with Continuous Access XP
6. Building Disaster Tolerant Serviceguard Solutions Using Metrocluster with EMC SRDF
7. Cascading Failover in a Continental Cluster

1. Disaster Tolerance and Recovery in a Serviceguard Cluster

Evaluating the Need for Disaster Tolerance
What is a Disaster Tolerant Architecture?
Types of Disaster Tolerant Clusters

Extended Distance Clusters
Metropolitan Cluster
Continental Cluster
Continental Cluster With Cascading Failover

Disaster Tolerant Architecture Guidelines

Protecting Nodes through Geographic Dispersion
Protecting Data through Replication
Using Alternative Power Sources
Creating Highly Available Networking
Disaster Tolerant Cluster Limitations

Managing a Disaster Tolerant Environment
Using this Guide with Your Disaster Tolerant Cluster Products

2. Building an Extended Distance Cluster Using Serviceguard

Types of Data Link for Storage and Networking
Two Data Center Architecture

Two Data Center FibreChannel Implementations
Advantages and Disadvantages of a Two-Data-Center Architecture

Three Data Center Architectures
Rules for Separate Network and Data Links
Guidelines on DWDM Links for Network and Data

3. Designing a Metropolitan Cluster

Designing a Disaster Tolerant Architecture for use with Metrocluster Products

Single Data Center
Two Data Centers and Third Location with Arbitrator(s)

Additional EMC SRDF Configurations

Setting up Hardware for 1 by 1 Configurations
Setting up Hardware for M by N Configurations

Worksheets

Disaster Tolerant Checklist
Cluster Configuration Worksheet
Package Configuration Worksheet

Next Steps

4. Designing a Continental Cluster

Understanding Continental Cluster Concepts

Mutual Recovery Configuration
Application Recovery in a Continental Cluster
Monitoring over a Wide Area Network
Cluster Events
Interpreting the Significance of Cluster Events
How Notifications Work
Alerts
Alarms
Creating Notifications for Failure Events
Creating Notifications for Events that Indicate a Return of Service
Performing Cluster Recovery
Notes on Packages in a Continental Cluster
How Serviceguard commands work in a Continentalcluster

Designing a Disaster Tolerant Architecture for use with Continentalclusters

Mutual Recovery
Serviceguard Clusters
Data Replication
Highly Available Wide Area Networking
Data Center Processes
Continentalclusters Worksheets

Preparing the Clusters

Setting up and Testing Data Replication
Configuring a Cluster without Recovery Packages
Configuring a Cluster with Recovery Packages

Building the Continentalclusters Configuration

Preparing Security Files
Creating the Monitor Package
Editing the Continentalclusters Configuration File
Checking and Applying the Continentalclusters Configuration
Starting the Continentalclusters Monitor Package
Validating the Configuration
Documenting the Recovery Procedure
Reviewing the Recovery Procedure

Testing the Continental Cluster

Testing Individual Packages
Testing Continentalclusters Operations

Switching to the Recovery Packages in Case of Disaster

Receiving Notification
Verifying that Recovery is Needed
Using the Recovery Command to Switch All Packages
How the cmrecovercl Command Works

Forcing a Package to Start
Restoring Disaster Tolerance

Restore Clusters to their Original Roles
Primary Packages Remain on the Surviving Cluster
Primary Packages Remain on the Surviving Cluster using cmswitchconcl
Newly Created Cluster Will Run Primary Packages
Newly Created Cluster Will Function as Recovery Cluster for All Recovery Groups

Maintaining a Continental Cluster

Adding a Node to a Cluster or Removing a Node from a Cluster
Adding a Package to the Continental Cluster
Removing a Package from the Continental Cluster
Changing Monitoring Definitions
Checking the Status of Clusters, Nodes, and Packages
Reviewing Messages and Log Files
Deleting a Continental Cluster Configuration
Renaming a Continental Cluster
Checking Java File Versions
Next Steps

Support for Oracle RAC Instances in a Continentalclusters Environment

Configuring the Environment for Continentalclusters to Support Oracle RAC
Initial Startup of Oracle RAC Instance in a Continentalclusters Environment
Failover of Oracle RAC Instances to the Recovery Site
Failback of Oracle RAC Instances After a Failover

5. Building Disaster-Tolerant Serviceguard Solutions Using Metrocluster with Continuous Access XP

Files for Integrating XP Disk Arrays with Serviceguard Clusters
Overview of Continuous Access XP Concepts

PVOLs and SVOLs
Device Groups and Fence Levels

Creating the Cluster
Preparing the Cluster for Data Replication

Creating the RAID Manager Configuration
Defining Storage Units

Configuring Packages for Disaster Recovery
Completing and Running a Metrocluster Solution with Continuous Access XP

Maintaining a Cluster that uses Metrocluster/CA
XP/CA Device Group Monitor

Completing and Running a Continental Cluster Solution with Continuous Access XP

Setting up a Primary Package on the Primary Cluster
Setting up a Recovery Package on the Recovery Cluster
Setting up the Continental Cluster Configuration
Switching to the Recovery Cluster in Case of Disaster
Failback Scenarios
Maintaining the Continuous Access XP Data Replication Environment

6. Building Disaster Tolerant Serviceguard Solutions Using Metrocluster with EMC SRDF

Files for Integrating Serviceguard with EMC SRDF
Overview of EMC and SRDF Concepts
Preparing the Cluster for Data Replication

Installing the Necessary Software
Building the Symmetrix CLI Database
Determining Symmetrix Device Names on Each Node

Building a Metrocluster Solution with EMC SRDF

Setting up 1 by 1 Configurations
Grouping the Symmetrix Devices at Each Data Center
Setting up M by N Configurations
Configuring Serviceguard Packages for Automatic Disaster Recovery
Maintaining a Cluster that Uses Metrocluster/SRDF
Managing Business Continuity Volumes
R1/R2 Swapping

Building a Continental Cluster Solution with EMC SRDF

Setting up a Primary Package on the Primary Cluster
Setting up a Recovery Package on the Recovery Cluster
Setting up the Continental Cluster Configuration
Switching to the Recovery Cluster in Case of Disaster
Failback Scenarios
Maintaining the EMC SRDF Data Replication Environment
R1/R2 Swapping

7. Cascading Failover in a Continental Cluster

Overview

Symmetrix Configuration
Using Template Files

Data Storage Setup

Setting Up Symmetrix Device Groups
Setting up Volume Groups
Testing the Volume Groups

Primary Cluster Package Setup
Recovery Cluster Package Setup
Continental Cluster Configuration
Data Replication Procedures

Data Initialization Procedures
Data Refresh Procedures in the Steady State
Data Replication in Failover and Failback Scenarios

Major Websites Down: Or Why You Want to Run in Two or More Data Centers.

A lot of sites hosted in San Francisco are down because of at least 6 back-to-back power outages. More details at laughingsquid.

Sites like Second Life, Craigslist, Technorati, Yelp and all Six Apart properties (TypePad, LiveJournal and Vox) are down. The cause was an underground explosion in a transformer vault under a manhole at 560 Mission Street. Flames shot 6 feet out from the manhole cover. Over 30,000 PG&E customers are without power.
