NetApp MetroCluster software provides a cost-effective, integrated high-availability storage cluster with site failover capability. NetApp MetroCluster is an integrated high-availability and disaster recovery solution that can reduce system complexity and simplify management while ensuring greater return on investment. MetroCluster uses clustered server technology to replicate data synchronously between sites located miles apart, eliminating data loss in case of a disruption. A simple and powerful recovery process minimizes downtime, with little or no user action required.

At one company I worked at, they used the NetApp SnapMirror feature to replicate data across long distances to multiple datacenters. They had a very fast backbone and it worked well. The issue with NetApp is always cost, but if you can afford it, it's a good option.
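To make the "synchronous" part concrete, here is a minimal Python sketch of the general idea. It is not NetApp's implementation: `Site` and `SyncMirror` are hypothetical names, and the "sites" are just in-memory dictionaries. The point it illustrates is that a write is acknowledged only after both sites have stored it, so losing one site cannot lose an acknowledged write.

```python
class Site:
    """A hypothetical storage site that keeps blocks in memory."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}
        self.online = True

    def store(self, block_id, data):
        if not self.online:
            raise ConnectionError(f"{self.name} is unreachable")
        self.blocks[block_id] = data


class SyncMirror:
    """Acknowledge a write only after every site has stored it."""

    def __init__(self, primary, secondary):
        self.sites = [primary, secondary]

    def write(self, block_id, data):
        # Synchronous replication: both stores must succeed before the ack.
        # If either site raises, the caller never receives an ack.
        for site in self.sites:
            site.store(block_id, data)
        return "ack"

    def read(self, block_id):
        # Serve the read from the first site that is still online.
        for site in self.sites:
            if site.online:
                return site.blocks[block_id]
        raise ConnectionError("no site available")
```

The trade-off is that every write waits on the slower site, which is why synchronous replication is usually limited to sites a short distance apart.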
A very detailed (339 pages) paper on how to use HP products to create a highly available cluster. It's somewhat dated and obviously concentrates on HP products, but it is still good information.

Table of contents:

1. Disaster Tolerance and Recovery in a Serviceguard Cluster
* Evaluating the Need for Disaster Tolerance
* What is a Disaster Tolerant Architecture?
* Types of Disaster Tolerant Clusters
* Extended Distance Clusters
* Metropolitan Cluster
* Continental Cluster
* Continental Cluster With Cascading Failover
* Disaster Tolerant Architecture Guidelines
* Protecting Nodes through Geographic Dispersion
* Protecting Data through Replication
* Using Alternative Power Sources
* Creating Highly Available Networking
* Disaster Tolerant Cluster Limitations
* Managing a Disaster Tolerant Environment
* Using this Guide with Your Disaster Tolerant Cluster Products

2. Building an Extended Distance Cluster Using ServiceGuard
* Types of Data Link for Storage and Networking
* Two Data Center Architecture
* Two Data Center FibreChannel Implementations
* Advantages and Disadvantages of a Two-Data-Center Architecture
* Three Data Center Architectures
* Rules for Separate Network and Data Links
* Guidelines on DWDM Links for Network and Data

3. Designing a Metropolitan Cluster
* Designing a Disaster Tolerant Architecture for use with Metrocluster Products
* Single Data Center
* Two Data Centers and Third Location with Arbitrator(s)
* Additional EMC SRDF Configurations
* Setting up Hardware for 1 by 1 Configurations
* Setting up Hardware for M by N Configurations
* Worksheets
* Disaster Tolerant Checklist
* Cluster Configuration Worksheet
* Package Configuration Worksheet
* Next Steps

4. Designing a Continental Cluster
* Understanding Continental Cluster Concepts
* Mutual Recovery Configuration
* Application Recovery in a Continental Cluster
* Monitoring over a Wide Area Network
* Cluster Events
* Interpreting the Significance of Cluster Events
* How Notifications Work
* Alerts
* Alarms
* Creating Notifications for Failure Events
* Creating Notifications for Events that Indicate a Return of Service
* Performing Cluster Recovery
* Notes on Packages in a Continental Cluster
* How Serviceguard commands work in a Continentalcluster
* Designing a Disaster Tolerant Architecture for use with Continentalclusters
* Mutual Recovery
* Serviceguard Clusters
* Data Replication
* Highly Available Wide Area Networking
* Data Center Processes
* Continentalclusters Worksheets
* Preparing the Clusters
* Setting up and Testing Data Replication
* Configuring a Cluster without Recovery Packages
* Configuring a Cluster with Recovery Packages
* Building the Continentalclusters Configuration
* Preparing Security Files
* Creating the Monitor Package
* Editing the Continentalclusters Configuration File
* Checking and Applying the Continentalclusters Configuration
* Starting the Continentalclusters Monitor Package
* Validating the Configuration
* Documenting the Recovery Procedure
* Reviewing the Recovery Procedure
* Testing the Continental Cluster
* Testing Individual Packages
* Testing Continentalclusters Operations
* Switching to the Recovery Packages in Case of Disaster
* Receiving Notification
* Verifying that Recovery is Needed
* Using the Recovery Command to Switch All Packages
* How the cmrecovercl Command Works
* Forcing a Package to Start
* Restoring Disaster Tolerance
* Restore Clusters to their Original Roles
* Primary Packages Remain on the Surviving Cluster
* Primary Packages Remain on the Surviving Cluster using cmswitchconcl
* Newly Created Cluster Will Run Primary Packages
* Newly Created Cluster Will Function as Recovery Cluster for All Recovery Groups
* Maintaining a Continental Cluster
* Adding a Node to a Cluster or Removing a Node from a Cluster
* Adding a Package to the Continental Cluster
* Removing a Package from the Continental Cluster
* Changing Monitoring Definitions
* Checking the Status of Clusters, Nodes, and Packages
* Reviewing Messages and Log Files
* Deleting a Continental Cluster Configuration
* Renaming a Continental Cluster
* Checking Java File Versions
* Next Steps
* Support for Oracle RAC Instances in a Continentalclusters Environment
* Configuring the Environment for Continentalclusters to Support Oracle RAC
* Initial Startup of Oracle RAC Instance in a Continentalclusters Environment
* Failover of Oracle RAC Instances to the Recovery Site
* Failback of Oracle RAC Instances After a Failover

5. Building Disaster-Tolerant Serviceguard Solutions Using Metrocluster with Continuous Access XP
* Files for Integrating XP Disk Arrays with Serviceguard Clusters
* Overview of Continuous Access XP Concepts
* PVOLs and SVOLs
* Device Groups and Fence Levels
* Creating the Cluster
* Preparing the Cluster for Data Replication
* Creating the RAID Manager Configuration
* Defining Storage Units
* Configuring Packages for Disaster Recovery
* Completing and Running a Metrocluster Solution with Continuous Access XP
* Maintaining a Cluster that uses Metrocluster/CA
* XP/CA Device Group Monitor
* Completing and Running a Continental Cluster Solution with Continuous Access XP
* Setting up a Primary Package on the Primary Cluster
* Setting up a Recovery Package on the Recovery Cluster
* Setting up the Continental Cluster Configuration
* Switching to the Recovery Cluster in Case of Disaster
* Failback Scenarios
* Maintaining the Continuous Access XP Data Replication Environment

6. Building Disaster Tolerant Serviceguard Solutions Using Metrocluster with EMC SRDF
* Files for Integrating ServiceGuard with EMC SRDF
* Overview of EMC and SRDF Concepts
* Preparing the Cluster for Data Replication
* Installing the Necessary Software
* Building the Symmetrix CLI Database
* Determining Symmetrix Device Names on Each Node
* Building a Metrocluster Solution with EMC SRDF
* Setting up 1 by 1 Configurations
* Grouping the Symmetrix Devices at Each Data Center
* Setting up M by N Configurations
* Configuring Serviceguard Packages for Automatic Disaster Recovery
* Maintaining a Cluster that Uses Metrocluster/SRDF
* Managing Business Continuity Volumes
* R1/R2 Swapping
* Building a Continental Cluster Solution with EMC SRDF
* Setting up a Primary Package on the Primary Cluster
* Setting up a Recovery Package on the Recovery Cluster
* Setting up the Continental Cluster Configuration
* Switching to the Recovery Cluster in Case of Disaster
* Failback Scenarios
* Maintaining the EMC SRDF Data Replication Environment
* R1/R2 Swapping

7. Cascading Failover in a Continental Cluster
* Overview
* Symmetrix Configuration
* Using Template Files
* Data Storage Setup
* Setting Up Symmetrix Device Groups
* Setting up Volume Groups
* Testing the Volume Groups
* Primary Cluster Package Setup
* Recovery Cluster Package Setup
* Continental Cluster Configuration
* Data Replication Procedures
* Data Initialization Procedures
* Data Refresh Procedures in the Steady State
* Data Replication in Failover and Failback Scenarios
lighttpd (pronounced "lighty") is a web server designed to be secure, fast, standards-compliant, and flexible while being optimized for speed-critical environments. Its low memory footprint (compared to other web servers), light CPU load, and speed goals make lighttpd suitable for servers that are suffering load problems, or for serving static media separately from dynamic content. lighttpd is free, open source software distributed under the BSD license. It runs on GNU/Linux and other Unix-like operating systems, as well as Microsoft Windows.

Features:
* Load-balancing FastCGI, SCGI and HTTP-proxy support
* chroot support
* select()-/poll()-based web server
* Support for more efficient event notification schemes like kqueue and epoll
* Conditional rewrites (mod_rewrite)
* SSL and TLS support, via OpenSSL
* Authentication against an LDAP server
* rrdtool statistics
* Rule-based downloading with possibility of a script handling only authentication
* Server-side includes support
* Flexible virtual hosting
* Modules support
* Cache Meta Language (currently being replaced by mod_magnet)
* Minimal WebDAV support
* Servlet (AJP) support (in versions 1.5.x and up)
* HTTP compression using mod_compress and the newer mod_deflate (1.5.x)
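The select()/poll()/epoll/kqueue items are what let one process juggle many connections without a thread per client. The sketch below shows that event-loop style in Python using the standard `selectors` module. It is only an illustration of the technique, not lighttpd's code (lighttpd is written in C), and `make_listener`, `serve`, and `RESPONSE` are names invented for this example.

```python
import selectors
import socket

RESPONSE = b"HTTP/1.0 200 OK\r\nContent-Length: 2\r\n\r\nok"


def make_listener(host="127.0.0.1", port=0):
    """Create a non-blocking listening socket (port 0 means any free port)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen()
    listener.setblocking(False)
    return listener


def serve(listener, max_requests):
    """Answer max_requests connections on a single thread, then return."""
    # DefaultSelector picks the best mechanism the OS offers
    # (epoll, kqueue, poll, or plain select).
    selector = selectors.DefaultSelector()
    selector.register(listener, selectors.EVENT_READ)
    served = 0
    while served < max_requests:
        for key, _events in selector.select(timeout=1.0):
            if key.fileobj is listener:
                # A new client is waiting: accept it and watch it for data.
                conn, _addr = listener.accept()
                conn.setblocking(False)
                selector.register(conn, selectors.EVENT_READ)
            else:
                conn = key.fileobj
                conn.recv(4096)         # read (and ignore) the request
                conn.sendall(RESPONSE)  # tiny reply, fits the socket buffer
                selector.unregister(conn)
                conn.close()
                served += 1
    selector.unregister(listener)
    selector.close()
    return served
```

A real server would parse the request, buffer partial reads and writes, and keep running, but the shape of the loop (wait for ready sockets, handle each briefly, go back to waiting) is the same.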
Information Sources
* http://en.wikipedia.org/wiki/Lighttpd
* http://highscalability.com/paper-lightweight-web-servers
This paper is a great overview of different lightweight web servers. A lot of websites use lightweight web servers to serve images and static content. YouTube is one example: http://highscalability.com/youtube-architecture. So if you need to improve performance, consider switching to a different web server for some types of content.

Overview: Recent years have enjoyed a florescence of interesting implementations of Web servers, including lighttpd, litespeed, and mongrel, among others. These Web servers boast different combinations of performance, ease of administration, portability, security, and related values. The following engineering study surveys the field of lightweight Web servers to help you find one likely to meet the technical requirements of your next project. "Lightweight" Web servers like lighttpd, litespeed, and mongrel can offer dramatic benefits for your projects. This article surveys the possibilities and shows how they apply to you.

Important dimensions for evaluation of a Web server include:
* Performance: How fast does it respond to requests?
* Scalability: Does the server continue to behave reliably when many users simultaneously access it?
* Security: Does the server do only the operations it should? What support does it offer for authenticating users and encrypting its traffic? Does its use make nearby applications or hosts more vulnerable?
* Availability: What are the failure modes and incidences of the server?
* Compliance to standards: Does the server respect the pertinent RFCs?
* Flexibility: Can the server be tuned to accommodate heavy request loads, or computationally demanding dynamic pages, or expensive authentication, or ...?
* Platform requirements: On what range of platforms is the server available? Does it have specific hardware needs?
* Manageability: Is the server easy to set up and maintain? Is it compatible with organizational standards for logging, auditing, costing, and so on?
A lot of sites hosted in San Francisco are down because of at least 6 back-to-back power outages. More details at laughingsquid. Sites like Second Life, Craigslist, Technorati, Yelp, and all the Six Apart properties (TypePad, LiveJournal, and Vox) are down. The cause was an underground explosion in a transformer vault under a manhole at 560 Mission Street. Flames shot 6 feet out from the manhole cover. Over 30,000 PG&E customers are without power. What's perplexing is that the UPS backup and diesel generators didn't kick in to bring the datacenter back online. I've never toured that datacenter, but they usually have massive backup systems. It's probably one of those multiple simultaneous failure situations that you hope never happen in real life, but too often do. Or maybe the infrastructure wasn't rolled out completely.

Update: the cause was a cascade of failures in a tightly coupled system that could never happen :-) Details at Failure Happens: A summary of the power outage at 365 Main.

It's just these sorts of emergencies that make us think. How would you handle a similar failure? How can you make your website span more than one data center? Is the cost of putting in all this infrastructure worth it compared to your website being down for a day? All good questions that are hard to answer and even harder to implement. Geo-distributed clustering is not easy, which is why most companies don't do it, but for some help distributing your website take a look at http://highscalability.com/tags/geo-distributed-clusters.
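As a toy illustration of the "span more than one data center" question, here is a minimal Python sketch of priority-based failover. Everything in it is assumed for the example: the data center names, the `health` table, and the `pick_datacenter` helper are all hypothetical, and a real setup would get health from monitoring and steer traffic through DNS or a global load balancer.

```python
def pick_datacenter(datacenters, is_healthy):
    """Return the first healthy data center in priority order, or None."""
    for name in datacenters:
        if is_healthy(name):
            return name
    return None


# Hypothetical data center names and health state, for illustration only.
PRIORITY = ["dc-west", "dc-east"]
health = {"dc-west": False, "dc-east": True}  # dc-west has lost power

active = pick_datacenter(PRIORITY, lambda name: health[name])
print(active)  # prints "dc-east"
```

Choosing where to send traffic is the easy part. The expensive part, and the reason most companies skip it, is keeping the data in the second site current enough that failing over is actually useful.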
If you want to adopt a shard architecture, but don't want to start from scratch, you may want to consider Hibernate's sharding system. Hibernate Shards is a framework designed to encapsulate and minimize the complexity of working with sharded data by adding support for horizontal partitioning to Hibernate Core.

Hibernate Shards key features:
* Standard Hibernate programming model - Hibernate Shards allows you to continue using the Hibernate APIs you know and love: SessionFactory, Session, Criteria, Query. If you already know how to use Hibernate, you already know how to use Hibernate Shards.
* Flexible sharding strategies - Distribute data across your shards any way you want. Use one of the default strategies we provide or plug in your own application-specific logic.
* Support for virtual shards - Think your sharding strategy is never going to change? Think again. Adding new shards and redistributing your data is one of the toughest operational challenges you will face once you've deployed your shard-aware application. Hibernate Shards supports virtual shards, a feature designed to simplify the process of resharding your data.
* Free/open source - Hibernate Shards is licensed under the LGPL (GNU Lesser General Public License).
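The virtual shard idea is easier to see in code. The following is a language-neutral sketch in Python, not the Hibernate Shards API (which is Java): `VirtualShardRouter` and the shard names are made up for illustration. Keys hash to a fixed number of virtual shards, and a separate table maps each virtual shard to a physical database, so resharding means editing the table rather than rehashing every key.

```python
import zlib


class VirtualShardRouter:
    """Map keys to fixed virtual shards, and virtual shards to physical ones."""

    def __init__(self, num_virtual, physical_shards):
        self.num_virtual = num_virtual
        # Spread the virtual shards round-robin over the physical shards.
        self.assignment = {
            v: physical_shards[v % len(physical_shards)]
            for v in range(num_virtual)
        }

    def virtual_shard(self, key):
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        return zlib.crc32(key.encode("utf-8")) % self.num_virtual

    def physical_shard(self, key):
        return self.assignment[self.virtual_shard(key)]

    def move(self, virtual_id, new_physical):
        """Reshard by reassigning one virtual shard; keys keep their virtual id."""
        self.assignment[virtual_id] = new_physical


router = VirtualShardRouter(num_virtual=8, physical_shards=["db0", "db1"])
vid = router.virtual_shard("user:42")
router.move(vid, "db2")  # add capacity: only this virtual shard's rows move
```

Because the key-to-virtual-shard mapping never changes, adding a database only requires copying the rows of the virtual shards you reassign.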
Google Talk is Google's instant communications service. Interestingly, the IM messages aren't the major architectural challenge; handling user presence indications dominates the design. They also have the challenge of handling small, low-latency messages and integrating with many other systems. How do they do it?

Site: http://www.google.com/talk
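Why would presence outweigh the messages themselves? A chat message goes to one recipient, but a single status change has to be delivered to every online contact. The toy Python model below makes that fan-out visible. It is an assumption-laden sketch, not Google's design: `PresenceHub` and the user names are invented for the example.

```python
from collections import defaultdict


class PresenceHub:
    """Toy model: count how many deliveries each status change causes."""

    def __init__(self):
        self.contacts = defaultdict(set)  # user -> set of contacts
        self.online = set()
        self.deliveries = 0

    def add_contact(self, a, b):
        # Contact relationships are treated as symmetric here.
        self.contacts[a].add(b)
        self.contacts[b].add(a)

    def set_online(self, user, is_online):
        if is_online:
            self.online.add(user)
        else:
            self.online.discard(user)
        # One status change fans out to every contact who is online.
        recipients = self.contacts[user] & self.online
        self.deliveries += len(recipients)
        return recipients


hub = PresenceHub()
hub.add_contact("alice", "bob")
hub.add_contact("alice", "carol")
hub.set_online("bob", True)
hub.set_online("carol", True)
notified = hub.set_online("alice", True)  # one change, two deliveries
```

Scale the contact lists up to hundreds of entries and the status changes up to millions per day, and the presence traffic quickly exceeds the chat traffic.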
The Architecture
* Data Center.
* Storage.
* Development Environment.
* OS.
* Web Server.
* Database.
* Database abstraction layer.
* Load balancing.
* Web Framework.
* Real-time messaging.
* Identity management.
* Distributed job management.
* Ad serving.
* Standard API to website.
* AJAX library.
* PHP Cache.
* Object and Content Cache.
* Client Side Cache.
* Monitoring.
* Log Analysis.
* Testing.
* Performance Analysis.
* Backup and Restore.
* Fault Tolerance.
* Scalability Plan.
* Business Continuity Plan.
* Future Directions.
Lessons Learned

To discuss this article please visit the forums at
As users come to depend on MySQL, they find that they have to deal with issues of reliability, scalability, and performance--issues that are not well documented but are critical to a smoothly functioning site. This book is an insider's guide to these little-understood topics. Author Jeremy Zawodny has managed large numbers of MySQL servers for mission-critical work at Yahoo!, maintained years of contacts with the MySQL AB team, and presents regularly at conferences. Jeremy and co-author Derek J. Balling have spent months experimenting, interviewing major users of MySQL, talking to MySQL AB, benchmarking, and writing some of their own tools in order to produce the information in this book.

In High Performance MySQL you will learn about MySQL indexing and optimization in depth so you can make better use of these key features. You will learn practical replication, backup, and load-balancing strategies with information that goes beyond available tools to discuss their effects in real-life environments. And you'll learn the supporting techniques you need to carry out these tasks, including advanced configuration, benchmarking, and investigating logs.

Topics include:
* A review of configuration and setup options
* Storage engines and table types
* Benchmarking
* Indexes
* Query Optimization
* Application Design
* Server Performance
* Replication
* Load-balancing
* Backup and Recovery
* Security