The “Four Hamiltons” Framework for Mitigating Faults in the Cloud: Avoid it, Mask it, Bound it, Fix it Fast

This is a guest post by Patrick Eaton, Software Engineer and Distributed Systems Architect at Stackdriver.

Stackdriver provides intelligent monitoring-as-a-service for cloud-hosted applications.  Behind this easy-to-use service is a large distributed system for collecting and storing metrics and events, monitoring and alerting on them, analyzing them, and serving up all the results in a web UI.  Because we ourselves run in the cloud (mostly on AWS), we spend a lot of time thinking about how to deal with faults in the cloud.  We have developed a framework for thinking about fault mitigation for large, cloud-hosted systems.  We endearingly call this framework the “Four Hamiltons” because it is inspired by an article from James Hamilton, the Vice President and Distinguished Engineer at Amazon Web Services.

The article that led to this framework is called “The Power Failure Seen Around the World”.  Hamilton analyzes the causes of the power outage that affected Super Bowl XLVII in early 2013.  In the article, Hamilton writes:

As when looking at any system faults, the tools we have to mitigate the impact are: 1) avoid the fault entirely, 2) protect against the fault with redundancy, 3) minimize the impact of the fault through small fault zones, and 4) minimize the impact through fast recovery.

The mitigation options are roughly ordered by increasing impact to the customer.  In this article, we will refer to these strategies, in order, as “Avoid it”, “Mask it”, “Bound it”, and “Fix it fast”.

While Hamilton enumerates techniques for dealing with faults, the architectures of the public clouds determine the types of system faults that we should expect and for which we should design solutions.  Amazon Web Services (AWS) and Google Cloud expose three fault domains.  At the smallest scale, an “instance” is a single virtual machine that fails independently from other virtual machines.  At the largest scale, a “region” is a geographic area, often with multiple data centers, that fails independently from other regions.  In between, a region comprises multiple “zones”, each of which is typically a single data center and is isolated from other zones in the region.  (Rackspace also exposes multiple geographic regions, but it claims its data center network is always available and thus the concept of zones is not needed.)

Combining the fault domains and the techniques for addressing faults in those domains, we can produce a grid, as shown below, to guide us as we consider how to survive faults in cloud-hosted applications.

In the sections below, we discuss each strategy and how it can be applied to the fault domains in the cloud.  We also highlight how we employ these strategies in our system at Stackdriver.

Avoid It

The first and most obvious strategy for surviving failures is to avoid them altogether.  Faults that never happen are the easiest ones to fix and have no impact on the user experience.

Traditionally, avoiding failures required high-quality hardware--reliable, enterprise-grade servers, disks, and networking gear.  The cloud, however, is built from commodity components, and cloud infrastructure providers do not make any guarantees about the reliability of the hardware.  Consequently, as system designers and application builders, we have little control over the reliability of the virtual hardware we provision.  What we can control, however, is the software.  Fault avoidance in the cloud depends on solid system architecture and disciplined software engineering.

To avoid failures at the instance level, write high-quality software and test it.  Manage software complexity.  Choose high-quality software components (web server, database, etc.) and libraries, preferably ones that have been battle tested.  Track configuration carefully.  Automate management to avoid error-prone, manual changes.

Application builders generally cannot avoid faults in the larger fault domains (zones and regions) because they have no control over those domains.

There is, however, a way to avoid some of the faults--let someone else handle them.  Some cloud providers offer hosted services that claim to protect against zone or region outages.  Certainly, components of these hosted services do fail.  The key point, though, is that, from the perspective of the application builder, the hosted service provides the abstraction of being reliable because the cloud provider masks the faults in those services.

An example of a hosted service that avoids faults is AWS’s Relational Database Service (RDS).  It offers a multi-AZ (zone) deployment option that includes a synchronous replica with automatic failover that can mask zone failures, making it appear to be a reliable database to the application builder.  Another example is AWS’s Simple Queue Service (SQS).  Documentation suggests that it stores copies of all messages across data centers (in a single region), and so, from the perspective of the system builder, it is reliable in the presence of instance and zone failures.  As for the largest fault domain, AWS’s Route 53 hosted DNS service supports failover between regions to provide the appearance of reliable name resolution, even in the presence of region failures.
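
To make the “let a hosted service handle it” option concrete, here is a minimal sketch (ours, not from the article) of provisioning a Multi-AZ RDS instance with the boto3 library; the instance identifier, sizes, and credentials are hypothetical placeholders.

```python
# Illustrative sketch only: a Multi-AZ RDS database provisioned with boto3.
# The identifier, engine, sizes, and credentials below are placeholders.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.create_db_instance(
    DBInstanceIdentifier="app-db",        # hypothetical name
    Engine="mysql",
    DBInstanceClass="db.m5.large",
    AllocatedStorage=100,
    MasterUsername="admin",
    MasterUserPassword="change-me",       # keep real credentials in a secrets manager
    MultiAZ=True,                         # synchronous standby in another zone,
                                          # with automatic failover
)
```

With MultiAZ enabled, the failover happens inside the hosted service; from the application’s perspective, the database endpoint simply remains available.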

Stackdriver’s Approach

To avoid software errors and support the goal of high-quality software, we use Rollbar (https://rollbar.com/) for software error monitoring and reporting.  When our software fails, a message is sent to the Rollbar API to record the error.  The bugs that trigger these reports are quickly squashed to avoid possible future impact.

To avoid configuration errors at the instance level, we use Puppet (http://puppetlabs.com/) to ensure that all instances are running a known, tested, stable configuration.  With Puppet, we define the software that should be running on each instance, the versions of that software, and any configuration for it.  We can provision and deploy new instances without introducing errors (even when responding quickly during an incident).

To avoid errors at the zone and region level, we make heavy use of hosted services that provide reliable service despite component failures within the hosted service.  For relational databases, we use AWS RDS with multi-zone replication to avoid instance and zone failures.  For reliable load balancing, we use AWS ELB to protect against instance and zone failures.  Finally, for much of our low-volume messaging, we use AWS SQS as a reliable queueing service that is not affected by instance or zone failures.

Mask It

If you cannot avoid a fault, the next best alternative is to mask the fault with redundancy.  Masked failures generally do not affect availability, though the performance of the service may degrade somewhat.  For example, after a failure, request latency may increase.

There are several techniques for providing redundancy (which we refer to generically as replication), and thus solutions that provide masking can take several forms.  Some systems fully replicate functionality across multiple peers so that any resource can handle any request.  Other systems use a single master or primary node with one or more slave or secondary nodes and automatic failover when the master fails.  Still other systems use a clustering approach in which a quorum of nodes determines the valid state.
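
To make the quorum idea concrete, here is a small illustrative sketch (not tied to any particular system): with N replicas, a write quorum W, and a read quorum R chosen so that R + W > N, every read overlaps at least one replica that holds the latest acknowledged write.

```python
# Illustrative quorum read: responses are (version, value) pairs from the
# replicas that answered.  With R + W > N, at least one response carries the
# most recently committed write, so the highest version wins.
def quorum_read(responses, n_replicas, write_quorum):
    read_quorum = n_replicas - write_quorum + 1   # smallest R with R + W > N
    if len(responses) < read_quorum:
        raise RuntimeError("not enough replicas answered for a quorum read")
    return max(responses, key=lambda r: r[0])[1]

# Example: 3 replicas, W = 2, so R = 2; the stale replica is outvoted.
print(quorum_read([(7, "new"), (6, "stale")], n_replicas=3, write_quorum=2))
```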

To mitigate failures of individual instances, you must ensure that a replica is online to take over when an instance fails.  To mitigate the failure of a zone, you must ensure that all application services are distributed across more than one zone.  Also, you must be sure that the service is sufficiently over-provisioned and distributed to handle the increased load in the healthy zones in the event of a zone failure.  Similarly, handling the failure of a region requires that all services be distributed across more than one region and that capacity is sufficient to handle the load when one region is unavailable.
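
A quick back-of-the-envelope calculation (with made-up numbers) shows how much per-zone headroom that over-provisioning implies:

```python
# Illustrative capacity check: if one of `zones` zones fails, the survivors
# must absorb the entire load.
def per_zone_capacity_needed(total_load, zones):
    return total_load / (zones - 1)

# Example: 9,000 req/s spread over 3 zones.  Steady state is 3,000 req/s per
# zone, but each zone must handle 4,500 req/s to survive the loss of a zone.
print(per_zone_capacity_needed(9000, 3))
```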

Stackdriver’s Approach

Many services in our system employ typical scale-out design strategies.  For masking both instance and zone failures, some services use multiple workers consuming messages from a single AWS SQS queue.  When an instance hosting a worker fails, the failed worker stops consuming messages while the other workers continue processing requests.  Going one level deeper, the SQS service masks failures so as to appear reliable to the application builder.
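
For illustration only (this is not Stackdriver’s actual worker code), a competing-consumers worker on top of SQS with boto3 might look roughly like this; the queue URL and handler are placeholders.

```python
# Illustrative SQS worker: many copies of this loop can run across instances
# and zones, all draining the same queue.
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-work-queue"

def handle(body):
    print("processing", body)             # placeholder for real work

while True:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,                # long polling
    )
    for msg in resp.get("Messages", []):
        handle(msg["Body"])
        # Delete only after successful processing; if this worker's instance
        # dies mid-message, the message becomes visible to another worker.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```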

Other services mask instance and zone failures by deploying multiple workers behind an AWS Elastic Load Balancer (ELB).  When an instance hosting a worker fails or becomes unavailable due to a zone failure, the load balancer removes the worker and continues distributing load among the remaining workers.  Going one level deeper, the ELB hosted service provides the abstraction of a reliable load balancer.

We use Cassandra as our primary data store for time series data that must be accessed at interactive latencies.  The basic design of Cassandra ensures that it survives instance failures.  Further, our Cassandra deployment spans multiple zones in a single region.  It is configured with a replication level and strategy that allow it to survive the failure of an entire zone.  Each key is replicated three times, and the AWS-aware Ec2Snitch guides replica placement so that no two replicas are hosted in the same zone.
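
As a rough sketch of this kind of configuration (the keyspace name, data center name, and seed address are illustrative, not our production values), the keyspace definition might look like this with the Python Cassandra driver:

```python
# Illustrative keyspace definition.  With Ec2Snitch, the region appears as the
# data center ("us-east") and each availability zone appears as a rack, so
# NetworkTopologyStrategy places the three replicas in three different zones.
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.10"])          # seed node address is a placeholder
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS metrics
    WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east': 3}
""")
```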

To support our endpoint monitoring feature (which measures the availability and latency of customers’ Internet-facing services), we run services that initiate the probe requests.  We deploy probes across multiple regions and even multiple providers to ensure that probes can query endpoints even in the event of the failure of a region.  By using multiple providers, this feature can even tolerate a complete outage of a single provider.

Bound It

Unfortunately, it is not always possible to avoid or mask failures.  When an incident cannot be masked, you can try to minimize the scope of the failure.  While customers may be impacted by the incident, this strategy considers ways to reduce that impact.

Failures at any level--instance, zone, or region--can cause problems for which bounding the impact is the best mitigation strategy.  While the design and engineering effort varies depending on the type of failure that is being limited, the choices you should consider are similar.  Typical choices are detailed below.

Limit the impact to only some customers.  The most common approach for this strategy is sharding.  With sharding, different customers are served by resources in different fault domains.  When resources fail, only those customers hosted on the failed resources are affected.
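
A minimal sketch of shard routing (the shard names and hash choice are illustrative) is shown below; any stable mapping from customer to fault domain has the same effect.

```python
# Illustrative shard routing: each customer maps to exactly one shard, so a
# failure in one shard affects only the customers assigned to it.
import hashlib

SHARDS = ["shard-a", "shard-b", "shard-c"]    # e.g., one stack per fault domain

def shard_for(customer_id):
    digest = hashlib.sha1(customer_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("customer-42"))
```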

Prioritize certain classes of requests.  If the failure is related to limited capacity, you can give preference to certain types of requests.  You could prioritize traffic from paying customers over trial customers.  You could prioritize high-profit requests over low-profit requests.  You could prioritize low-latency requests over long-latency requests.  Shedding low-priority traffic can ease load while you fix the system.
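
One illustrative way to express class-based shedding (the request classes and thresholds are made up):

```python
# Illustrative load shedder: as measured load climbs, lower-priority request
# classes are rejected first.
PRIORITY = {"paying": 0, "trial": 1, "batch": 2}   # lower number = higher priority

def should_shed(request_class, load_factor):
    if load_factor < 0.8:
        return False                               # plenty of headroom
    allowed = 0 if load_factor >= 1.0 else 1       # overloaded: paying traffic only
    return PRIORITY[request_class] > allowed

print(should_shed("trial", load_factor=1.05))      # True: trial traffic is shed
```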

Degrade service gracefully.  If properly architected, the subsystems of an application should be independent, as much as possible.  In that case, large parts of the application can continue to function, even when other features are failing.  For example, the app could provide browse functionality, even if search is not working.  The app could allow downloads, even if uploads are not possible.  The app could allow purchases even if feedback features are not available.
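
A toy sketch of the same idea (the function and object names are hypothetical): the browse path keeps working even when the search backend is failing.

```python
# Illustrative graceful degradation: a search failure degrades one feature
# instead of taking down the whole page.
def render_catalog_page(query, browse_index, search_backend):
    items = browse_index.list_items()              # independent of search
    try:
        matches = search_backend.search(query)     # may fail during an outage
    except Exception:
        matches = None                             # degrade: hide search results only
    return {"items": items, "search_results": matches}
```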

Stackdriver’s Approach

To reduce the load on the system when customers view data over long intervals, the system runs batch aggregation jobs to precompute roll-ups (various functions over multiple time scales, e.g., the average over 30-minute intervals).  The aggregation jobs can fall behind when capacity is reduced by an instance or zone failure.  (Stackdriver currently runs the primary data pipeline in a single region.)  If the aggregation process falls behind, the service that retrieves data for the graphs can aggregate data dynamically from the raw data.  The cost of dynamic aggregation is additional load on the system and latency for the customers, but the long-range graphs remain available.
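
A simplified sketch of this fallback (the store objects and their fetch methods are hypothetical stand-ins, not our actual interfaces):

```python
# Illustrative fallback: serve precomputed roll-ups where they exist, and
# aggregate the remaining range from raw points on the fly.
def series_for_range(start, end, rollup_store, raw_store, window=1800):
    rolled = rollup_store.fetch(start, end)          # (window_start, average) pairs
    covered_until = rolled[-1][0] + window if rolled else start
    if covered_until < end:                          # aggregation has fallen behind
        raw = raw_store.fetch(covered_until, end)    # raw (timestamp, value) points
        rolled += aggregate(raw, window)             # slower, but graphs stay up
    return rolled

def aggregate(points, window):
    buckets = {}
    for ts, value in points:
        buckets.setdefault(ts - ts % window, []).append(value)
    return [(w, sum(v) / len(v)) for w, v in sorted(buckets.items())]
```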

The Stackdriver design has different pipelines for data processing and alerting. The pipelines operate independently.   Data processing can be delayed due to failures of an instance or zone.  If data processing is delayed, the alerting pipeline continues to function unimpeded.  While losing the ability to view the most current data in the dashboard is undesirable, the alerting features, which are more important to customers, continue working.

Even within the data processing pipeline, different functions can operate independently.  The data processing pipeline archives data to a durable object store and indexes it into Cassandra and ElasticSearch.  Each of these functions can be impacted by instance or zone failures.  If, for example, ElasticSearch is unavailable, the system still pushes data to the archive and to Cassandra.  When ElasticSearch becomes available again, the delayed data is flushed through the pipeline.  Note also that the archival pipeline is designed to be as simple as possible since it is among the most critical functions in the system.

Fix It Fast

In the worst case, a failure happens and you cannot hide from it.  There is no sugar-coating it--the service is down.  Your users see the outage and cannot use your service.  Your best option is to fix it...fast!

You should not have to rely on this strategy very often.  If you do, you are “doing it wrong” and will suffer significant customer-visible downtime.

There are several design and engineering approaches that can help you fix problems fast.

Revert code.  This approach is useful for instance failures due to a single faulty commit.  Simply revert the commit.  Re-deploy the last known working code.  Recovery can be completed in the time it takes to deploy code.

Provision and deploy new instances.  This approach can address instance, zone, or region failures.  If the failure is due to resources crashing, dying, or otherwise becoming unavailable or unreachable, replace them.  If the lost resources are stateless, then replacing the failed resources should restore service.  Recovery can be completed in the time it takes to provision and configure new resources.
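
As an illustration (the AMI, subnet, and instance type are placeholders), replacing failed stateless workers with boto3 might look roughly like this:

```python
# Illustrative sketch: launch replacement instances from a known-good,
# configuration-managed image in a healthy zone.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.run_instances(
    ImageId="ami-0123456789abcdef0",     # pre-baked image (placeholder)
    InstanceType="m5.large",
    MinCount=2,
    MaxCount=2,
    SubnetId="subnet-0abc1234",          # subnet in an unaffected zone (placeholder)
)
```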

Restore from replicas.  This approach, again, can address instance, zone, or region failures.  To handle zone and region failures, the replicas must, obviously, be located outside the impacted fault domain.  If the failure is due to lost state, then restore the state from replicas or backups.  If replicas are kept online (as with a database slave), recovery can be completed simply by promoting the slave.  If the backup is not online, recovery will be delayed while the state is restored from backup.
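
As one concrete example of promoting an online replica (shown here with an RDS read replica; the identifier is a placeholder):

```python
# Illustrative sketch: promote a standby RDS read replica to a standalone
# primary after the original primary is lost.
import boto3

rds = boto3.client("rds", region_name="us-west-2")
rds.promote_read_replica(DBInstanceIdentifier="app-db-replica")  # placeholder name
```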

Document procedures for recovering from failures.  Identify the biggest threats to your system and predict their impact.  Think through a recovery plan for those scenarios.

Practice disaster recovery.  You probably do not want to bring down your whole application for a fire drill, but you should induce failures and practice recovery regularly.  Such exercises will help you test your designs for fault tolerance, recovery procedures, and tools for recovery.  The tests will help you find ways to recover faster from real failures in the future.

Stackdriver’s Approach

Thankfully, we do not have too many stories to tell to illustrate this mitigation strategy.  Faults occur in our system as with any other system, but most of the faults are handled via masking and bounding, allowing us to make the fixes in the background without affecting our customers.  Our automation systems allow us to revert code or provision new instances quickly, if needed.  Typically, however, we have enough resources online to mask problems.   Also, we have designed responses for several disaster scenarios.

We have had one major incident that tested our recovery plans.  The fault domain was not one of the ones we have considered so far; it was a failure of one of our core subsystems.  In October 2013, the Cassandra cluster that was storing and serving the time series measurement data failed catastrophically, due to our own error.  When we realized we would not be able to recover the cluster, we began to execute previously discussed plans to provision a new Cassandra cluster and re-populate it with historical data.  You can read more about this incident in the published post-mortem.  While the recovery still took hours, the procedures we had previously designed and the automation we had previously built were key to recovering as quickly as we did.  For example, we had developed scripts to provision and deploy a new Cassandra cluster automatically.  Also, we had built many components with the ability to recover their state from historical data archives.

Application

Use this framework to help you think about the failures that can happen in your system and how the system currently responds.  Evaluate your system by iterating through its components and comparing your fault mitigation strategy against the Four Hamiltons.  If you rely on “fix it fast!” too often, you are building a system designed for frequent downtime.  Improve the reliability of your system by upgrading your fault mitigation strategy.

Admittedly, there is a cost associated with implementing each of these strategies.  And typically, the cost is higher for the strategies that have less customer impact.  Pragmatically, you may decide to use a mitigation strategy that has a bigger customer impact because of the trade-offs among development costs, operational costs, and outage costs.

Conclusion

Delivering large, highly available services in the public clouds is a relatively new art--best practices are still emerging.  Designing the fault-tolerant distributed systems that provide always-on services requires an understanding of the types of faults that threaten a system and the techniques for mitigating them.  James Hamilton gives us a list of fault mitigation strategies: 1) avoid the fault (avoid it!), 2) mask the fault with redundancy (mask it!), 3) minimize the fault impact with small fault zones (bound it!), and 4) minimize the fault impact with fast recovery (fix it fast!).  The public clouds themselves define the fault domains that must be considered: instance, zone, and region.  Combining these lists provides a rich framework for considering alternatives for handling failures in these big systems.  Applying this framework can make your system more reliable, which makes your customers happy and you well-rested.