Building Fault-Tolerant Applications on AWS

Below is my summary of the Amazon Web Services (AWS) whitepaper released in October 2011 by Jeff Barr, Attila Narin, and Jinesh Varia.

AWS aims to simplify the task of building and maintaining fault-tolerant distributed systems/services for its customers. To this end, AWS advises customers to embrace the following philosophy when building their applications on AWS.

1. Make computing nodes disposable/easily replaceable

AWS (as with any cloud provider, actually) employs a level of indirection over the physical computer, called a virtual machine (VM), to make computing nodes easily replaceable. You then need a template to define your service instance over a VM, and this is called an Amazon Machine Image (AMI). The first step towards building fault-tolerant applications on AWS is to create your own AMIs. Starting your application is then simply a matter of launching VM instances on Amazon Elastic Compute Cloud (EC2) using your AMI. Once you have created an AMI, replacing a failing instance is very simple; you can just launch a replacement instance that uses the same AMI as its template. This can be done programmatically through an API invocation.

In short, AWS wants you to embrace an instance as the smallest unit of failure and make it easily replaceable. AWS helps you automate this process and make it more transparent by providing elastic IP addresses and elastic load balancing. To minimize downtime, you may keep a spare instance running, and easily fail over to this hot instance by quickly remapping your elastic IP address to this new instance. Elastic load balancing further facilitates this process by detecting unhealthy instances within its pool of Amazon EC2 instances and automatically rerouting traffic to healthy instances, until the unhealthy instances have been restored.
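To make the "route around, then replace" idea concrete, here is a minimal in-memory sketch. It is a toy model, not AWS code: the `Pool` and `Instance` classes, the instance IDs, and the `ami-example` name are all invented for illustration, and a real deployment would call the EC2 and load balancer APIs instead.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Instance:
    instance_id: str
    ami: str
    healthy: bool = True


class Pool:
    """Toy stand-in for a load-balanced pool of instances (hypothetical, not an AWS API)."""

    def __init__(self, ami, size):
        self.ami = ami
        self._ids = count(1)
        self.instances = [self._launch() for _ in range(size)]

    def _launch(self):
        # In a real system this would be an API call that starts a VM from the AMI.
        return Instance(f"i-{next(self._ids):04d}", self.ami)

    def route(self):
        """Return only the instances that should receive traffic."""
        return [inst for inst in self.instances if inst.healthy]

    def replace_unhealthy(self):
        """Swap every unhealthy instance for a fresh one built from the same AMI."""
        replaced = []
        for idx, inst in enumerate(self.instances):
            if not inst.healthy:
                replaced.append(inst.instance_id)
                self.instances[idx] = self._launch()
        return replaced


pool = Pool(ami="ami-example", size=3)
pool.instances[1].healthy = False  # simulate a failure of the second instance

serving_during_failure = [inst.instance_id for inst in pool.route()]
replaced_ids = pool.replace_unhealthy()
serving_after_repair = [inst.instance_id for inst in pool.route()]
```

While the failed instance is out, traffic goes only to the two healthy ones; after the replacement step the pool is back to full size, and every instance still comes from the same template.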

2. Make state/storage persistent

When you launch replacement service instances on EC2 as above, you also need to provide persistent state/data that these instances have access to. Amazon Elastic Block Store (EBS) provides block-level storage volumes for use with Amazon EC2 instances. Amazon EBS volumes are off-instance storage that persists independently from the life of an instance. Any data that needs to persist should be stored on Amazon EBS volumes, not on the local hard disk associated with an Amazon EC2 instance, because that disappears when the instance dies. If the Amazon EC2 instance fails and needs to be replaced, the Amazon EBS volume can simply be attached to the new Amazon EC2 instance. EBS volumes store data redundantly, making them more durable than a typical hard drive. To further mitigate the possibility of a failure, backups of these volumes can be created using a feature called snapshots.

Of course this raises the question of "what is the sweet spot in storing to EBS", as it comes with a significant penalty over EC2 RAM, and a slight penalty over EC2 disk. I guess this depends on how far you can stretch the definition of "the data that needs to persist" in your application.

3. Rejuvenate your system by replacing instances

If you follow the first two principles, you can (and should) rejuvenate your system by periodically replacing old instances with new server instances transparently. This ensures that any potential degradation (software memory leaks, resource leaks, hardware degradation, filesystem fragmentation, etc.) does not adversely affect your system as a whole.
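One way to picture periodic rejuvenation is an age-based sweep over the fleet. The sketch below is only an illustration under assumed names: `rejuvenate`, the `launch` callback, the instance names, and the age threshold are all hypothetical, and the whitepaper does not prescribe a particular policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    name: str
    launched_at: float  # launch time in seconds on an arbitrary clock


def rejuvenate(fleet, now, max_age, launch):
    """Return a new fleet in which every instance older than max_age is replaced.

    `launch(now)` is a hypothetical callback that starts a fresh instance.
    Each old instance is swapped one for one, so the fleet size stays the same.
    """
    fresh = []
    for inst in fleet:
        if now - inst.launched_at > max_age:
            fresh.append(launch(now))
        else:
            fresh.append(inst)
    return fresh


_serial = iter(range(100, 200))


def launch(now):
    # Stand-in for "start a new instance from the AMI".
    return Instance(f"web-{next(_serial)}", now)


fleet = [Instance("web-1", 0.0), Instance("web-2", 500.0), Instance("web-3", 900.0)]
new_fleet = rejuvenate(fleet, now=1000.0, max_age=600.0, launch=launch)
new_names = [inst.name for inst in new_fleet]
```

Here only `web-1` has exceeded the assumed 600-second limit, so it alone is swapped for a newly launched instance while the younger two are left running.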

4. Use georeplication to achieve disaster tolerance

Amazon Web Services are available in 8 geographic "regions". Regions consist of one or more Availability Zones (AZ), are geographically dispersed, and are in separate geographic areas or countries. The Amazon EC2 service level agreement commitment is 99.95% availability for each Amazon EC2 Region. But in order to achieve the same availability in your application, you should deploy your application over multiple availability zones, for example by maintaining a fail-over site in another AZ as in the figure.
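A quick back-of-the-envelope calculation shows what these numbers mean. The sketch below converts the 99.95% figure into hours of allowed downtime per year, and then shows why a second site helps. The 99% per-site availability is a made-up illustrative value, and the combined figure assumes the sites fail independently, which real deployments only approximate.

```python
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(availability):
    """Hours per year a service may be down at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


def combined_availability(per_site, sites):
    """Availability when any one of `sites` replicas suffices.

    Assumes failures are independent across sites (an idealization).
    """
    return 1.0 - (1.0 - per_site) ** sites


# 99.95% is the regional SLA commitment quoted above.
sla_downtime = downtime_hours_per_year(0.9995)

# 0.99 is a hypothetical per-site availability, chosen only for illustration.
one_site = combined_availability(0.99, 1)
two_sites = combined_availability(0.99, 2)
```

At 99.95%, the allowance works out to roughly 4.4 hours of downtime per year. Under the independence assumption, two sites that are each 99% available together reach about 99.99%.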

5. Leverage other Amazon Web Services as fault-tolerant building blocks

Amazon Web Services offers a number of other services (Amazon Simple Queue Service, Amazon Simple Storage Service, Amazon SimpleDB, and Amazon Relational Database Service) that can be incorporated into your application development. These services are fault-tolerant, so by using them, you will be increasing the fault tolerance of your own applications.

Let's take the Amazon Simple Queue Service (SQS) as an example. SQS is a highly reliable distributed messaging system that can serve as the backbone of your fault-tolerant application. Once a message has been pulled by an instance from an SQS queue, it becomes invisible to other instances (consumers) for a configurable time interval known as a visibility timeout. After the consumer has processed the message, it must delete the message from the queue. If the time interval specified by the visibility timeout has passed, but the message isn't deleted, it is once again visible in the queue and another consumer will be able to pull and process it. This two-phase model ensures that no queue items are lost if the consuming application fails while it is processing a message. Even in an extreme case where all of the worker processes have failed, Amazon SQS will simply store the messages for up to 4 days.
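The receive-then-delete protocol is easy to see in a small simulation. The following is an in-memory toy model of a queue with a visibility timeout. It is not the SQS API: the `ToyQueue` class, its method names, the message body, and the 30-second timeout are all assumptions made for illustration, and time is passed in explicitly so the behavior is deterministic.

```python
class ToyQueue:
    """In-memory model of a queue with a visibility timeout (hypothetical, not SQS)."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self._messages = {}         # message id -> body
        self._invisible_until = {}  # message id -> time it becomes visible again
        self._next_id = 0

    def send(self, body):
        self._next_id += 1
        self._messages[self._next_id] = body
        self._invisible_until[self._next_id] = 0.0
        return self._next_id

    def receive(self, now):
        """Phase one: hand out one visible message and hide it for the timeout period."""
        for msg_id, body in self._messages.items():
            if self._invisible_until[msg_id] <= now:
                self._invisible_until[msg_id] = now + self.visibility_timeout
                return msg_id, body
        return None

    def delete(self, msg_id):
        """Phase two: the consumer confirms it has finished processing."""
        self._messages.pop(msg_id, None)
        self._invisible_until.pop(msg_id, None)


queue = ToyQueue(visibility_timeout=30.0)
queue.send("resize image 42")

first = queue.receive(now=0.0)    # consumer A takes the message, then crashes
hidden = queue.receive(now=10.0)  # consumer B sees nothing while it is hidden
retry = queue.receive(now=31.0)   # timeout expired: B receives the same message
queue.delete(retry[0])            # B finishes processing and deletes it
empty = queue.receive(now=100.0)  # nothing left in the queue
```

Because consumer A never calls `delete`, the message reappears once the timeout elapses and consumer B can pick it up. The flip side is that a message may be processed more than once, so consumers in this style of design generally need to tolerate duplicates.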

Conclusions:

Netflix uses its Chaos Monkey tool to constantly and unpredictably kill some of its service instances, in an attempt to ensure that its services are built in a fault-tolerant and resilient fashion and to expose and resolve hidden problems with those services.

OK, after reading this, you can say that at some level cloud computing fault tolerance is boring: the prescribed fault correction action is to just replace the failed instance with a new instance. And if you say this, it will make the AWS folks happy, because this is the goal that they are trying to achieve. They want to make faults uninteresting and automatically dealt with. Unfortunately, not all faults can be isolated at the instance level; the real world isn't that simple. There are many different types of faults, such as misconfigurations, unanticipated faults, application-level heisenbugs, and bohrbugs, that won't fit into this mold. I intend to investigate these remaining nontrivial types of faults, and how to deal with them. I am especially interested in exploring what role self-stabilization can play here. Another point that didn't get coverage in this whitepaper is how to detect faults and unhealthy instances. I would be interested to learn what techniques are employed in practice by AWS applications for this.