High Availability on AWS: Designing for Zero Downtime
Definition
High Availability (HA) is the ability of a system to remain operational and accessible for a very high percentage of time, minimizing unplanned downtime. On AWS, HA is achieved by eliminating single points of failure through Multi-AZ deployments, load balancing, auto-scaling, health checks, and managed service HA features.
AWS expresses availability targets as percentages in its Service Level Agreements (SLAs): 99.9% (8.76 hours downtime/year), 99.99% (52.6 minutes/year), and 99.999% (5.26 minutes/year). Designing for higher tiers requires progressively more redundancy and automation.
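Those downtime figures follow directly from the percentages: multiply the allowed-downtime fraction by the minutes in a year. A quick sketch of the arithmetic (plain Python, using a 365-day year to match the figures above):

```python
def downtime_per_year_minutes(sla_percent: float) -> float:
    """Minutes of allowed downtime per year for a given availability SLA.

    Uses a 365-day year (525,600 minutes), matching the figures above.
    """
    minutes_per_year = 365 * 24 * 60  # 525,600
    return minutes_per_year * (1 - sla_percent / 100)

for sla in (99.9, 99.99, 99.999):
    # 99.9% -> ~525.6 min (8.76 h); 99.99% -> ~52.6 min; 99.999% -> ~5.3 min
    print(f"{sla}%: {downtime_per_year_minutes(sla):.2f} minutes/year")
```

Each extra "nine" cuts the allowed downtime by a factor of ten, which is why each SLA tier demands a step change in redundancy rather than incremental tuning.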
How It Works
HA on AWS is built on the concept of redundancy across isolated failure domains — primarily Availability Zones (AZs). Each AZ is a physically separate data center cluster with independent power, cooling, and networking. By distributing your workload across multiple AZs, a single facility failure does not take your application offline.
The core HA pattern:
- Deploy across 2+ AZs within a Region.
- Place a load balancer (ALB/NLB) in front of compute to distribute traffic.
- Use Auto Scaling to replace failed instances and handle demand spikes.
- Implement health checks at the load balancer and application level.
- Use managed HA databases (RDS Multi-AZ, Aurora, DynamoDB) instead of self-managed databases.
- Decouple components with SQS, SNS, or EventBridge so one component's failure does not cascade.
AWS-managed services like S3 (99.999999999% durability, 99.99% availability SLA), DynamoDB (99.999% SLA for Global Tables), and Aurora (99.99% SLA) are inherently highly available — you get HA by default when using them.
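The value of redundancy across failure domains can be made concrete with basic probability. Assuming independent failures (a simplification — real AZ failures can be correlated), components in series multiply their availabilities, while N redundant copies fail only if all N fail at once. A minimal sketch; the example availabilities are illustrative, not AWS SLA figures:

```python
def serial_availability(*components: float) -> float:
    """Availability of a chain where every component must be up."""
    result = 1.0
    for a in components:
        result *= a
    return result

def parallel_availability(a: float, n: int) -> float:
    """Availability of n redundant copies; down only if all n are down."""
    return 1 - (1 - a) ** n

single_instance = 0.995  # hypothetical: one instance in one AZ
two_az = parallel_availability(single_instance, 2)  # same workload in 2 AZs
print(f"{two_az:.6f}")  # 0.999975 -- redundancy adds 'nines'

# A serial chain (load balancer -> compute -> database) is only as good
# as the product of its layers, so every layer needs its own redundancy:
stack = serial_availability(0.9999, two_az, 0.9995)
print(f"{stack:.6f}")
```

This is why the pattern above insists on redundancy at every layer: one non-redundant component in the serial chain caps the availability of the whole stack.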
Key Features and Limits
- SLA tiers: EC2 carries a 99.99% SLA per Region; S3 Standard is 99.99%; DynamoDB is 99.99% (99.999% for Global Tables); Route 53 carries a 100% availability SLA.
- Multi-AZ deployments: ELB, Auto Scaling Groups, RDS Multi-AZ, ElastiCache Multi-AZ, and EFS are all Multi-AZ capable.
- Health checks: ELB health checks (HTTP/TCP), Route 53 health checks, Auto Scaling health checks, and custom CloudWatch-based checks.
- RDS Multi-AZ: synchronous standby replica in another AZ; automatic failover typically completes in 60-120 seconds; the standby serves no read traffic (use Read Replicas for that).
- Aurora HA: 6 copies of data across 3 AZs; automatic failover to read replica in ~30 seconds; Aurora Serverless v2 scales automatically.
- DynamoDB HA: data automatically replicated across 3 AZs; Global Tables add multi-Region replication.
- No single point of failure (SPOF): every layer — DNS, load balancer, compute, database, storage — must be redundant.
- Fault Injection Service (FIS): AWS service to test HA by injecting failures (terminate instances, throttle APIs, disrupt AZs) in a controlled way.
- Limits: HA within a single Region protects against AZ failures but not Region-wide outages. For Region-level resilience, you need disaster recovery (multi-Region).
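Health checks only catch what they test: a shallow check (the process answers HTTP) can keep a zombie instance in rotation, while a deep check probes critical dependencies so the load balancer takes the instance out of service. A minimal sketch of a deep-check handler — the function shape and dependency names are hypothetical, not an AWS API:

```python
def deep_health_check(dependencies):
    """Return (http_status, body) for an application /health endpoint.

    `dependencies` maps a name to a zero-argument callable that returns
    True when that dependency is reachable. Any failure yields 503, which
    an ALB target-group health check treats as unhealthy.
    """
    failing = [name for name, probe in dependencies.items() if not probe()]
    if failing:
        return 503, {"status": "unhealthy", "failing": failing}
    return 200, {"status": "healthy"}

# Example wiring -- the lambdas stand in for real connectivity probes:
status, body = deep_health_check({
    "database": lambda: True,
    "cache": lambda: True,
})
print(status, body["status"])  # 200 healthy
```

One design caution: a deep check that probes a shared dependency (e.g., the database) can mark every instance unhealthy at once during a database blip, so teams often keep the load balancer check shallow and alarm on the deep check separately.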
Common Use Cases
- Web applications — ALB + Auto Scaling Group across 3 AZs + RDS Multi-AZ.
- Microservices — ECS/EKS services spread across AZs behind ALB, with SQS queues decoupling synchronous dependencies.
- Database tier — RDS Multi-AZ for relational workloads; Aurora for higher availability; DynamoDB for key-value with automatic HA.
- Stateless APIs — Lambda (inherently Multi-AZ) behind API Gateway (99.95% SLA).
- File storage — EFS (Multi-AZ by default) or S3 (11 nines durability, 4 nines availability).
- Chaos engineering — use FIS to validate that your HA architecture actually survives AZ failures before they happen in production.
Pricing Model
HA itself is not a line item, but the architectural patterns that enable it have cost implications:
- Multi-AZ deployments double some costs: RDS Multi-AZ costs roughly 2x a single-AZ instance because you pay for the standby.
- Cross-AZ data transfer: ~$0.01/GB between AZs in the same Region — adds up for data-heavy workloads.
- Load balancers: ALB charges per hour + per LCU (Load Balancer Capacity Unit); NLB charges per hour + per NLCU.
- Auto Scaling: no additional charge for the Auto Scaling service itself; you pay for the EC2 instances launched.
- Aurora: storage is replicated 6 ways automatically — you pay for storage consumed, not per replica copy.
- FIS: charged per action-minute during experiments.
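To make the trade-off concrete, here is a rough monthly estimate combining the Multi-AZ standby premium and cross-AZ transfer. The rates are the illustrative figures quoted above, and the workload numbers are hypothetical — check current AWS pricing before relying on them:

```python
def ha_monthly_premium(single_az_rds_cost: float,
                       cross_az_gb: float,
                       transfer_rate_per_gb: float = 0.01) -> float:
    """Extra monthly cost of Multi-AZ over single-AZ, in dollars.

    - The RDS standby roughly doubles the instance cost, so the premium
      is one extra instance-equivalent.
    - Cross-AZ traffic is billed per GB (illustrative $0.01/GB figure).
    """
    standby_premium = single_az_rds_cost
    transfer_cost = cross_az_gb * transfer_rate_per_gb
    return standby_premium + transfer_cost

# Hypothetical workload: $300/month instance, 2 TB of cross-AZ traffic
print(f"${ha_monthly_premium(300, 2000):.2f}/month")  # $320.00/month
```

Even at this scale the premium is usually small next to the revenue lost in the 52 minutes of annual downtime that a 99.99% target already concedes — which is the comparison the next paragraph asks you to make.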
The cost of HA is always weighed against the cost of downtime. For most production workloads, Multi-AZ is non-negotiable.
Pros and Cons
Pros
- AWS infrastructure (AZs, managed services) makes HA achievable without building your own data centers.
- Managed services like Aurora, DynamoDB, and S3 provide HA out of the box.
- Auto Scaling and ELB automate failure detection and recovery.
- SLAs with financial credits create accountability.
- FIS enables proactive HA validation.
Cons
- Multi-AZ deployments increase costs (roughly 2x for RDS Multi-AZ, cross-AZ data transfer fees).
- HA does not equal disaster recovery — a Regional outage can still take down a Multi-AZ deployment.
- Complexity increases with each layer of redundancy.
- Stateful components (databases, caches, sessions) are harder to make HA than stateless ones.
- Testing HA requires deliberate chaos engineering, which many teams skip.
Comparison with Alternatives
| | High Availability (Multi-AZ) | Disaster Recovery (Multi-Region) | Fault Tolerance |
| --- | --- | --- | --- |
| Scope | Single Region, multiple AZs | Multiple Regions | Component-level |
| Protects against | AZ failure, instance failure | Region failure, widespread outage | Individual component failure |
| Typical SLA target | 99.99% | 99.999%+ | Zero data loss, zero downtime |
| Cost | Moderate (2x for some services) | High (full or partial stack in 2nd Region) | Highest (active-active everything) |
| Complexity | Moderate | High | Very high |
| Example | ALB + ASG + RDS Multi-AZ | Aurora Global DB + Route 53 failover | Real-time replication + instant failover |
For most workloads: start with Multi-AZ HA, add DR for critical systems, and reserve fault tolerance for the most demanding components.
Exam Relevance
- Cloud Practitioner (CLF-C02) — know that Multi-AZ provides HA, understand AZs as isolated data centers, and recognize that managed services like S3 and DynamoDB are inherently HA.
- Solutions Architect Associate (SAA-C03) — heavy coverage: design Multi-AZ architectures, choose between RDS Multi-AZ and Aurora, configure Auto Scaling with ELB, eliminate SPOFs. Classic question: "How do you make this architecture highly available?" → spread across AZs, add ELB, enable RDS Multi-AZ.
- Solutions Architect Professional (SAP-C02) — advanced: cross-Region HA with Aurora Global Database, Route 53 failover routing, and Global Accelerator.
Exam trap: RDS Multi-AZ standby is for failover only — it does not serve read traffic. For read scaling, use Read Replicas.
Frequently Asked Questions
Q: What is the difference between high availability and disaster recovery?
A: High availability protects against component and AZ failures within a single Region, targeting 99.99% or higher uptime. Disaster recovery protects against Region-level failures or catastrophic events by replicating workloads to a second Region. HA is your first line of defense (Multi-AZ); DR is your second (Multi-Region). Most production workloads need HA; mission-critical workloads also need DR.
Q: Does RDS Multi-AZ provide both high availability and read scaling?
A: No. RDS Multi-AZ maintains a synchronous standby replica in another AZ for automatic failover, but the standby does not serve read traffic. For read scaling, create Read Replicas (asynchronous, can serve reads). Aurora combines both: it has up to 15 read replicas that also serve as failover targets, providing HA and read scaling in one architecture.
Q: How do I test that my architecture is actually highly available?
A: Use AWS Fault Injection Service (FIS) to simulate failures in a controlled way — terminate EC2 instances, disrupt AZ connectivity, throttle API calls, or stress CPU/memory. FIS integrates with CloudWatch to monitor application behavior during experiments. You can also manually test by terminating instances and verifying that Auto Scaling replaces them and the load balancer routes around failures.
This article reflects AWS features and pricing as of 2026. AWS services evolve rapidly — always verify against the official AWS documentation before making production decisions.