AWS Disaster Recovery: Strategies, RTO/RPO, and Key Services

Definition

Disaster Recovery (DR) on AWS is the process of preparing for and recovering from events that make your primary workload unavailable, whether an AZ outage, Region-level failure, data corruption, or human error. AWS defines four DR strategies, listed here in order of increasing cost and faster recovery: Backup & Restore, Pilot Light, Warm Standby, and Multi-Site Active/Active.

Two metrics govern every DR plan: Recovery Time Objective (RTO) — how quickly you must be back online — and Recovery Point Objective (RPO) — how much data loss is acceptable, measured in time since the last good backup or replication point.
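Both metrics reduce to differences between timestamps, which the following minimal sketch illustrates. The incident timeline and the function names are hypothetical and only serve to make the definitions concrete.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_recovery_point: datetime, failure_time: datetime) -> timedelta:
    """Data-loss window: time between the last usable recovery point and the failure."""
    return failure_time - last_recovery_point


def achieved_rto(failure_time: datetime, service_restored: datetime) -> timedelta:
    """Downtime: time between the failure and full service restoration."""
    return service_restored - failure_time


# Hypothetical incident timeline (illustrative values only).
last_backup = datetime(2026, 1, 10, 2, 0)
failure = datetime(2026, 1, 10, 9, 30)
restored = datetime(2026, 1, 10, 13, 0)

rpo = achieved_rpo(last_backup, failure)
rto = achieved_rto(failure, restored)
print(f"Achieved RPO: {rpo}, achieved RTO: {rto}")
```

In this example the workload lost 7.5 hours of data and was down for 3.5 hours. Whether that is acceptable depends on the targets the business has set.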

How It Works

The 4 DR Strategies

1. Backup & Restore (lowest cost, highest RTO/RPO)

Back up data to S3 and use AWS Backup to automate snapshots of EBS, RDS, DynamoDB, and EFS.
In a disaster, restore from backups and rebuild infrastructure (via CloudFormation/CDK).
RTO: hours. RPO: hours (depends on backup frequency).
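As a rough sketch of how backup frequency is expressed, the function below builds a dictionary shaped like the `BackupPlan` argument of the AWS Backup `CreateBackupPlan` API, with a daily schedule and a cross-Region copy. The plan name, vault names, account ID, and retention period are placeholders, and the snippet only constructs the structure; it does not call AWS.

```python
def build_backup_plan(plan_name, vault_name, dr_vault_arn, retention_days=35):
    """Return a dict shaped like the BackupPlan argument of CreateBackupPlan."""
    return {
        "BackupPlanName": plan_name,
        "Rules": [
            {
                "RuleName": "daily",
                "TargetBackupVaultName": vault_name,
                # Daily at 05:00 UTC, so the worst-case RPO is roughly 24 hours.
                "ScheduleExpression": "cron(0 5 ? * * *)",
                "Lifecycle": {"DeleteAfterDays": retention_days},
                # Copy each recovery point to a vault in the DR Region.
                "CopyActions": [
                    {
                        "DestinationBackupVaultArn": dr_vault_arn,
                        "Lifecycle": {"DeleteAfterDays": retention_days},
                    }
                ],
            }
        ],
    }


# Placeholder names and ARN, not real resources.
plan = build_backup_plan(
    "example-dr-plan",
    "example-primary-vault",
    "arn:aws:backup:us-west-2:111122223333:backup-vault:example-dr-vault",
)

# With boto3 installed and credentials configured, this could be submitted as:
#   boto3.client("backup").create_backup_plan(BackupPlan=plan)
```

Shortening the schedule interval lowers the RPO at the price of more recovery points to store and copy.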

2. Pilot Light (low cost, moderate RTO)

Keep a minimal version of the environment running in the DR Region — core database replication only (RDS cross-Region Read Replica, Aurora Global Database).
Compute and application layers are pre-configured but not running.
In a disaster, scale up compute, promote the DB replica, and switch DNS.
RTO: tens of minutes. RPO: seconds to minutes (continuous replication).

3. Warm Standby (moderate cost, lower RTO)

A scaled-down but fully functional copy of the production environment runs in the DR Region.
All layers (compute, DB, networking) are active but at reduced capacity.
In a disaster, scale up to full production capacity and switch DNS.
RTO: minutes. RPO: seconds (near-real-time replication).

4. Multi-Site Active/Active (highest cost, near-zero RTO/RPO)

Full production capacity runs in 2+ Regions simultaneously, serving traffic from both.
Route 53 distributes traffic via latency-based or weighted routing.
Aurora Global Database or DynamoDB Global Tables handle data replication.
In a disaster, Route 53 health checks automatically route all traffic to the healthy Region.
RTO: near zero (automatic). RPO: near zero (Aurora Global Database and DynamoDB Global Tables replicate asynchronously by default, typically with a lag of about a second or less).
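The "switch DNS" step in the strategies above can be automated with Route 53 failover records. The sketch below builds a change batch in the shape accepted by the Route 53 `ChangeResourceRecordSets` API. The domain, IP addresses, and health check ID are placeholders, and the helper `failover_record` is a hypothetical name introduced for illustration.

```python
def failover_record(name, role, ip_address, health_check_id=None, ttl=60):
    """Build one UPSERT change for a Route 53 failover routing record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{name}-{role.lower()}",
        "Failover": role,
        "TTL": ttl,  # a short TTL lets clients pick up the change quickly
        "ResourceRecords": [{"Value": ip_address}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


# Placeholder values: documentation IP ranges and a made-up health check ID.
change_batch = {
    "Comment": "DR failover records (illustrative values)",
    "Changes": [
        failover_record("app.example.com.", "PRIMARY", "192.0.2.10",
                        health_check_id="hypothetical-health-check-id"),
        failover_record("app.example.com.", "SECONDARY", "198.51.100.10"),
    ],
}

# With boto3 and credentials configured, this could be applied as:
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="HOSTED_ZONE_ID", ChangeBatch=change_batch)
```

While the primary's health check passes, Route 53 answers with the primary record. Once the check fails, it answers with the secondary record without manual intervention.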

Key Features and Limits

AWS Backup: centralized, policy-driven backup service supporting 15+ AWS services; cross-Region and cross-account backup vaults; retention policies and compliance controls (Vault Lock).
AWS Elastic Disaster Recovery (DRS): continuous block-level replication of on-premises or cloud servers to AWS; automated failover and failback; replaces the former CloudEndure Disaster Recovery.
Route 53 failover routing: health-check-based DNS failover between primary and DR endpoints.
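Aurora Global Database: typically sub-second replication lag across Regions; a secondary Region can usually be promoted to read-write in under a minute.
DynamoDB Global Tables: multi-Region, multi-active replication. Reads and writes in each Region keep single-digit millisecond latency, while cross-Region replication is asynchronous by default and usually completes within about a second.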
S3 Cross-Region Replication (CRR): automatic asynchronous replication of objects to a bucket in another Region.
CloudFormation StackSets: deploy identical infrastructure across Regions and accounts for DR consistency.
Limits: Multi-Region DR adds cost (compute, storage, data transfer), complexity (data consistency, conflict resolution), and operational burden (testing, runbooks). Active/Active requires careful handling of write conflicts.
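One common way to handle write conflicts in an Active/Active design is last-writer-wins, which is also the approach DynamoDB global tables document for concurrent updates to the same item. The sketch below is a simplified illustration of that idea, not the service's implementation; the `Version` type and the tie-break on Region name are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    timestamp_ms: int
    region: str


def resolve_last_writer_wins(a: Version, b: Version) -> Version:
    """Keep the later write; break exact timestamp ties by Region name
    so every Region converges on the same winner."""
    return max(a, b, key=lambda v: (v.timestamp_ms, v.region))


# Two Regions updated the same item at nearly the same time (illustrative data).
east = Version("shipping=express", 1_700_000_000_500, "us-east-1")
west = Version("shipping=standard", 1_700_000_000_900, "us-west-2")

winner = resolve_last_writer_wins(east, west)
print(winner.value)  # shipping=standard
```

The losing write is discarded silently, so this approach suits data where the most recent value is the right one. Counters, inventories, and similar data usually need a different design, such as routing all writes for a given key to one Region.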

Common Use Cases

Compliance-mandated DR — financial services and healthcare regulations often require documented DR plans with tested RTO/RPO.
Business continuity for SaaS platforms — customers expect uptime SLAs that only Warm Standby or Active/Active can deliver.
Protection against data corruption — Backup & Restore with point-in-time recovery protects against accidental deletes and ransomware.
On-premises to AWS DR — use Elastic Disaster Recovery to replicate on-prem servers to AWS as a DR site.
Multi-Region global applications — Active/Active serves users from the nearest Region and provides DR as a side effect of the architecture.

Pricing Model

DR costs scale with the strategy:

Backup & Restore: S3 storage (~$0.023/GB/month for Standard), EBS snapshot storage (~$0.05/GB/month), AWS Backup service charges per backup and restore job. Cheapest option.
Pilot Light: cross-Region DB replication costs (Aurora Global Database secondary cluster charges for compute + storage) + minimal infrastructure in standby. Data transfer between Regions ~$0.02/GB.
Warm Standby: scaled-down compute (smaller EC2 instances or lower Fargate task count) + full DB replication. Roughly 20–50% of production cost.
Active/Active: full production infrastructure in 2+ Regions. Roughly 2x production cost, minus efficiency from load sharing.
Elastic Disaster Recovery: per replicated server per hour (~$0.028/hour per server).

Always calculate: cost of DR infrastructure vs cost of downtime to the business.
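A back-of-the-envelope version of that calculation might look like the sketch below. Every number in it (outage frequency, outage duration, hourly cost of downtime, and annual DR spend) is a made-up input for illustration and should be replaced with your own estimates.

```python
def expected_annual_downtime_cost(outages_per_year, hours_per_outage, cost_per_hour):
    """Expected yearly loss from downtime under simple averages."""
    return outages_per_year * hours_per_outage * cost_per_hour


def net_annual_benefit(dr_cost_per_year, baseline_loss, loss_with_dr):
    """Downtime loss avoided by the DR strategy, minus what the strategy costs."""
    return (baseline_loss - loss_with_dr) - dr_cost_per_year


# Illustrative assumptions: one major outage every two years, $20,000 per hour of downtime.
baseline = expected_annual_downtime_cost(0.5, 8, 20_000)        # 8-hour recovery
with_standby = expected_annual_downtime_cost(0.5, 0.25, 20_000)  # 15-minute recovery

benefit = net_annual_benefit(36_000, baseline, with_standby)
print(f"Net annual benefit: ${benefit:,.0f}")  # Net annual benefit: $41,500
```

A positive result suggests the DR spend pays for itself under these assumptions; a negative one suggests a cheaper strategy may be enough for that workload.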

Pros and Cons

Pros

AWS provides purpose-built DR services (Backup, DRS, Aurora Global, DynamoDB Global Tables) that simplify implementation.
Four-tier strategy model lets you match DR investment to business criticality.
Infrastructure as Code (CloudFormation, CDK) makes DR environments reproducible and testable.
Route 53 failover and health checks automate DNS switching.
S3 CRR and Aurora Global Database provide near-real-time cross-Region replication.

Cons

Multi-Region DR is expensive — Active/Active roughly doubles infrastructure cost.
Data consistency across Regions is hard — eventual consistency, conflict resolution, and split-brain scenarios require careful design.
DR plans that are not regularly tested often fail when needed.
Cross-Region data transfer costs add up for data-heavy workloads.
Operational complexity: runbooks, failover procedures, and failback procedures must be maintained.

Comparison with Alternatives

| | Backup & Restore | Pilot Light | Warm Standby | Active/Active |
| --- | --- | --- | --- | --- |
| RTO | Hours | Tens of minutes | Minutes | Near zero |
| RPO | Hours | Seconds–minutes | Seconds | Near zero |
| Cost | Lowest | Low | Moderate | Highest (~2x) |
| Complexity | Low | Moderate | Moderate–High | High |
| Infra in DR Region | None (rebuilt on demand) | DB replication only | Scaled-down full stack | Full production stack |
| Best for | Non-critical workloads, dev/test | Important workloads with moderate RTO | Business-critical with low RTO | Mission-critical, global apps |

Most organizations use different strategies for different workloads — Active/Active for the revenue-generating platform, Backup & Restore for internal tools.
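The per-workload choice can be sketched as picking the cheapest strategy whose typical RTO and RPO fit the targets. The numeric figures below are illustrative assumptions that turn the qualitative bands from the comparison ("hours", "tens of minutes", "seconds") into concrete values; they are not AWS guarantees.

```python
from datetime import timedelta

# Ordered cheapest first: (name, assumed achievable RTO, assumed achievable RPO).
STRATEGIES = [
    ("Backup & Restore", timedelta(hours=4), timedelta(hours=4)),
    ("Pilot Light", timedelta(minutes=30), timedelta(minutes=5)),
    ("Warm Standby", timedelta(minutes=5), timedelta(seconds=30)),
    ("Multi-Site Active/Active", timedelta(minutes=1), timedelta(seconds=5)),
]


def cheapest_strategy(rto_target, rpo_target):
    """Return the first (cheapest) strategy that meets both targets, or None."""
    for name, rto, rpo in STRATEGIES:
        if rto <= rto_target and rpo <= rpo_target:
            return name
    return None


print(cheapest_strategy(timedelta(minutes=15), timedelta(minutes=1)))  # Warm Standby
print(cheapest_strategy(timedelta(hours=24), timedelta(hours=24)))     # Backup & Restore
```

A result of `None` means the targets are tighter than any of the assumed bands, which is a signal to revisit either the targets or the assumptions.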

Exam Relevance

Cloud Practitioner (CLF-C02) — know the four DR strategies by name and their relative cost/RTO trade-offs. Understand RTO and RPO definitions.
Solutions Architect Associate (SAA-C03) — heavy coverage: match DR strategy to business requirements, design with Aurora Global Database, S3 CRR, Route 53 failover. Classic question: "The company needs RTO of 15 minutes and RPO of 1 minute" → Warm Standby or Active/Active.
Solutions Architect Professional (SAP-C02) — advanced multi-Region architectures, failover automation, data consistency trade-offs, cost optimization across DR tiers.

Exam trap: Pilot Light keeps the database replicated but compute is not running — do not confuse it with Warm Standby where compute is running at reduced scale.

Frequently Asked Questions

Q: What is the difference between RTO and RPO?

A: Recovery Time Objective (RTO) is the maximum acceptable time between a disaster and full service restoration. Recovery Point Objective (RPO) is the maximum acceptable amount of data loss measured in time — for example, an RPO of 1 hour means you can lose up to 1 hour of data. Together, RTO and RPO determine which DR strategy you need and how much it will cost.

Q: What is the difference between Pilot Light and Warm Standby?

A: In Pilot Light, only the data layer (database replication) is active in the DR Region — compute and application infrastructure are pre-configured but not running. In Warm Standby, a complete but scaled-down copy of the full production stack runs in the DR Region, including compute, application, and database layers. Warm Standby has a faster RTO because compute is already running and only needs to scale up.

Q: How do I test my DR plan on AWS?

A: Use AWS Fault Injection Service (FIS) to simulate failures such as an AZ power interruption or loss of cross-Region connectivity. Run regular DR drills: trigger a failover to the DR Region, verify application functionality, then fail back. Elastic Disaster Recovery includes built-in drill functionality that launches test instances without affecting production. Document results, measure actual RTO/RPO against targets, and update runbooks. Test on a regular schedule; quarterly drills are a common baseline.
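Measuring drill results against targets can be as simple as tracking the worst observed values. The sketch below assumes each drill is recorded as a dictionary with measured `rto` and `rpo` durations; the record format and the sample numbers are hypothetical.

```python
from datetime import timedelta


def evaluate_drills(drills, rto_target, rpo_target):
    """Compare the worst measured RTO and RPO across drills with the targets."""
    worst_rto = max(d["rto"] for d in drills)
    worst_rpo = max(d["rpo"] for d in drills)
    return {
        "worst_rto": worst_rto,
        "worst_rpo": worst_rpo,
        "rto_met": worst_rto <= rto_target,
        "rpo_met": worst_rpo <= rpo_target,
    }


# Illustrative drill results, not real measurements.
drills = [
    {"name": "Q1", "rto": timedelta(minutes=12), "rpo": timedelta(seconds=20)},
    {"name": "Q2", "rto": timedelta(minutes=19), "rpo": timedelta(seconds=45)},
]

report = evaluate_drills(
    drills, rto_target=timedelta(minutes=15), rpo_target=timedelta(minutes=1)
)
print(report["rto_met"], report["rpo_met"])  # False True
```

Here the second drill missed the 15-minute RTO target, which is the kind of finding that should feed back into the runbook before a real failover.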

This article reflects AWS features and pricing as of 2026. AWS services evolve rapidly — always verify against the official AWS Disaster Recovery documentation before making production decisions.