AWS Disaster Recovery: Strategies, RTO/RPO, and Key Services

Definition

Disaster Recovery (DR) on AWS is the process of preparing for and recovering from events that make your primary workload unavailable — whether an AZ outage, Region-level failure, data corruption, or human error. AWS defines four DR strategies of increasing cost and speed: Backup & Restore, Pilot Light, Warm Standby, and Multi-Site Active/Active.

Two metrics govern every DR plan: Recovery Time Objective (RTO) — how quickly you must be back online — and Recovery Point Objective (RPO) — how much data loss is acceptable, measured in time since the last good backup or replication point.
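The relationship between backup frequency and RPO can be made concrete with a small calculation. The sketch below is illustrative only: the function names and the sample targets are assumptions, not part of any AWS API. It treats the worst case for periodic backups as a failure just before the next backup runs, so the data-loss window equals the backup interval.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """Worst case for periodic backups: failure strikes just before the next
    backup, so everything written since the previous backup is lost."""
    return backup_interval


def meets_objectives(
    rto_target: timedelta,
    rpo_target: timedelta,
    estimated_recovery_time: timedelta,
    backup_interval: timedelta,
) -> bool:
    """True only if both the recovery time and the data-loss window fit the targets."""
    return (
        estimated_recovery_time <= rto_target
        and worst_case_data_loss(backup_interval) <= rpo_target
    )


# Hypothetical workload: RTO target 4 hours, RPO target 1 hour, 3-hour rebuild.
targets = dict(
    rto_target=timedelta(hours=4),
    rpo_target=timedelta(hours=1),
    estimated_recovery_time=timedelta(hours=3),
)
nightly_ok = meets_objectives(backup_interval=timedelta(hours=24), **targets)
hourly_ok = meets_objectives(backup_interval=timedelta(hours=1), **targets)
print(nightly_ok, hourly_ok)  # False True
```

In this example the recovery time is fine in both cases; only the nightly schedule misses the RPO, which is why RPO is usually tightened by changing backup or replication frequency rather than by adding compute.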

How It Works

The 4 DR Strategies

1. Backup & Restore (lowest cost, highest RTO/RPO)

  • Back up data to S3, use AWS Backup for automated snapshots of EBS, RDS, DynamoDB, and EFS.
  • In a disaster, restore from backups and rebuild infrastructure (via CloudFormation/CDK).
  • RTO: hours. RPO: hours (depends on backup frequency).

2. Pilot Light (low cost, moderate RTO)

  • Keep a minimal version of the environment running in the DR Region — core database replication only (RDS cross-Region Read Replica, Aurora Global Database).
  • Compute and application layers are pre-configured but not running.
  • In a disaster, scale up compute, promote the DB replica, and switch DNS.
  • RTO: tens of minutes. RPO: seconds to minutes (continuous replication).

3. Warm Standby (moderate cost, lower RTO)

  • A scaled-down but fully functional copy of the production environment runs in the DR Region.
  • All layers (compute, DB, networking) are active but at reduced capacity.
  • In a disaster, scale up to full production capacity and switch DNS.
  • RTO: minutes. RPO: seconds (near-real-time replication).

4. Multi-Site Active/Active (highest cost, near-zero RTO/RPO)

  • Full production capacity runs in two or more Regions simultaneously, and every Region serves live traffic.
  • Route 53 distributes traffic via latency-based or weighted routing.
  • Aurora Global Database or DynamoDB Global Tables handle data replication.
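  • In a disaster, Route 53 health checks mark the failed Region's endpoints unhealthy and DNS stops returning them, so traffic shifts to the healthy Region automatically.
  • RTO: near zero (automatic). RPO: near zero (replication is typically asynchronous with lag of about a second, so a small amount of in-flight data can still be lost).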
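The four tiers above can be read as a lookup: pick the cheapest strategy whose typical RTO and RPO both fit the targets. The sketch below shows that idea. The numeric values assigned to "hours", "tens of minutes", "minutes", and "near zero" are assumptions chosen for illustration, not AWS-published figures, and `cheapest_strategy` is a hypothetical helper.

```python
from datetime import timedelta
from typing import NamedTuple


class Strategy(NamedTuple):
    name: str
    typical_rto: timedelta  # assumed value for illustration
    typical_rpo: timedelta  # assumed value for illustration


# Ordered from cheapest to most expensive, mirroring the list above.
STRATEGIES = [
    Strategy("Backup & Restore", timedelta(hours=4), timedelta(hours=4)),
    Strategy("Pilot Light", timedelta(minutes=30), timedelta(minutes=1)),
    Strategy("Warm Standby", timedelta(minutes=5), timedelta(seconds=10)),
    Strategy("Multi-Site Active/Active", timedelta(seconds=30), timedelta(seconds=1)),
]


def cheapest_strategy(rto_target: timedelta, rpo_target: timedelta) -> str:
    """Return the first (cheapest) strategy whose typical RTO and RPO meet both targets."""
    for strategy in STRATEGIES:
        if strategy.typical_rto <= rto_target and strategy.typical_rpo <= rpo_target:
            return strategy.name
    # Targets tighter than every tier fall through to the most capable one.
    return STRATEGIES[-1].name


print(cheapest_strategy(timedelta(minutes=15), timedelta(minutes=1)))  # Warm Standby
```

Real selection also weighs cost, team skills, and data consistency requirements, so treat the output as a starting point for discussion rather than a decision.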

Key Features and Limits

  • AWS Backup: centralized, policy-driven backup service supporting 15+ AWS services; cross-Region and cross-account backup vaults; retention policies and compliance controls (Vault Lock).
  • AWS Elastic Disaster Recovery (DRS): continuous block-level replication of on-premises or cloud servers to AWS; automated failover and failback; replaces the former CloudEndure Disaster Recovery.
  • Route 53 failover routing: health-check-based DNS failover between primary and DR endpoints.
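  • Aurora Global Database: asynchronous cross-Region replication with typical lag under one second; a secondary Region can typically be promoted to read-write in about a minute.
  • DynamoDB Global Tables: multi-Region, multi-active replication; reads and writes in each Region keep single-digit millisecond latency, while changes usually propagate to other Regions within about a second.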
  • S3 Cross-Region Replication (CRR): automatic asynchronous replication of objects to a bucket in another Region.
  • CloudFormation StackSets: deploy identical infrastructure across Regions and accounts for DR consistency.
  • Limits: Multi-Region DR adds cost (compute, storage, data transfer), complexity (data consistency, conflict resolution), and operational burden (testing, runbooks). Active/Active requires careful handling of write conflicts.
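To illustrate the Route 53 failover routing item in the list above, the sketch below builds the pair of PRIMARY and SECONDARY records as plain Python dictionaries in the shape used by the Route 53 `ChangeResourceRecordSets` API. The domain name, IP addresses, and health check ID are placeholders, and `failover_record` is a hypothetical helper. Nothing here calls AWS.

```python
from typing import Optional


def failover_record(
    name: str,
    ip_address: str,
    role: str,
    set_identifier: str,
    health_check_id: Optional[str] = None,
    ttl: int = 60,
) -> dict:
    """Build one UPSERT change for a failover-routed A record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": set_identifier,  # distinguishes records sharing a name
        "Failover": role,
        "TTL": ttl,  # a short TTL lets clients pick up the failover sooner
        "ResourceRecords": [{"Value": ip_address}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


change_batch = {
    "Comment": "Failover pair for a hypothetical app endpoint",
    "Changes": [
        failover_record("app.example.com.", "203.0.113.10", "PRIMARY",
                        "primary-region", health_check_id="PLACEHOLDER-HEALTH-CHECK-ID"),
        failover_record("app.example.com.", "198.51.100.20", "SECONDARY",
                        "dr-region"),
    ],
}

# With boto3 installed and credentials configured, this batch could be sent via
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="<your hosted zone ID>", ChangeBatch=change_batch)
```

The health check is attached to the primary record because Route 53 needs it to decide when to answer with the secondary instead. Production setups often use alias records to load balancers rather than fixed IP addresses.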

Common Use Cases

  1. Compliance-mandated DR — financial services and healthcare regulations often require documented DR plans with tested RTO/RPO.
  2. Business continuity for SaaS platforms: customers expect uptime SLAs that usually call for Warm Standby or Active/Active.
  3. Protection against data corruption — Backup & Restore with point-in-time recovery protects against accidental deletes and ransomware.
  4. On-premises to AWS DR — use Elastic Disaster Recovery to replicate on-prem servers to AWS as a DR site.
  5. Multi-Region global applications — Active/Active serves users from the nearest Region and provides DR as a side effect of the architecture.

Pricing Model

DR costs scale with the strategy:

  • Backup & Restore: S3 storage (~$0.023/GB/month for Standard), EBS snapshot storage (~$0.05/GB/month), plus AWS Backup charges based on the amount of backup storage and the amount of data restored. Cheapest option.
  • Pilot Light: cross-Region DB replication costs (Aurora Global Database secondary cluster charges for compute + storage) + minimal infrastructure in standby. Data transfer between Regions ~$0.02/GB.
  • Warm Standby: scaled-down compute (smaller EC2 instances or lower Fargate task count) + full DB replication. Roughly 20–50% of production cost.
  • Active/Active: full production infrastructure in 2+ Regions. Roughly 2x production cost, minus efficiency from load sharing.
  • Elastic Disaster Recovery: per replicated server per hour (~$0.028/hour per server).

Always calculate: cost of DR infrastructure vs cost of downtime to the business.
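One way to frame that comparison is expected annual downtime cost avoided versus annual DR spend. The sketch below is a simplified model under stated assumptions: outage frequency, hourly downtime cost, and recovery times are all hypothetical inputs you would estimate for your own workload, and the function names are illustrative.

```python
def expected_annual_downtime_cost(
    outages_per_year: float, hours_down_per_outage: float, cost_per_hour: float
) -> float:
    """Expected yearly loss = outage frequency x outage length x hourly cost."""
    return outages_per_year * hours_down_per_outage * cost_per_hour


def net_annual_benefit(
    dr_annual_cost: float,
    baseline_recovery_hours: float,
    dr_recovery_hours: float,
    outages_per_year: float,
    cost_per_hour: float,
) -> float:
    """Downtime cost avoided by the faster strategy, minus what that strategy costs."""
    hours_saved = baseline_recovery_hours - dr_recovery_hours
    avoided = expected_annual_downtime_cost(outages_per_year, hours_saved, cost_per_hour)
    return avoided - dr_annual_cost


# Hypothetical inputs: one major outage every two years, $10,000 per hour of downtime,
# 8 hours to recover from backups vs 15 minutes with a standby costing $30,000 per year.
benefit = net_annual_benefit(
    dr_annual_cost=30_000,
    baseline_recovery_hours=8,
    dr_recovery_hours=0.25,
    outages_per_year=0.5,
    cost_per_hour=10_000,
)
print(benefit)  # 8750.0
```

A positive result suggests the faster strategy pays for itself on these assumptions; a negative one suggests a cheaper tier may be enough. The model ignores reputational and contractual penalties, which often dominate for customer-facing systems.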

Pros and Cons

Pros

  • AWS provides purpose-built DR services (Backup, DRS, Aurora Global, DynamoDB Global Tables) that simplify implementation.
  • Four-tier strategy model lets you match DR investment to business criticality.
  • Infrastructure as Code (CloudFormation, CDK) makes DR environments reproducible and testable.
  • Route 53 failover and health checks automate DNS switching.
  • S3 CRR and Aurora Global Database replicate across Regions continuously (S3 CRR asynchronously, usually within minutes; Aurora typically with sub-second lag).

Cons

  • Multi-Region DR is expensive — Active/Active roughly doubles infrastructure cost.
  • Data consistency across Regions is hard — eventual consistency, conflict resolution, and split-brain scenarios require careful design.
  • DR plans that are not regularly tested often fail when needed.
  • Cross-Region data transfer costs add up for data-heavy workloads.
  • Operational complexity: runbooks, failover procedures, and failback procedures must be maintained.

Comparison with Alternatives

| | Backup & Restore | Pilot Light | Warm Standby | Active/Active |
| --- | --- | --- | --- | --- |
| RTO | Hours | Tens of minutes | Minutes | Near zero |
| RPO | Hours | Seconds–minutes | Seconds | Near zero |
| Cost | Lowest | Low | Moderate | Highest (~2x) |
| Complexity | Low | Moderate | Moderate–High | High |
| Infra in DR Region | None (rebuilt on demand) | DB replication only | Scaled-down full stack | Full production stack |
| Best for | Non-critical workloads, dev/test | Important workloads with moderate RTO | Business-critical with low RTO | Mission-critical, global apps |

Most organizations use different strategies for different workloads — Active/Active for the revenue-generating platform, Backup & Restore for internal tools.

Exam Relevance

  • Cloud Practitioner (CLF-C02) — know the four DR strategies by name and their relative cost/RTO trade-offs. Understand RTO and RPO definitions.
  • Solutions Architect Associate (SAA-C03) — heavy coverage: match DR strategy to business requirements, design with Aurora Global Database, S3 CRR, Route 53 failover. Classic question: "The company needs RTO of 15 minutes and RPO of 1 minute" → Warm Standby or Active/Active.
  • Solutions Architect Professional (SAP-C02) — advanced multi-Region architectures, failover automation, data consistency trade-offs, cost optimization across DR tiers.

Exam trap: Pilot Light keeps the database replicated but compute is not running — do not confuse it with Warm Standby where compute is running at reduced scale.

Frequently Asked Questions

Q: What is the difference between RTO and RPO?

A: Recovery Time Objective (RTO) is the maximum acceptable time between a disaster and full service restoration. Recovery Point Objective (RPO) is the maximum acceptable amount of data loss measured in time — for example, an RPO of 1 hour means you can lose up to 1 hour of data. Together, RTO and RPO determine which DR strategy you need and how much it will cost.

Q: What is the difference between Pilot Light and Warm Standby?

A: In Pilot Light, only the data layer (database replication) is active in the DR Region — compute and application infrastructure are pre-configured but not running. In Warm Standby, a complete but scaled-down copy of the full production stack runs in the DR Region, including compute, application, and database layers. Warm Standby has a faster RTO because compute is already running and only needs to scale up.

Q: How do I test my DR plan on AWS?

A: Use AWS Fault Injection Service (FIS) to simulate AZ or Region failures. Run regular DR drills: trigger a failover to the DR Region, verify application functionality, then fail back. Elastic Disaster Recovery includes built-in drill functionality that launches test instances without affecting production. Document results, measure actual RTO/RPO against targets, and update runbooks. Test on a regular schedule; quarterly drills are a common baseline.
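Measuring actual RTO/RPO from a drill comes down to three timestamps. The sketch below assumes you record when the failure was declared, when service was restored in the DR Region, and the time of the last write that reached the DR Region. The `DrillResult` class and `evaluate` function are hypothetical names, and the sample times are invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    failure_declared: datetime
    service_restored: datetime
    last_replicated_write: datetime  # newest write present in the DR Region

    @property
    def actual_rto(self) -> timedelta:
        """Time from declaring the failure to serving traffic again."""
        return self.service_restored - self.failure_declared

    @property
    def actual_rpo(self) -> timedelta:
        """Window of writes missing from the DR Region at failure time."""
        return self.failure_declared - self.last_replicated_write


def evaluate(drill: DrillResult, rto_target: timedelta, rpo_target: timedelta) -> dict:
    """Compare measured values against targets for the drill report."""
    return {
        "actual_rto": drill.actual_rto,
        "actual_rpo": drill.actual_rpo,
        "rto_met": drill.actual_rto <= rto_target,
        "rpo_met": drill.actual_rpo <= rpo_target,
    }


drill = DrillResult(
    failure_declared=datetime(2026, 1, 15, 10, 0, 0),
    service_restored=datetime(2026, 1, 15, 10, 12, 30),
    last_replicated_write=datetime(2026, 1, 15, 9, 59, 40),
)
report = evaluate(drill, rto_target=timedelta(minutes=15), rpo_target=timedelta(minutes=1))
print(report["rto_met"], report["rpo_met"])  # True True
```

Keeping each drill's report alongside the runbook makes it easier to spot RTO drift as the environment grows.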


This article reflects AWS features and pricing as of 2026. AWS services evolve rapidly — always verify against the official AWS Disaster Recovery documentation before making production decisions.

Published: 4/17/2026 / Updated: 4/26/2026

