Disaster recovery in the cloud: RPO, RTO and tested restores

The pattern is grimly consistent. An engineer pages the on-call at 2am, the primary region is down, and everyone turns to the backups — which exist, were taken automatically, and have never once been tested in anger. The restore procedure references a VPC that was deleted in a cost-optimisation sprint. An IAM role used by the recovery automation has an expired trust policy. The database snapshot was taken against a schema that is forty-two migrations behind the application code running in production today. The team discovers, during the incident, what "having disaster recovery" actually means.

This is not a tooling failure. AWS, GCP, and Azure all offer mature, well-documented backup and replication services. The failure is structural: teams treat a configured backup job as equivalent to a validated recovery capability. They are not the same thing. A backup is a precondition for DR; it is not DR. Real disaster recovery is a loop — define your targets, build an architecture that can meet them, and then rehearse the actual recovery on a schedule until doing it is boring. The loop only closes when you have actually executed a restore.

This article is the practical, hands-on version of that loop.

RPO and RTO: the two numbers that determine every decision

Two metrics define what a DR plan must deliver. Everything else follows from them.

Recovery Point Objective (RPO) is the maximum acceptable data loss, expressed as elapsed time. An RPO of one hour means you must be able to recover to a system state no older than one hour before the failure event. If your most recent backup is twenty-three hours old when the disaster strikes, you have failed your RPO by twenty-two hours before a single recovery step has been executed.

Recovery Time Objective (RTO) is the maximum acceptable elapsed time from the moment of failure to restored, traffic-serving operation. An RTO of four hours means your systems must be back online within four hours of the failure event, not four hours after the incident is declared or after the on-call team finishes debating the failure mode.

Both numbers are contractual commitments to the business. The cost of violating them is not abstract.

Downtime cost benchmarks across enterprise segments

More than $300K/hr: cost threshold exceeded by 90%+ of mid-to-large enterprises (ITIC 2024)
$1M–$5M+/hr: range reported by 41% of large enterprises (ITIC 2024)
54%: share of significant outages that cost more than $100K (Uptime Institute 2024)

Source: ITIC Hourly Cost of Downtime Survey, 2024; Uptime Institute Annual Outage Analysis, 2024

The ITIC 2024 Hourly Cost of Downtime Survey found that over 90% of mid-size and large enterprises lose more than $300,000 per hour of downtime. For large enterprises specifically, 41% report hourly losses between $1M and $5M. The Uptime Institute's Annual Outage Analysis for 2024 found that 54% of operators said their most recent significant outage cost more than $100,000, with roughly 1 in 5 reporting that it exceeded $1 million. Gartner has long cited an approximate cross-industry baseline of $5,600 per minute — a figure widely reproduced in vendor research and useful for planning even though individual variance by industry and company size is enormous.

These are not tail risks. They are median experiences for organisations operating at scale. Setting deliberately loose RPO and RTO targets in exchange for a cheaper architecture is a legitimate business decision, but it must be made consciously, with the above numbers on the table, not by defaulting to whatever backup frequency the managed service offers out of the box.

The right method for setting these objectives is to work backwards from business impact, not forwards from infrastructure defaults:

What is the revenue or transaction throughput per hour for this system?
What regulatory or contractual SLA applies? Financial services regulated under DORA, healthcare systems under HIPAA, and payment processors under PCI-DSS all carry specific recovery requirements with enforcement teeth.
What is the realistic cost of customer churn if the system is visibly down for thirty minutes versus four hours versus a day?
What does tighter recovery actually cost — the infrastructure, the tooling, the operational headcount to maintain it?

The answers define a pair of numbers you can defend. Once you have them, every architecture decision and budget conversation has a clear frame.

Matching objectives to workload tiers

Applying your tightest targets uniformly across every system is expensive and usually unjustified. The more disciplined approach is to classify workloads by business criticality and assign tiered objectives.

| Tier | Example workloads | Target RPO | Target RTO | Typical strategy |
|------|------------------|------------|------------|-----------------|
| 0 (Critical) | Payments, auth, core API gateway | Under 1 min | Under 15 min | Active-active multi-region |
| 1 (Business-critical) | Customer-facing application, primary database | 15 min | 1 hour | Warm standby |
| 2 (Important) | Internal tools, reporting, analytics pipeline | 4 hours | 8 hours | Pilot light |
| 3 (Non-critical) | Dev environments, batch jobs, cold archives | 24 hours | 24–48 hours | Backup and restore |

This exercise forces honest conversations. Most organisations discover their Tier 0 list is shorter than they assumed. A significant fraction of infrastructure — internal dashboards, staging environments, non-real-time reporting — comfortably sits at Tier 2 or 3, where backup and restore is entirely adequate and the cost difference versus warm standby is substantial.

A worked example to anchor the trade-off: a fintech processing $50M in daily transactions has a throughput of roughly $35,000 per minute. Adding the approximate $5,600/minute baseline for recovery operational costs (infrastructure spin-up, incident staff, SLA penalties), a 4-hour outage against the payment service carries an illustrative cost in the range of $9.7M. Against that exposure, a warm standby configuration costing $40,000 per month is not expensive: a full year of it ($480,000) is about 5% of the cost of one such incident. The same calculation applied to an internal billing dashboard yields a completely different answer, which is precisely the point of the exercise.
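The arithmetic in this example can be reproduced with a few lines of Python. This is a minimal sketch, assuming a flat per-minute overhead and evenly spread daily volume; the function names are illustrative and not part of any tool.

```python
MINUTES_PER_DAY = 24 * 60


def outage_cost(daily_volume_usd, outage_minutes, overhead_per_minute_usd=5_600):
    """Illustrative outage cost: lost throughput plus a flat per-minute overhead."""
    throughput_per_minute = daily_volume_usd / MINUTES_PER_DAY
    return (throughput_per_minute + overhead_per_minute_usd) * outage_minutes


def annual_standby_share(incident_cost_usd, standby_monthly_usd):
    """One year of standby spend as a fraction of a single incident's cost."""
    return (12 * standby_monthly_usd) / incident_cost_usd


cost = outage_cost(daily_volume_usd=50_000_000, outage_minutes=4 * 60)
share = annual_standby_share(cost, standby_monthly_usd=40_000)
print(f"4-hour outage, illustrative cost: ${cost:,.0f}")
print(f"One year of standby as a share of that cost: {share:.1%}")
```

Running the same two functions with the figures for a low-criticality internal tool is a quick way to show why it belongs in a cheaper tier.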

The four strategy tiers: what you actually buy

AWS formalised four DR strategies in their whitepaper "Disaster Recovery of Workloads on AWS: Recovery in the Cloud." GCP and Azure document essentially the same progression under different names. Understanding the real characteristics of each tier is prerequisite to choosing correctly.

Typical maximum RTO by DR strategy (illustrative upper bounds, in minutes)

Backup and restore: >480 min
Pilot light: 60–240 min
Warm standby: 30–60 min
Active-active: <10 min

Source: AWS Disaster Recovery of Workloads on AWS whitepaper

Backup and restore is the baseline. Regular snapshots — S3 object exports, RDS automated backups, EBS snapshots — are taken on a schedule, stored durably, and restored when needed. Cost is low because you pay only for snapshot storage. RTO is high: when disaster strikes, you are provisioning, configuring, and populating infrastructure from scratch, under incident conditions, with engineers who may not have done this before. Realistic RTO for any reasonably sized production workload is 4 to 24+ hours. This tier is appropriate only for Tier 2 and Tier 3 workloads.

Pilot light keeps a minimal version of core infrastructure running continuously in the recovery region. Your database is replicated via a read replica or AWS DMS continuous replication; DNS and compute are pre-configured but not serving traffic. On failover, you promote the replica, scale up compute, and update routing. The ongoing cost (a running replica and an idle compute configuration) buys you an RTO of 1 to 4 hours instead of 4 to 24+.

Warm standby runs a scaled-down but fully functional copy of your production stack in the recovery region at all times. All services are running; traffic is simply not routed there. Failover is a load-balancer or DNS update plus an auto-scaling event — not an infrastructure provisioning exercise. RTO is typically under 1 hour, and under 15 minutes for well-instrumented stacks with pre-warmed compute. This is the practical default for most Tier 1 workloads.

Active-active runs full capacity in two or more regions simultaneously, with traffic distributed across them via weighted DNS or a global load balancer. There is no failover event: the affected region is simply removed from the traffic pool. RTO is effectively near-zero. RPO is also near-zero with synchronous or near-synchronous replication. Active-active is the most expensive tier to build and by far the most complex to operate correctly. Consistency semantics across regions, conflict resolution in distributed writes, and the operational burden of managing dual-region deployments are all real costs. It is justified for Tier 0 workloads where the cost of any downtime exceeds the cost of running parallel infrastructure.

The two most commonly applicable tiers, and why the gap is larger than it appears

Backup and restore (Tier 2–3)

RTO 4–24+ hours; RPO typically 24 hours
Cheapest: pay only for snapshot storage
Full infrastructure re-provision during an active incident
No replication running; last backup defines your recovery point

Warm standby (Tier 1)

RTO under 1 hour; RPO minutes with active replication
Moderate cost: a scaled-down stack running continuously
Failover is a DNS update and an auto-scale event, not a re-provision
Replication lag is your real-time, measurable RPO

Source: AWS Disaster Recovery of Workloads on AWS whitepaper

Automating the backup and replication layer

Manual backup discipline does not survive organisational change. Engineers rotate. Priorities shift. A quarterly backup verification task on a Jira board will eventually be missed, and it will be missed silently. Every backup must be automated, the schedule must be policy-enforced, and the storage must be immutable.

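On AWS, the foundation is AWS Backup with cross-region copy rules, combined with immutable storage: S3 Object Lock for backups you write to S3 buckets yourself, and AWS Backup Vault Lock for recovery points held in AWS Backup vaults.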

resource "aws_backup_plan" "production" {
  name = "production-dr-plan"
 
  rule {
    rule_name         = "daily-snapshot"
    target_vault_name = aws_backup_vault.primary.name
    schedule          = "cron(0 2 * * ? *)"  # 02:00 UTC daily
 
    lifecycle {
      delete_after = 35  # 35-day retention in primary region
    }
 
    copy_action {
      destination_vault_arn = aws_backup_vault.dr_region.arn
 
      lifecycle {
        delete_after = 14  # 14-day retention in DR region
      }
    }
  }
}

S3 Object Lock in Compliance mode prevents any principal — including the root account — from deleting or overwriting objects within the retention window. This is the primary defence against ransomware that has compromised your AWS credentials and is targeting your backups before encrypting production.

# Enable Object Lock when the bucket is created; this also turns on versioning, which Object Lock requires
aws s3api create-bucket \
  --bucket prod-backups-dr-eu \
  --region eu-west-1 \
  --create-bucket-configuration LocationConstraint=eu-west-1 \
  --object-lock-enabled-for-bucket
 
# Apply a default Compliance retention of 30 days
aws s3api put-object-lock-configuration \
  --bucket prod-backups-dr-eu \
  --object-lock-configuration '{
    "ObjectLockEnabled": "Enabled",
    "Rule": {
      "DefaultRetention": {
        "Mode": "COMPLIANCE",
        "Days": 30
      }
    }
  }'

For database replication in warm standby or active-active configurations, RDS cross-region read replicas are the standard mechanism. The replication lag on the replica is your real-world RPO — and it must be monitored continuously, not assumed.

# Create a cross-region read replica from your primary RDS instance
aws rds create-db-instance-read-replica \
  --db-instance-identifier prod-postgres-dr \
  --source-db-instance-identifier arn:aws:rds:us-east-1:123456789012:db:prod-postgres \
  --db-instance-class db.t3.medium \
  --availability-zone eu-west-1a \
  --region eu-west-1 \
  --no-auto-minor-version-upgrade \
  --no-publicly-accessible

Alert when ReplicaLag (CloudWatch metric, available under the RDS namespace) exceeds 80% of your RPO threshold. A replica lagging by 45 minutes when your target RPO is 15 minutes means you are already in breach of your recovery objective before any disaster has occurred.
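The 80% rule translates directly into an alarm threshold. The sketch below builds the parameters for such an alarm as a plain dictionary; the keys follow the keyword arguments of CloudWatch's `PutMetricAlarm` API (for example via boto3's `put_metric_alarm`), but the function name, alarm name, evaluation window, and missing-data policy are assumptions chosen for illustration. Note that `ReplicaLag` is reported in seconds.

```python
def replica_lag_alarm(db_instance_id, rpo_minutes, alert_fraction=0.8):
    """Build CloudWatch alarm parameters that fire at a fraction of the RPO."""
    if not 0 < alert_fraction <= 1:
        raise ValueError("alert_fraction must be in (0, 1]")
    threshold_seconds = rpo_minutes * 60 * alert_fraction
    return {
        "AlarmName": f"{db_instance_id}-replica-lag-rpo",  # hypothetical naming scheme
        "Namespace": "AWS/RDS",
        "MetricName": "ReplicaLag",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "Statistic": "Maximum",
        "Period": 60,                # evaluate one-minute datapoints
        "EvaluationPeriods": 3,      # assumption: three consecutive breaches
        "Threshold": threshold_seconds,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "breaching",  # assumption: no data means replication may be broken
    }


params = replica_lag_alarm("prod-postgres-dr", rpo_minutes=15)
print(params["Threshold"])  # threshold in seconds
```

Treating missing data as breaching is a deliberate choice here: a replica that stops reporting lag is at least as worrying as one that reports a high value.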

For Kubernetes workloads, Velero with cross-region S3 storage handles cluster state and persistent volume snapshots on a schedule:

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: production-hourly
  namespace: velero
spec:
  schedule: "0 * * * *"
  template:
    ttl: 168h0m0s        # 7-day retention
    storageLocation: aws-dr-region
    includedNamespaces:
      - production
      - platform
    snapshotVolumes: true

One additional automation layer worth implementing: periodic backup validation jobs. A nightly Lambda or Kubernetes CronJob that restores a recent snapshot into an isolated test environment, runs a schema validity check and a row-count comparison, and emits a success or failure metric costs almost nothing and catches credential drift, snapshot corruption, and schema incompatibility before they surface during an incident.
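The row-count comparison at the heart of such a validation job might look like the following. This is a sketch under stated assumptions: the per-table counts are gathered elsewhere (for example by querying the source at snapshot time and the restored copy afterwards), and `validate_restore` and its tolerance parameter are hypothetical names.

```python
def validate_restore(source_counts, restored_counts, max_missing_fraction=0.0):
    """Compare per-table row counts; an empty result means the restore passed."""
    failures = []
    for table, expected in source_counts.items():
        if table not in restored_counts:
            failures.append(f"{table}: missing from restored database")
            continue
        actual = restored_counts[table]
        allowed_shortfall = expected * max_missing_fraction
        if expected - actual > allowed_shortfall:
            failures.append(f"{table}: expected about {expected} rows, found {actual}")
    return failures


source = {"accounts": 1_000, "payments": 250_000}
restored = {"accounts": 1_000, "payments": 180_000}
for problem in validate_restore(source, restored):
    print("FAIL", problem)
```

The job would then emit a metric based on whether the returned list is empty, so a failed validation pages someone instead of sitting in a log.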

Running the restore drill: a structured approach

This is the section most teams skip, which is why most DR plans fail when they are needed. A restore drill is not a theoretical walkthrough or a tabletop exercise — it is a scheduled, documented execution of your actual recovery procedure in a production-parity environment, clocked against your RTO and RPO targets.

Structured DR restore drill

01. Define the drill scope
Specify which workload, which failure scenario (region failure, data corruption, ransomware), and the clean isolated environment you will restore into. Document your target RTO and RPO before the drill begins. These are the numbers you are testing against, not adjusting after the fact.

02. Verify the backup artifact
Confirm the backup you intend to restore from exists, is accessible from the recovery account and region, and has not been corrupted. Record the backup timestamp, which defines your theoretical RPO for this drill. A backup that exists but is inaccessible is not a backup.

03. Execute the restore from the runbook only
Follow the documented runbook with no improvisation. Record every manual step not in the runbook, every blocked action, and every unexpected dependency you encounter. These are bugs in the runbook, not one-off anomalies. If something is not in the runbook, it will not be in the runbook at 2am either.

04. Validate correctness, not just availability
Run smoke tests, schema integrity checks, and representative user journeys. Compare row counts, checksums, or event-log totals against the known source state. A restore that completes and returns a running application but serves corrupted data is more dangerous than one that fails cleanly, because it may go undetected.

05. Record actual RTO and RPO
Clock the wall time from the start of the drill to the moment the application passes its health checks and serves valid traffic. Compare against your targets. If you exceeded either target, you have a gap to close before the next drill, not a variance to document and revisit.

06. Remediate and rotate
Fix every undocumented dependency, every expired credential, and every gap found during the drill before closing the ticket. Rotate any credentials used during the drill. Update the runbook. The drill's output is a corrected runbook, not a report.

Source: ClimsTech Engineering

An untested DR plan is not a safety net. It is a document describing how you intended to recover.

— Production DR engineering principle

Run the drill on a quarterly cadence at minimum. For Tier 0 workloads, monthly is warranted. Automate what you can: a restore pipeline that can be triggered by a CI job and emits a pass/fail with actual RTO and RPO measurements converts the drill from a manual exercise into a continuously running regression test. When the drill is automated and unremarkable, you are in the right place.
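A pass/fail result with measured RTO and RPO can be captured in a small record type. The sketch below assumes the pipeline records three timestamps per drill; the `DrillResult` class and its field names are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    backup_taken_at: datetime      # timestamp of the backup that was restored
    failure_declared_at: datetime  # simulated moment of failure (drill start)
    service_healthy_at: datetime   # health checks and data validation first passed
    target_rpo: timedelta
    target_rto: timedelta

    @property
    def actual_rpo(self):
        """Data-loss window: how old the backup was at the moment of failure."""
        return self.failure_declared_at - self.backup_taken_at

    @property
    def actual_rto(self):
        """Elapsed time from failure to validated, traffic-serving operation."""
        return self.service_healthy_at - self.failure_declared_at

    @property
    def passed(self):
        return self.actual_rpo <= self.target_rpo and self.actual_rto <= self.target_rto


start = datetime(2024, 1, 1, 2, 0)
result = DrillResult(
    backup_taken_at=start - timedelta(minutes=10),
    failure_declared_at=start,
    service_healthy_at=start + timedelta(minutes=50),
    target_rpo=timedelta(minutes=15),
    target_rto=timedelta(hours=1),
)
print(result.actual_rpo, result.actual_rto, "PASS" if result.passed else "FAIL")
```

A CI job can exit non-zero when `passed` is false, which is what turns the drill into a regression test.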

Pitfalls that sink DR plans in production

Knowing what typically goes wrong is half the defence. These are the failure modes we see most often.

Credential expiry. IAM roles, cross-account trust policies, service account keys, and API tokens all have lifetimes — some explicit, some implicit through policy changes. The backup job that ran correctly last quarter may fail silently this quarter because a trust policy was tightened, a key was rotated, or an assumed role's maximum session duration was reduced. Audit every credential in your DR execution path — backup agents, replication jobs, recovery scripts — and monitor expiry actively. Silent credential failures are the single most common cause of undetected backup job outages.
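Explicit expiry dates are the easy part to monitor. As a minimal sketch, assuming you maintain an inventory of every credential in the DR path with its expiry date (the inventory format and the 14-day warning window are hypothetical):

```python
from datetime import datetime, timedelta, timezone


def expiring_credentials(credentials, now, warn_within=timedelta(days=14)):
    """Return names of credentials that expire within the warning window.

    credentials maps a name to its expiry datetime, or None if it has no
    explicit expiry.
    """
    flagged = []
    for name, expires_at in credentials.items():
        if expires_at is None:
            continue
        if expires_at - now <= warn_within:
            flagged.append(name)
    return sorted(flagged)


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
inventory = {
    "backup-agent-key": datetime(2024, 6, 10, tzinfo=timezone.utc),
    "replication-token": datetime(2025, 1, 1, tzinfo=timezone.utc),
    "recovery-role": None,
}
print(expiring_credentials(inventory, now))
```

A check like this only covers explicit lifetimes. Implicit breakage from a tightened trust policy is caught only by actually exercising the credential, which is one more reason to automate the restore itself.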

Schema drift. You take daily database snapshots. Over the following six months, forty-two schema migrations run against production. Your application now expects a table structure, index set, and set of foreign-key constraints that do not exist in the snapshot. A restore of that snapshot cannot be used directly by the current application. The fix is not complicated but must be enforced: version your migration scripts alongside your application code, test forward-migration from each backup generation in your automated drill, and fail the drill if the migration path is broken.
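The forward-migration gate can start with a simple completeness check before any SQL is run. This sketch assumes migrations are numbered with consecutive integers, which is one common convention but not the only one; `missing_migrations` is a hypothetical helper.

```python
def missing_migrations(snapshot_version, current_version, available_migrations):
    """List migration numbers needed to reach current_version that are not available."""
    needed = range(snapshot_version + 1, current_version + 1)
    available = set(available_migrations)
    return [version for version in needed if version not in available]


# A snapshot 42 migrations behind, with one script lost from the repository.
scripts_in_repo = [v for v in range(1, 143) if v != 120]
gaps = missing_migrations(snapshot_version=100, current_version=142,
                          available_migrations=scripts_in_repo)
print("drill FAILED, missing:" if gaps else "migration path complete", gaps)
```

Passing this check is necessary but not sufficient: the drill should still apply the migrations to the restored snapshot and run the application's own schema checks.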

Network configuration omission. A restored database in eu-west-1 is functionally useless if your application in us-east-1 cannot reach it because the security groups, VPC peering, Route 53 private hosted zone entries, and NAT gateway routes in the recovery environment were not included in the DR configuration. Every network dependency must be codified in infrastructure-as-code and deployed as part of the DR stack. Manual network configuration under incident pressure is a reliable source of hours-long delays.

Underestimating restore time at scale. Restoring a 50 GB PostgreSQL database from an RDS snapshot takes roughly 10 to 20 minutes. Restoring 5 TB takes proportionally longer. Restoring cold data from S3 Glacier incurs retrieval latency that can itself consume most of an RTO window. Teams with fast-growing data estates often find their actual restore time has quietly exceeded their RTO target because data volume grew by 10x but the recovery architecture was designed for the original scale. Measure actual restore times quarterly, rather than once at deployment time and never again.
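Between drills, a rough projection can flag when data growth is about to push restore time past the RTO. The sketch below assumes restore time scales linearly with data volume, which is a simplification (real restores have fixed overheads and storage-specific behaviour), so treat the result as a prompt to re-measure, not as a measurement.

```python
def projected_restore_minutes(measured_gb, measured_minutes, current_gb):
    """Scale the last measured restore time linearly to the current data volume."""
    throughput_gb_per_minute = measured_gb / measured_minutes
    return current_gb / throughput_gb_per_minute


def restore_exceeds_rto(measured_gb, measured_minutes, current_gb, rto_minutes):
    """True when the linear projection alone already consumes the whole RTO."""
    return projected_restore_minutes(measured_gb, measured_minutes, current_gb) > rto_minutes


# Last drill restored 50 GB in 20 minutes; the database is now 5 TB (5,000 GB).
print(projected_restore_minutes(50, 20, 5_000))
print(restore_exceeds_rto(50, 20, 5_000, rto_minutes=240))
```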

Dependency hell in application startup. Modern cloud applications depend on dozens to hundreds of external resources at startup: Secrets Manager entries, Parameter Store paths, ElastiCache clusters, Kafka topics, third-party API endpoints, feature flag services. A runbook that says "restore the database and start the application" will fail silently on the first missing dependency if those dependencies do not exist in the recovery environment. Build your runbook by doing a complete restore into a genuinely clean account with no pre-existing configuration, and document every failure as a required step.
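One way to surface every missing dependency at once, instead of discovering them one startup failure at a time, is a preflight step that runs before the application is started in the recovery environment. This is a sketch: the check functions are placeholders you would replace with real lookups against your secrets store, cache, message broker, and so on.

```python
def preflight(checks):
    """Run every dependency check and return the names of those that failed.

    checks maps a dependency name to a zero-argument callable that returns
    True when the dependency is present and reachable.
    """
    missing = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False  # an error while checking counts as a missing dependency
        if not ok:
            missing.append(name)
    return missing


def broken_lookup():
    raise ConnectionError("endpoint unreachable")


# Placeholder checks standing in for real lookups.
checks = {
    "secret: db-password": lambda: True,
    "kafka topic: payments": lambda: False,
    "feature flag service": broken_lookup,
}
print(preflight(checks))
```

Each name the preflight reports during a clean-account restore is a candidate for a new runbook step or a new infrastructure-as-code resource.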

Treating DR as a compliance artefact. The most dangerous DR anti-pattern is building a plan for an audit, documenting its existence in a wiki, and running it exactly once at launch. The plan degrades from the moment it is written: systems change, dependencies shift, credentials expire, data volumes grow, team members who wrote the runbook leave. A DR plan that has not been executed end-to-end in the last ninety days is, with high probability, not fully executable today.

Making DR observable: continuous readiness over point-in-time state

DR readiness should be a continuous, observable property of your system — not a binary "we have DR" statement from last year's audit. You need monitoring that tells you right now whether your recovery path is intact.

Four things to instrument and keep on your operational dashboard:

Backup job success rate. Every backup job should emit a metric. Alert immediately and urgently on any failure. Do not accept daily summary emails or weekly reports — by the time you read a digest, you may already be twenty-three hours behind your RPO with no awareness of it.

Replication lag as real-time RPO. For warm standby and active-active architectures, your RDS ReplicaLag metric, Kafka consumer lag, or DMS replication latency is not an infrastructure curiosity — it is your actual RPO as of this moment. Set an alarm at 80% of your RPO threshold. If your target is 15 minutes and your replica lag is 12 minutes, you need to know before it reaches 15.

Recovery artifact accessibility from the DR account. An S3 Object Lock configuration does not guarantee the objects are in the correct region, are encrypted under an accessible KMS key version, or are reachable from your recovery account's IAM boundary. Run a nightly synthetic restore-probe job that attempts the full access path — not just a HeadObject — from inside the recovery account, in the recovery region.

Drill recency. Track the date of the last successful end-to-end restore drill as a metric and alert when it exceeds your cadence threshold. "Days since last successful DR test" is the single most informative number on a reliability dashboard that includes DR, and the most commonly absent one.

A lightweight CloudWatch, Grafana, or Datadog dashboard combining these four signals gives on-call engineers immediate situational awareness about DR readiness. The goal is for DR posture to be as visible as your error rate or p99 latency — not something looked up in a Confluence document after an incident has started.
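The four signals can also be reduced to a single list of active alerts for a status panel. The sketch below assumes the raw values are already collected from your monitoring system; the class, field names, and the 90-day default cadence are illustrative.

```python
from dataclasses import dataclass


@dataclass
class ReadinessSignals:
    failed_backup_jobs: int       # failures since the last successful run
    replica_lag_seconds: float    # for example, the RDS ReplicaLag metric
    rpo_seconds: float            # target RPO for this workload
    restore_probe_ok: bool        # result of the nightly synthetic restore probe
    days_since_last_drill: int
    drill_cadence_days: int = 90  # quarterly cadence


def readiness_alerts(signals, lag_alert_fraction=0.8):
    """Return the DR readiness alerts that are currently active."""
    alerts = []
    if signals.failed_backup_jobs > 0:
        alerts.append("backup job failure")
    if signals.replica_lag_seconds >= lag_alert_fraction * signals.rpo_seconds:
        alerts.append("replication lag near RPO")
    if not signals.restore_probe_ok:
        alerts.append("recovery artifact not accessible from DR account")
    if signals.days_since_last_drill > signals.drill_cadence_days:
        alerts.append("restore drill overdue")
    return alerts


degraded = ReadinessSignals(failed_backup_jobs=1, replica_lag_seconds=780,
                            rpo_seconds=900, restore_probe_ok=False,
                            days_since_last_drill=120)
print(readiness_alerts(degraded))
```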

What to remember

Set RPO and RTO from business impact first — work backwards from revenue-per-minute and regulatory requirements, not forwards from backup tool defaults. Justify every number with a cost model.
Tier your workloads honestly: most systems do not warrant active-active. Overspending on Tier 0 infrastructure for Tier 2 workloads is as operationally harmful as underspending.
Automate every backup with cross-region copies and immutable object storage (S3 Object Lock in Compliance mode). Manual backup discipline does not survive a team rotation or a busy quarter.
Run a structured restore drill quarterly at minimum. Clock the actual RTO and RPO. Be suspicious of an early drill that finds nothing: it usually means you are not testing under realistic conditions.
Audit all credentials in your DR execution path for expiry on a continuous basis. Silent credential failure is the most common class of undetected backup breakage.
Monitor replication lag as a real-time RPO indicator and alert at 80% of your target threshold. If your replica is already lagging behind your RPO, you are in breach before any incident is declared.
Schema drift will break restores. Validate that every backup generation can be migrated forward to current application schema. Include this check in your automated drill as a hard pass/fail gate.
Treat DR readiness as a continuously observable property: backup success rate, replication lag, recovery artifact accessibility, and days since last successful drill belong on your operational dashboard, not in an annual review document.