DORA's four keys: a guardrail for the AI era, not a leaderboard

The four keys were never a benchmark to hit. They are a balanced diagnostic, and their entire value lives in the tension between two pairs: throughput — deployment frequency and change lead time — held against stability — change failure rate and failed-deployment recovery. Optimise either pair on its own and the number quietly lies to you. In 2026 that tension is the whole story: DORA's own research shows AI now lifts throughput while measurably degrading stability, so the real job of the four keys today is to act as a guardrail against shipping AI-accelerated change faster than you can keep it stable. And if you still believe there is an "elite" score to chase — DORA retired the elite, high, medium and low tiers in 2025.

Two pairs, and why any single key lies

Two of the metrics measure throughput: how quickly an idea reaches production. Two measure stability: what happens once it gets there. The finding that made DORA famous is that these do not trade off — the strongest teams are better at both at once.

Elite-performer benchmarks — the four keys

Key	Elite benchmark	What it measures
Deployment frequency	On-demand	deploy whenever you want
Change lead time	Under 1 day	commit → production
Change failure rate	~5%	deploys causing a problem
Failed-deploy recovery	Under 1 hr	time to restore service

Source: 2023 Accelerate State of DevOps. DORA retired these tiers in 2025 in favour of percentile distributions and seven team archetypes — read them as history, not a target.

The reason the four keys are reported together is that any one of them, alone, is trivial to game. Push deployment frequency in isolation and you ship more bad changes. Push change failure rate in isolation and you ship nothing, slowly, "to be safe." They are designed as a set precisely so they hold each other honest — speed and stability, or the number is not telling you the truth.

Throughput pair

Deployment frequency — a proxy for batch size, not raw speed
Change lead time — commit to running in production
Push these alone and you ship more, faster — bad changes included

Stability pair

Change failure rate — share of deploys that cause a problem
Failed-deployment recovery — time to restore service
Push these alone and you ship nothing, slowly, 'to be safe'

Two pairs that must move together. Read one without the other and the dashboard lies.

How tightly they couple is itself a signal. The four keys normally move in tandem, but in the 2024 data they briefly stopped: the medium-performing cluster posted a lower change failure rate than the high cluster — the first time that had happened (getDX, RedMonk). That is not a paradox to resolve; it is a reminder that a single key, pulled out of context, tells you nothing.

What each key actually measures

The definitions matter, because most disagreements about DORA are really disagreements about measurement.

Deployment frequency is a band, not a counter. DORA reports it as on-demand, daily-to-weekly, weekly-to-monthly, or monthly-to-every-six-months — deliberately, so teams chase small, frequent, reversible changes rather than a vanity number. It is a proxy for batch size. Deploying forty times a day means nothing if half of them are hotfixes; that simply shows up as a worsening change failure rate and recovery time.
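
To make the "band, not a counter" idea concrete, here is a minimal sketch that maps raw deploy timestamps to one of the named bands. The cutoffs, the function name `deployment_frequency_band` and the 180-day window are assumptions chosen for illustration; DORA names the bands but this mapping is not an official formula.

```python
from datetime import datetime, timedelta

# Upper bounds, in average days between deploys, for each band.
# These cutoffs are an illustrative assumption, not an official DORA formula.
BANDS = [
    (7, "daily-to-weekly"),
    (30, "weekly-to-monthly"),
    (180, "monthly-to-every-six-months"),
]

def deployment_frequency_band(deploy_times, now, window_days=180):
    """Map deploy timestamps in the trailing window to a named band."""
    since = now - timedelta(days=window_days)
    count = sum(1 for t in deploy_times if since <= t <= now)
    if count == 0:
        return "no deploys in window"
    days_between = window_days / count
    if days_between < 1:  # more than one deploy per day on average
        return "on-demand"
    for upper_bound, name in BANDS:
        if days_between <= upper_bound:
            return name
    return "less often than every six months"
```

Reporting the band rather than the raw count removes the incentive to inflate the number: going from 38 to 40 deploys a day changes nothing, while moving from monthly to weekly releases does.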

Change lead time uses Accelerate's precise definition — the time from code committed to code successfully running in production (Forsgren, Humble and Kim). It is measured that way on purpose: it is independent of task size, so a one-line fix and a large feature are scored on the same axis.

Change failure rate is the simplest of the four: failed deploys / total deploys over a window. Worked through: four deploys in a day, one of which causes a failure, is a 25% change failure rate. Failed-deployment recovery is time_resolved − time_created for the incident a change caused. DORA renamed it from MTTR in 2023 specifically to separate change-induced failures from external causes — a region outage is not your deploy pipeline's fault, and lumping them together poisons the metric.
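
The three arithmetic definitions above (lead time, change failure rate, recovery time) can be sketched in a few lines. The function names are hypothetical, and summarising lead time with a median is an assumption made here, not something the definitions require.

```python
from datetime import datetime
from statistics import median

def change_lead_time_hours(changes):
    """Median hours from commit to running in production.

    `changes` is a list of (committed_at, deployed_at) datetime pairs.
    """
    hours = [(deployed - committed).total_seconds() / 3600
             for committed, deployed in changes]
    return median(hours) if hours else None

def change_failure_rate(total_deploys, failed_deploys):
    """Failed deploys divided by total deploys over the same window."""
    return failed_deploys / total_deploys if total_deploys else None

def recovery_minutes(time_created, time_resolved):
    """time_resolved - time_created for one change-induced incident."""
    return (time_resolved - time_created).total_seconds() / 60

# The worked example from the text: four deploys, one failure.
print(change_failure_rate(4, 1))  # 0.25
```

Only incidents caused by a change should be passed to `recovery_minutes`; filtering out external causes such as a region outage happens before this step.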

Instrument all four from data you already have

The most useful and least appreciated fact about the four keys: you do not need a survey to measure them. Every input already exists in your tooling. Deployment events come from your CD tool or version control. Incidents come from your incident manager, or from issues tagged Incident. From those two streams, every key is a query.

Pipeline

Deploy events and incidents flow through an ETL step into a small warehouse; from those derived tables all four keys are simple queries, with no survey involved.
Source: After Google's open-source Four Keys reference architecture

In a warehouse, the change failure rate and recovery time fall out of a single query against those derived tables:

-- CFR and recovery over a 30-day window, from deploys + incidents
-- you already collect. No survey involved.
-- Column names here are simplified; adapt them to your own schema.
WITH win AS (
  SELECT TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY) AS since
)
SELECT
  COUNT(DISTINCT d.deploy_id)                          AS deploys,
  -- Count failed deploys, not incidents: one deploy can cause several incidents.
  COUNT(DISTINCT i.deploy_id)                          AS failed_deploys,
  SAFE_DIVIDE(COUNT(DISTINCT i.deploy_id),
              COUNT(DISTINCT d.deploy_id))             AS change_failure_rate,
  -- Deploys without an incident yield NULL here, which APPROX_QUANTILES ignores.
  APPROX_QUANTILES(
    TIMESTAMP_DIFF(i.time_resolved, i.time_created, MINUTE), 100
  )[OFFSET(75)]                                        AS recovery_p75_minutes
FROM four_keys.deployments d
LEFT JOIN four_keys.incidents i ON i.deploy_id = d.deploy_id
CROSS JOIN win
WHERE d.deployed_at >= win.since;

Google's open-source Four Keys project is the canonical do-it-yourself version of exactly this: an event-driven pipeline of webhook → Pub/Sub → Cloud Run ETL → BigQuery → dashboard, with derived tables four_keys.deployments, four_keys.changes and four_keys.incidents, and connectors for GitHub, GitLab and Cloud Build. One caveat that matters: the repository was archived on 23 January 2024. Use it as a reference design, not maintained software.

The thresholds the tool ships with are a useful sanity check, and they are blunter than the marketing benchmarks:

deployment frequency : multiple/day  > daily–weekly  > monthly    > monthly–6mo
change lead time     : under 1 day   > under 1 week  > 1 wk–1 mo  > 1–6 mo
time to restore      : under 1 hour  > under 1 day   > under 1 wk > over 1 wk
change failure rate  : green < 15%   | yellow 16–45% | red > 45%
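
As a rough illustration, the change-failure-rate row above can be turned into a small classifier. Note that the listed bands leave values between 15% and 16% unassigned; this sketch assumes inclusive upper bounds of 15% and 45%, and the function name `cfr_status` is hypothetical.

```python
def cfr_status(change_failure_rate):
    """Map a change failure rate (0.0 to 1.0) to the three colour bands."""
    if not 0.0 <= change_failure_rate <= 1.0:
        raise ValueError("change failure rate must be between 0 and 1")
    if change_failure_rate <= 0.15:  # assumed inclusive upper bound
        return "green"
    if change_failure_rate <= 0.45:  # assumed inclusive upper bound
        return "yellow"
    return "red"
```

Under these assumptions the 25% worked example from earlier lands in the yellow band.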

If you would rather not run a pipeline, the metrics ship off the shelf. GitLab has built-in DORA metrics via its analytics API and CI/CD analytics; GitHub-centric stacks commonly use LinearB, Sleuth, Swarmia, DX or Apache DevLake. The definitions differ subtly between tools — especially how a "deployment" and an "incident" are detected. Pin the definition before you compare anything, or you are comparing two different measurements and calling it a trend.

The bar always moved — "elite" was a direction, never a target

Here is the fact that should kill the leaderboard mindset for good: the thresholds were never fixed. The change-failure-rate bar has been moved repeatedly, and in 2025 DORA threw out the tiers altogether.

How the four keys' definitions and benchmarks have shifted

Year	Change	What shifted
2019	Four tiers	Elite/high/medium/low. Elite change failure rate banded at 0 to 15%.
2021	Reliability	A contextual fifth dimension joins the four keys.
2023	Names + bar	MTTR becomes failed-deployment recovery; elite CFR reported as a ~5% point estimate.
2024	Rework rate	A fifth key added; change failure rate treated as a proxy for rework.
2025	Tiers retired	Replaced by percentile distributions and seven team archetypes; top band near 0 to 2%.

Source: DORA — dora.dev/insights/dora-metrics-history; getDX; RDEL #115

Watch the change-failure-rate bar specifically: elite sat at 0 to 15% across the 2019, 2021 and 2022 reports; the 2023 report quoted a roughly 5% point estimate; and the 2025 percentile data puts the top band near 0 to 2% (getDX, scrums.com, RDEL #115). If "elite" meant 15% one year and 2% a few years later, it was never a finish line. It was a moving, context-dependent diagnostic the whole time.

The 2025 percentile snapshot is also a healthy dose of perspective on how rare the top of that old ladder really is. "On-demand" — the deploy cadence the elite tier was built around — turns out to be the exception, not the rule:

How teams actually deploy, 2025 — 'on-demand' is the exception, not the bar

Deploy cadence	Share of teams
On-demand (the old 'elite' cadence)	16.2%
Daily to weekly	21.9%
Less than monthly	23.9%

Source: RDEL #115, citing 2025 DORA data — indicative; selected bands

Stability is rarer still: only about 8.5% of teams post a change failure rate in the 0-to-2% band (RDEL #115, citing 2025 DORA data — treat as indicative). DORA replaced the four tiers with seven archetypes spanning throughput, stability, product performance, burnout and friction — from Harmonious High-Achievers (around 20%) down through Pragmatic Performers, Constrained by Process, Stable & Methodical, Legacy Bottleneck, Foundational Challenges, to High Impact / Low Cadence (around 7%). The point of seven named clusters instead of a ladder is that there is no single line to stand above.

The goal isn't better DORA metrics. It's similar to driving a car with the goal of acceleration. Accelerating to where? Do we care if the engine explodes?

— Bryan Finster, Walmart DevOps Dojo

The 2026 job: the stability pair is your AI guardrail

This is what makes the four keys more useful in 2026, not less. AI changed the shape of the data, and it changed it in a direction the four keys are built to catch.

In the 2024 DORA report, 75% of respondents already relied on AI for at least one daily responsibility, and 39% reported little-to-no trust in AI-generated code. The modelled effect was telling: a 25% increase in AI adoption was associated with +7.5% documentation quality, +3.4% code quality and +3.1% review speed — but −1.5% delivery throughput and −7.2% delivery stability (Google Cloud, 2024 DORA report). The 2025 report — about 5,000 professionals plus more than 100 hours of qualitative work — found AI use near-universal at 90%, with median time spent at roughly two hours a day and 30% still reporting little or no trust. AI's relationship to throughput flipped positive versus 2024. Its negative relationship with delivery stability persisted (Google Cloud, 2025 DORA report).

The 2025 DORA signal

As AI adoption rises, throughput climbs while stability tends to fall, and a high-quality platform is what closes the gap.
Source: Illustrative, after the 2025 DORA report

So the senior move is mechanical: as you scale AI, alert on the stability pair. Page when the rolling change failure rate or the failed-deployment recovery time trends up — even while deployment frequency and lead time are improving. That "even while" is the whole point. AI shipping more, faster, looks like a win on the throughput pair right up until the stability pair tells you what it cost.

# Page when the AI rollout is buying throughput at the cost of stability.
# Fires only when stability degrades WHILE throughput improves.
# Assumes deployments_total is a counter and the dora_* series are recording
# rules you publish yourself; adjust the names to match your metrics.
groups:
  - name: dora-stability-guardrail
    rules:
      - alert: ThroughputUpStabilityDown
        # `and` / `or` match series on identical label sets; add on(<label>)
        # if the throughput and stability series carry different labels.
        expr: |
          (rate(deployments_total[7d]) > rate(deployments_total[7d] offset 7d))
          and
          (dora_change_failure_rate_7d > dora_cfr_baseline * 1.25
           or dora_recovery_minutes_p75 > dora_recovery_p75_baseline * 1.25)
        for: 24h
        labels: { severity: page, team: platform }
        annotations:
          summary: "Throughput rising while change failure rate / recovery degrade"
          runbook: "Shrink batch size; tighten review on AI-authored changes."
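
For teams without a Prometheus setup, the same guardrail logic can be sketched as a plain function over weekly snapshots. The `WeeklySnapshot` type, its field names and the 1.25 tolerance mirror the rule above but are otherwise assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class WeeklySnapshot:
    deploys: int
    change_failure_rate: float   # 0.0 to 1.0
    recovery_p75_minutes: float

def stability_guardrail(current, previous, baseline, tolerance=1.25):
    """True when throughput improved while either stability key degraded."""
    throughput_up = current.deploys > previous.deploys
    cfr_degraded = (
        current.change_failure_rate > baseline.change_failure_rate * tolerance
    )
    recovery_degraded = (
        current.recovery_p75_minutes > baseline.recovery_p75_minutes * tolerance
    )
    return throughput_up and (cfr_degraded or recovery_degraded)
```

The conjunction is the design choice that matters: a stability dip alone is an ordinary incident signal, while a dip that coincides with rising throughput is the pattern the guardrail exists to catch.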

The 2025 thesis is that AI is an amplifier, not a shortcut: with a strong internal platform it lifts a good delivery system, and without one it can net-harm performance. Platform quality is the difference-maker, and 90% of organisations now run at least one internal platform. DORA paired this with a seven-part AI capabilities model: a clear AI stance, healthy data ecosystems, AI-accessible internal data, strong version control, a user-centric focus, quality internal platforms, and working in small batches. Notice how much of that is the same boring delivery hygiene the four keys have always rewarded. The context underneath is sobering, too: between the 2023 and 2024 reports the high-performing cluster shrank from 31% to 22% of respondents while the low cluster grew from 17% to 25%. The field got shakier, not steadier, which is exactly when a stability guardrail earns its keep.

Seven ways teams ruin the four keys — and the fix

Most of the damage done with DORA is self-inflicted, and it follows a small number of well-worn patterns. The fix is almost always the same shape: move the goal onto outcomes, and demote the four keys back to leading indicators a team owns.

Anti-pattern	Why it bites	Do this instead
Four keys as per-team OKRs or quarterly targets	Goodhart's law — the moment a metric is the target it stops measuring reality	Set OKRs on user and business outcomes; treat the keys as leading indicators a team owns
Ranking teams on a leaderboard	Rewards gaming and ignores context; a very low recovery time plus a high deploy count often hides a hotfix process running outside normal flow	Trend each team against its own past baseline; investigate anomalies, do not score them
Optimising deployment frequency in isolation	It is a proxy for batch size, not speed; chase it alone and you ship more bad changes	Always read the throughput pair against the stability pair
Gaming under pressure — splitting PRs, not logging small incidents	The dashboard then measures fear, not delivery	Keep metrics at team level, never individual; invest in psychological safety so incidents get logged honestly
Treating "elite" as a fixed target to hit	DORA retired the tiers in 2025 and moved the bar for years (CFR 0–15% to 5% to near 0–2%)	Anchor on trend and on your binding constraint; re-baseline when the methodology changes
Shipping AI-accelerated code without watching stability	2025 DORA: AI lifts throughput but degrades delivery stability — failure rate and recovery inflate silently	Make change failure rate and recovery the explicit guardrail for the AI rollout; invest in platform quality and small batches
Treating DORA as a verdict on engineering health	It measures delivery only — silent on whether you build the right thing, on architecture, and on burnout	Pair the four keys with product / user-value and team-health signals

Use them as a compass: diagnose, fix the constraint, re-measure

The metrics are a diagnostic, not a scoreboard. The loop that actually works is unglamorous and repeatable: measure the set, find the one constraint that binds, fix that, and confirm the metric moved before you touch anything else.

Using the four keys as a compass, not a target

Step	Action	What to do
01	Measure	Instrument the four keys from existing pipeline and incident data; no survey needed.
02	Diagnose	Find the binding constraint: slow reviews, manual deploys, flaky tests, or slow recovery?
03	Improve	Fix that one constraint: automate the deploy, add a canary, speed up the pipeline.
04	Re-measure	Confirm the metric moved, then find the next constraint. Repeat.
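
One simple way to approach the Diagnose step is to break lead time into stages and look for the stage that dominates. This is a sketch under assumptions: the stage names and the idea of ranking by share of total time are illustrative, not a method prescribed by DORA.

```python
def binding_constraint(stage_hours):
    """Return (stage, share_of_total) for the stage that takes the longest.

    `stage_hours` maps a stage name to its typical duration in hours.
    """
    total = sum(stage_hours.values())
    if total == 0:
        return None
    stage = max(stage_hours, key=stage_hours.get)
    return stage, stage_hours[stage] / total

# Hypothetical breakdown of one team's commit-to-production time.
print(binding_constraint({"review": 30, "ci": 2, "deploy": 8}))
# ('review', 0.75)
```

Here review accounts for three quarters of lead time, so speeding up the pipeline first would barely move the metric.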

What that loop produces, done honestly, is a clean before-and-after rather than a higher score. A cloud-modernisation programme reported by the consultancy Future Processing for a named client (ADIA) is a representative shape:

diagnosed

Before — the binding constraint

Change lead time around two months
Change failure rate over 30%
Manual, big-batch releases

re-measured

After — constraint removed

Change lead time around one day
Change failure rate under 10%
Around 50% lower client cloud cost

A modernisation programme measured with the four keys, before and after.
Source: Future Processing engineering blog (named client, vendor-reported); treat as indicative.

The distance the loop can cover is the most citable evidence DORA produced. The 2019 report — more than 31,000 professionals, with about 20% classed elite — found elite teams deployed roughly 208 times more frequently than low performers. The figures often quoted alongside it (around 106 times faster lead time, 2,604 times faster recovery, 7 times lower change failure rate) are real but more loosely sourced; treat the 208x as the solid number and the rest as directional. Capital One is frequently cited as a roughly 20x release-frequency jump without added production incidents, but the primary source is hard to pin down — keep it as a recognisable name, not a number to bank on. The honest reading is the same either way: the gap between strong and weak delivery is categorical and achievable, and the loop above is how you close it.

Where the keys stop

DORA tells you about delivery performance. It is silent on whether you are building the right thing, whether the architecture is sound, or whether your people are burning out. That is not a flaw — it is scope. The four keys measure one important dimension well, and the failure mode is treating them as a verdict on engineering as a whole. The 2025 archetypes lean into exactly this nuance by adding burnout and friction as explicit dimensions, because throughput bought at the cost of a wrecked team is not a win you get to keep. Pair the four keys with product and user-value signals and with team-health measures, and they become what they were always meant to be: a strong, honest compass for one part of the journey — not the map.