The four keys were never a benchmark to hit. They are a balanced diagnostic, and their entire value lives in the tension between two pairs: throughput — deployment frequency and change lead time — held against stability — change failure rate and failed-deployment recovery. Optimise either pair on its own and the number quietly lies to you. In 2026 that tension is the whole story: DORA's own research shows AI now lifts throughput while measurably degrading stability, so the real job of the four keys today is to act as a guardrail against shipping AI-accelerated change faster than you can keep it stable. And if you still believe there is an "elite" score to chase — DORA retired the elite, high, medium and low tiers in 2025.
Two pairs, and why any single key lies
Two of the metrics measure throughput: how quickly an idea reaches production. Two measure stability: what happens once it gets there. The finding that made DORA famous is that these do not trade off — the strongest teams are better at both at once.
| Key | Reported value | What it measures |
|---|---|---|
| Deployment frequency | On-demand | Deploy whenever you want |
| Change lead time | Under 1 day | Commit → production |
| Change failure rate | ~5% | Deploys causing a problem |
| Failed-deploy recovery | Under 1 hr | Time to restore service |
Source: 2023 Accelerate State of DevOps. DORA retired these tiers in 2025 in favour of percentile distributions and seven team archetypes — read them as history, not a target.
The reason the four keys are reported together is that any one of them, alone, is trivial to game. Push deployment frequency in isolation and you ship more bad changes. Push change failure rate in isolation and you ship nothing, slowly, "to be safe." They are designed as a set precisely so they keep each other honest: you need both speed and stability, or the number is not telling you the truth.
Throughput pair
- Deployment frequency — a proxy for batch size, not raw speed
- Change lead time — commit to running in production
- Push these alone and you ship more, faster — bad changes included
Stability pair
- Change failure rate — share of deploys that cause a problem
- Failed-deployment recovery — time to restore service
- Push these alone and you ship nothing, slowly, 'to be safe'
How tightly they couple is itself a signal. The four keys normally move in tandem, but in the 2024 data they briefly stopped: the medium-performing cluster posted a lower change failure rate than the high cluster — the first time that had happened (getDX, RedMonk). That is not a paradox to resolve; it is a reminder that a single key, pulled out of context, tells you nothing.
What each key actually measures
The definitions matter, because most disagreements about DORA are really disagreements about measurement.
Deployment frequency is a band, not a counter. DORA reports it as on-demand, daily-to-weekly, weekly-to-monthly, or monthly-to-every-six-months — deliberately, so teams chase small, frequent, reversible changes rather than a vanity number. It is a proxy for batch size. Deploying forty times a day means nothing if half of them are hotfixes; that simply shows up as a worsening change failure rate and recovery time.
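As a rough illustration of reporting a band rather than a counter, the sketch below maps a deploy count over a window to one of the four bands named above. The function name, the 90-day default window and the exact cutoffs (more than one per day, at least one per week, and so on) are assumptions for illustration, not DORA's published method.

```python
def deployment_frequency_band(deploy_count, window_days=90):
    """Map a deploy count over a window to a coarse frequency band.

    Cutoffs are illustrative assumptions, not an official DORA formula.
    """
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    per_day = deploy_count / window_days
    if per_day > 1:
        return "on-demand"
    if per_day >= 1 / 7:
        return "daily-to-weekly"
    if per_day >= 1 / 30:
        return "weekly-to-monthly"
    if per_day >= 1 / 180:
        return "monthly-to-every-six-months"
    # Slower than any of the four bands described in the text.
    return "less than once every six months"
```

Because the output is a band, a team that moves from 40 to 60 deploys a day sees no change, which is the intent: the band rewards small, frequent changes without inviting a race on the raw count.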
Change lead time uses Accelerate's precise definition — the time from code committed to code successfully running in production (Forsgren, Humble and Kim). It is measured that way on purpose: it is independent of task size, so a one-line fix and a large feature are scored on the same axis.
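A minimal sketch of that definition, assuming you can pair each change's commit timestamp with the timestamp of the deploy that put it in production. The function name and the choice of the median as the summary statistic are assumptions; the text above only fixes the start and end points.

```python
from datetime import datetime
from statistics import median


def change_lead_time_hours(changes):
    """Median hours from commit to successful production deploy.

    `changes` is an iterable of (committed_at, deployed_at) datetime pairs.
    Changes that have not reached production (deployed_at is None) are skipped.
    """
    durations = [
        (deployed_at - committed_at).total_seconds() / 3600
        for committed_at, deployed_at in changes
        if deployed_at is not None
    ]
    return median(durations) if durations else None
```

Note that nothing in the calculation looks at the size of the change, so a one-line fix and a large feature land on the same axis.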
Change failure rate is the simplest of the four: failed deploys / total deploys over a window. Worked through: four deploys in a day, one of which causes a failure, is a 25% change failure rate. Failed-deployment recovery is time_resolved − time_created for the incident a change caused. DORA renamed it from MTTR in 2023 specifically to separate change-induced failures from external causes — a region outage is not your deploy pipeline's fault, and lumping them together poisons the metric.
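The two stability formulas can be sketched in a few lines. The record shape (a `deploy_id` on each incident, plus `time_created` and `time_resolved`) and the function names are assumptions for illustration; the sketch counts failed deploys rather than incidents, so a deploy that triggers two incidents is still one failure.

```python
from datetime import datetime


def change_failure_rate(deploy_ids, incidents):
    """Share of deploys linked to at least one change-induced incident."""
    deploys = set(deploy_ids)
    if not deploys:
        return None
    failed = {i["deploy_id"] for i in incidents if i["deploy_id"] in deploys}
    return len(failed) / len(deploys)


def recovery_minutes(incident):
    """time_resolved - time_created for one incident, in minutes."""
    delta = incident["time_resolved"] - incident["time_created"]
    return delta.total_seconds() / 60


# Worked example from the text: four deploys in a day, one causes a failure.
deploys = ["d1", "d2", "d3", "d4"]
incidents = [
    {
        "deploy_id": "d3",
        "time_created": datetime(2025, 1, 6, 10, 0),
        "time_resolved": datetime(2025, 1, 6, 10, 45),
    }
]
print(change_failure_rate(deploys, incidents))  # 0.25
print(recovery_minutes(incidents[0]))           # 45.0
```

Incidents with external causes, such as a region outage, simply carry no `deploy_id` from the window and therefore never enter either number.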
Instrument all four from data you already have
The most useful and least appreciated fact about the four keys: you do not need a survey to measure them. Every input already exists in your tooling. Deployment events come from your CD tool or version control. Incidents come from your incident manager, or from issues tagged Incident. From those two streams, every key is a query.
In a warehouse, the change failure rate and recovery time fall out of a single query against those derived tables:
```sql
-- CFR and recovery over a 30-day window, from deploys + incidents
-- you already collect. No survey involved.
WITH win AS (
  SELECT TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY) AS since
)
SELECT
  COUNT(DISTINCT d.deploy_id) AS deploys,
  COUNT(DISTINCT i.incident_id) AS change_incidents,
  -- Count failed deploys, not incidents: one deploy can cause several incidents.
  SAFE_DIVIDE(COUNT(DISTINCT IF(i.incident_id IS NOT NULL, d.deploy_id, NULL)),
              COUNT(DISTINCT d.deploy_id)) AS change_failure_rate,
  -- 75th percentile of time_resolved - time_created, in minutes.
  APPROX_QUANTILES(
    TIMESTAMP_DIFF(i.time_resolved, i.time_created, MINUTE), 100
  )[OFFSET(75)] AS recovery_p75_minutes
FROM four_keys.deployments d
LEFT JOIN four_keys.incidents i ON i.deploy_id = d.deploy_id
CROSS JOIN win
WHERE d.deployed_at >= win.since;
```

Google's open-source Four Keys project is the canonical do-it-yourself version of exactly this: an event-driven pipeline of webhook → Pub/Sub → Cloud Run ETL → BigQuery → dashboard, with derived tables four_keys.deployments, four_keys.changes and four_keys.incidents, and connectors for GitHub, GitLab and Cloud Build. One caveat that matters: the repository was archived on 23 January 2024. Use it as a reference design, not maintained software.
The thresholds the tool ships with are a useful sanity check, and they are blunter than the marketing benchmarks:
```text
deployment frequency : multiple/day  > daily–weekly  > monthly     > monthly–6mo
change lead time     : under 1 day   > under 1 week  > 1 wk–1 mo   > 1–6 mo
time to restore      : under 1 hour  > under 1 day   > under 1 wk  > over 1 wk
change failure rate  : green < 15%   | yellow 16–45% | red > 45%
```

If you would rather not run a pipeline, the metrics ship off the shelf. GitLab has built-in DORA metrics via its analytics API and CI/CD analytics; GitHub-centric stacks commonly use LinearB, Sleuth, Swarmia, DX or Apache DevLake. The definitions differ subtly between tools, especially in how a "deployment" and an "incident" are detected. Pin the definition before you compare anything, or you are comparing two different measurements and calling it a trend.
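Returning to the thresholds listed above, the change-failure-rate colours can be applied mechanically. This is a sketch under one stated assumption: the listing leaves the range between 15% and 16% unassigned, and the hypothetical `cfr_band` function below folds it into yellow.

```python
def cfr_band(rate):
    """Colour band for a change failure rate given as a fraction (0.0 to 1.0).

    Assumption: values from 15% up to 16% are treated as yellow, because the
    listed thresholds do not assign that range to any colour.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    if rate < 0.15:
        return "green"
    if rate <= 0.45:
        return "yellow"
    return "red"
```

A rate of 0.25, the worked example of one failure in four deploys, lands in yellow under these thresholds.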
The bar always moved — "elite" was a direction, never a target
Here is the fact that should kill the leaderboard mindset for good: the thresholds were never fixed. The change-failure-rate bar has been moved repeatedly, and in 2025 DORA threw out the tiers altogether.
1. 2019 (four tiers): Elite/high/medium/low. Elite change failure rate banded at 0 to 15%.
2. 2021 (reliability): A contextual fifth dimension joins the four keys.
3. 2023 (names + bar): MTTR becomes failed-deployment recovery; elite CFR reported as a ~5% point estimate.
4. 2024 (rework rate): A fifth key added; change failure rate treated as a proxy for rework.
5. 2025 (tiers retired): Replaced by percentile distributions and seven team archetypes; top band near 0 to 2%.
Source: DORA — dora.dev/insights/dora-metrics-history; getDX; RDEL #115
Watch the change-failure-rate bar specifically: elite sat at 0 to 15% across the 2019, 2021 and 2022 reports; the 2023 report quoted a roughly 5% point estimate; and the 2025 percentile data puts the top band near 0 to 2% (getDX, scrums.com, RDEL #115). If "elite" meant 15% one year and 2% a few years later, it was never a finish line. It was a moving, context-dependent diagnostic the whole time.
The 2025 percentile snapshot is also a healthy dose of perspective on how rare the top of that old ladder really is. "On-demand", the deploy cadence the elite tier was built around, turns out to be the exception, not the rule.
Stability is rarer still: only about 8.5% of teams post a change failure rate in the 0-to-2% band (RDEL #115, citing 2025 DORA data — treat as indicative). DORA replaced the four tiers with seven archetypes spanning throughput, stability, product performance, burnout and friction — from Harmonious High-Achievers (around 20%) down through Pragmatic Performers, Constrained by Process, Stable & Methodical, Legacy Bottleneck, Foundational Challenges, to High Impact / Low Cadence (around 7%). The point of seven named clusters instead of a ladder is that there is no single line to stand above.
The goal isn't better DORA metrics. It's similar to driving a car with the goal of acceleration. Accelerating to where? Do we care if the engine explodes?
The 2026 job: the stability pair is your AI guardrail
This is what makes the four keys more useful in 2026, not less. AI changed the shape of the data, and it changed it in a direction the four keys are built to catch.
In the 2024 DORA report, 75% of respondents already relied on AI for at least one daily responsibility, and 39% reported little-to-no trust in AI-generated code. The modelled effect was telling: a 25% increase in AI adoption was associated with +7.5% documentation quality, +3.4% code quality and +3.1% review speed — but −1.5% delivery throughput and −7.2% delivery stability (Google Cloud, 2024 DORA report). The 2025 report — about 5,000 professionals plus more than 100 hours of qualitative work — found AI use near-universal at 90%, with median time spent at roughly two hours a day and 30% still reporting little or no trust. AI's relationship to throughput flipped positive versus 2024. Its negative relationship with delivery stability persisted (Google Cloud, 2025 DORA report).
So the senior move is mechanical: as you scale AI, alert on the stability pair. Page when the rolling change failure rate or the failed-deployment recovery time trends up — even while deployment frequency and lead time are improving. That "even while" is the whole point. AI shipping more, faster, looks like a win on the throughput pair right up until the stability pair tells you what it cost.
```yaml
# Page when the AI rollout is buying throughput at the cost of stability.
# Fires only when stability degrades WHILE throughput improves.
groups:
  - name: dora-stability-guardrail
    rules:
      - alert: ThroughputUpStabilityDown
        expr: |
          (rate(deployments_total[7d]) > rate(deployments_total[7d] offset 7d))
          and
          (dora_change_failure_rate_7d > dora_cfr_baseline * 1.25
           or dora_recovery_minutes_p75 > dora_recovery_p75_baseline * 1.25)
        for: 24h
        labels: { severity: page, team: platform }
        annotations:
          summary: "Throughput rising while change failure rate / recovery degrade"
          runbook: "Shrink batch size; tighten review on AI-authored changes."
```

The 2025 thesis is that AI is an amplifier, not a shortcut: with a strong internal platform it lifts a good delivery system, and without one it can net-harm performance. Platform quality is the difference-maker, and 90% of organisations now run at least one internal platform. DORA paired this with a seven-part AI capabilities model: a clear AI stance, healthy data ecosystems, AI-accessible internal data, strong version control, a user-centric focus, quality internal platforms, and working in small batches. Notice how much of that is the same boring delivery hygiene the four keys have always rewarded. The context underneath is sobering, too: between the 2023 and 2024 reports the high-performing cluster shrank from 31% to 22% of respondents while the low cluster grew from 17% to 25%. The field got shakier, not steadier, which is exactly when a stability guardrail earns its keep.
Seven ways teams ruin the four keys — and the fix
Most of the damage done with DORA is self-inflicted, and it follows a small number of well-worn patterns. The fix is almost always the same shape: move the goal onto outcomes, and demote the four keys back to leading indicators a team owns.
| Anti-pattern | Why it bites | Do this instead |
|---|---|---|
| Four keys as per-team OKRs or quarterly targets | Goodhart's law — the moment a metric is the target it stops measuring reality | Set OKRs on user and business outcomes; treat the keys as leading indicators a team owns |
| Ranking teams on a leaderboard | Rewards gaming and ignores context; a very low recovery time plus a high deploy count often hides a hotfix process running outside normal flow | Trend each team against its own past baseline; investigate anomalies, do not score them |
| Optimising deployment frequency in isolation | It is a proxy for batch size, not speed; chase it alone and you ship more bad changes | Always read the throughput pair against the stability pair |
| Gaming under pressure — splitting PRs, not logging small incidents | The dashboard then measures fear, not delivery | Keep metrics at team level, never individual; invest in psychological safety so incidents get logged honestly |
| Treating "elite" as a fixed target to hit | DORA moved the bar for years (CFR 0–15% to 5% to near 0–2%) and retired the tiers altogether in 2025 | Anchor on trend and on your binding constraint; re-baseline when the methodology changes |
| Shipping AI-accelerated code without watching stability | 2025 DORA: AI lifts throughput but degrades delivery stability — failure rate and recovery inflate silently | Make change failure rate and recovery the explicit guardrail for the AI rollout; invest in platform quality and small batches |
| Treating DORA as a verdict on engineering health | It measures delivery only — silent on whether you build the right thing, on architecture, and on burnout | Pair the four keys with product / user-value and team-health signals |
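Two of the fixes in the table, trending a team against its own baseline and reading the throughput pair against the stability pair, can be combined in one small check. This is a sketch only: the dictionary keys, the function name and the 1.25 tolerance (borrowed from the alert rule earlier in this piece) are assumptions, and the comparison is deliberately against the team's own past, never another team.

```python
def stability_guardrail(current, baseline, tolerance=1.25):
    """Return True when throughput improved while stability degraded.

    `current` and `baseline` are dicts for the SAME team with the keys
    deploys_per_week, change_failure_rate and recovery_p75_minutes.
    """
    throughput_up = current["deploys_per_week"] > baseline["deploys_per_week"]
    stability_down = (
        current["change_failure_rate"]
        > baseline["change_failure_rate"] * tolerance
        or current["recovery_p75_minutes"]
        > baseline["recovery_p75_minutes"] * tolerance
    )
    return throughput_up and stability_down
```

A `True` result is a prompt to investigate, for example by shrinking batch size, not a score to rank teams by.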
Use them as a compass: diagnose, fix the constraint, re-measure
The metrics are a diagnostic, not a scoreboard. The loop that actually works is unglamorous and repeatable: measure the set, find the one constraint that binds, fix that, and confirm the metric moved before you touch anything else.
1. Measure: Instrument the four keys from existing pipeline and incident data, with no survey needed.
2. Diagnose: Find the binding constraint: slow reviews, manual deploys, flaky tests, or slow recovery?
3. Improve: Fix that one constraint, for example automate the deploy, add a canary, or speed up the pipeline.
4. Re-measure: Confirm the metric moved, then find the next constraint. Repeat.
What that loop produces, done honestly, is a clean before-and-after rather than a higher score. A cloud-modernisation programme reported by the consultancy Future Processing for a named client (ADIA) is a representative shape:
Before — the binding constraint
- Change lead time around two months
- Change failure rate over 30%
- Manual, big-batch releases
After — constraint removed
- Change lead time around one day
- Change failure rate under 10%
- Around 50% lower client cloud cost
The distance the loop can cover is the most citable evidence DORA produced. The 2019 report (more than 31,000 professionals, with about 20% classed elite) found elite teams deployed roughly 208 times more frequently than low performers. The figures often quoted alongside it (around 106 times faster lead time, 2,604 times faster recovery, 7 times lower change failure rate) are real but more loosely sourced; treat the 208x as the solid number and the rest as directional. Capital One is frequently cited as a roughly 20x release-frequency jump without added production incidents, but the primary source is hard to pin down, so keep it as a recognisable name, not a number to bank on. The honest reading is the same either way: the gap between strong and weak delivery is large, closing it is achievable, and the loop above is how you do it.
Where the keys stop
DORA tells you about delivery performance. It is silent on whether you are building the right thing, whether the architecture is sound, or whether your people are burning out. That is not a flaw — it is scope. The four keys measure one important dimension well, and the failure mode is treating them as a verdict on engineering as a whole. The 2025 archetypes lean into exactly this nuance by adding burnout and friction as explicit dimensions, because throughput bought at the cost of a wrecked team is not a win you get to keep. Pair the four keys with product and user-value signals and with team-health measures, and they become what they were always meant to be: a strong, honest compass for one part of the journey — not the map.