AI is an amplifier, not a shortcut: reading the 2025 DORA signal

The 2025 edition of DORA's research turned its attention to AI-assisted software development, and the headline finding is more interesting than the usual "productivity is up." Yes, throughput rose. But stability often fell at the same time — and whether AI helped or hurt overall depended almost entirely on the quality of the platform underneath it. Across nearly 5,000 surveyed technology professionals, the pattern was consistent: AI adoption correlated with more output and more rework. The teams that escaped that trap had one thing in common — strong internal platforms with fast CI, automated testing, observable systems, and a culture that treats reversibility as a first-class property. AI, it turns out, is an amplifier. It makes a strong delivery system stronger and a weak one weaker, and it does so at speed, which is the uncomfortable part.

This post goes deep on what the data actually says, what the code-level evidence adds on top of it, and what you should concretely do if you want AI to compound your delivery quality rather than your defect rate.

2025 DORA survey: the state of AI adoption

~90%: developers using AI at work (up 14 points year-over-year)
65%: heavily reliant on AI tools (past the experimenting phase)
~30%: little or no trust in AI output (adoption does not equal confidence)
80%+: report significant productivity gains (but stability tells a different story)

Source: DORA, State of AI-Assisted Software Development, 2025 (n ≈ 5,000)

What the 2025 DORA data actually says

DORA's 2025 report is the first in the series to focus specifically on AI-assisted development rather than treating AI as one factor among many. The survey covered nearly 5,000 professionals, weighted by team size and industry. Four findings stand out.

Throughput rose — genuinely. AI adoption correlated with higher deployment frequency and shorter lead times. Teams using AI tools were shipping more code, faster. This is not contested and it is not surprising; AI assistants are demonstrably fast at producing syntactically correct code for well-specified problems.

Stability fell — for many teams. The same cohort showed higher change failure rates and longer times to restore service when incidents occurred. More code, shipped faster, with more of it needing correction. For the roughly 80% of respondents who reported productivity gains, a meaningful share of those gains was being consumed by rework. DORA's 2024 analysis found a 7.2% decrease in delivery stability associated with a 25% increase in AI adoption — and the 2025 report reaffirms the direction of that relationship. A figure worth examining carefully, not dismissing.

The mediating variable is platform quality. The teams that achieved a net benefit — more throughput and maintained stability — consistently had stronger internal platforms: fast automated CI, meaningful test coverage, observable systems, and deployment mechanisms that make rollback cheap. The DORA four keys (deployment frequency, lead time for changes, change failure rate, mean time to restore) were healthy before AI arrived; AI pushed them further in the right direction for these teams. For teams with weak platforms, AI pushed the failure metrics further in the wrong direction.

User-centric focus is a secondary predictor. Teams that grounded their work in user research and feedback loops — not just output velocity — captured the strongest benefits. AI without a signal about whether users actually want what you shipped faster is a more efficient way to generate waste.

One figure deserves attention: roughly 90% of the surveyed organisations reported having an internal developer platform of some kind. But having a platform and having a good platform are different things. The correlation DORA found was not between platform existence and AI benefit — it was between platform quality and AI benefit. A Backstage installation with no golden paths and three hand-maintained YAML files does not constitute a strong platform.

The 2025 DORA thesis

As AI adoption rises, throughput and stability pull apart; a high-quality platform is what closes the gap.

Source: Illustrative, after DORA 2025

The code quality debt you are shipping faster

DORA measures outcomes at the delivery-pipeline level. To understand what is happening at the code level, GitClear's 2025 research offers a complementary lens. Their team analysed 211 million lines of changed code spanning 2020 to 2024 — across anonymised private repositories and 25 large open-source projects — one of the most substantial empirical analyses of AI-assisted coding published to date.

The findings are worth sitting with.

Code churn roughly doubled. Overall code churn rose from a pre-AI baseline of 3.3% to 7.1% in 2025. On the narrower two-week measure, the share of newly added code revised within two weeks climbed from 5.5% in 2020 to 7.9% in 2024, a 44% relative increase. Churn is expensive: it consumes reviewer time, adds CI runs, and generates changelog noise that obscures meaningful changes in your history.

Copy-paste code rose 48% in relative terms. Duplicated code blocks went from 8.3% of all changes in 2020 to 12.3% in 2024. AI assistants are very good at producing plausible-looking code that solves the immediate problem; they are much less inclined to notice that this solution already exists three files over and should be abstracted.

Refactoring collapsed. The percentage of "moved" lines — code that was restructured rather than simply added — fell from 24.1% in 2020 to 9.5% in 2024. Developers are writing more and reorganising less. That is a reliable early signal of accumulating technical debt: the codebase grows faster than it is tidied.

Code quality signals under AI-assisted development, 2020 vs 2024

Refactored (moved) code, 2020 baseline: 24.1%
Refactored (moved) code, 2024: 9.5%
Copy-paste code, 2020 baseline: 8.3%
Copy-paste code, 2024: 12.3%

Source: GitClear, 2025 (211M lines analysed, private repos + top 25 open-source projects)
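The relative changes quoted in this section can be checked directly from the raw shares. A minimal Python sketch, where the function name and dictionary keys are illustrative rather than anything GitClear publishes:

```python
def relative_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


# (2020 share, 2024 share) in percent, as quoted above from GitClear.
signals = {
    "revised within two weeks": (5.5, 7.9),
    "copy-paste": (8.3, 12.3),
    "moved (refactored)": (24.1, 9.5),
}

for name, (before, after) in signals.items():
    print(f"{name}: {relative_change(before, after):+.1f}%")
# revised within two weeks: +43.6%
# copy-paste: +48.2%
# moved (refactored): -60.6%
```

The absolute moves are only a few percentage points each; it is the relative change that shows how quickly the habits behind them are shifting.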

A worked example of the rework cost. Consider a five-engineer team shipping 10,000 lines of new code per sprint. At a 2020-era churn rate of 5.5%, around 550 lines will need revision within two weeks: roughly 28 hours of review and rework at a conservative estimate of 20 lines per hour. At a 2024-era churn rate of 7.9%, that is 790 lines and about 40 hours. The delta, 12 hours per sprint, looks modest. But it is already a team-level figure, and at four sprints per month it adds up to 48 hours a month, or roughly 576 hours a year. Assuming about 160 working hours per engineer-month, that is around three and a half engineer-months of rework annually that did not exist before AI. It rarely appears on any velocity metric because it looks like regular work.
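The per-sprint arithmetic is easy to rerun with your own numbers. The sketch below is one way to do it in Python; the function names are illustrative, the 20 lines per hour is the estimate used above, and the sprint cadence is left as an explicit input because the annual total is very sensitive to it.

```python
def rework_hours(lines_per_sprint: int, churn_pct: float, lines_per_hour: float = 20.0) -> float:
    """Hours spent revising churned lines in one sprint."""
    churned_lines = lines_per_sprint * churn_pct / 100
    return churned_lines / lines_per_hour


def annual_extra_hours(extra_per_sprint: float, sprints_per_year: int) -> float:
    """Scale a per-sprint delta to a year; the cadence is an assumption you supply."""
    return extra_per_sprint * sprints_per_year


before = rework_hours(10_000, 5.5)   # 27.5 hours at the 2020-era rate
after = rework_hours(10_000, 7.9)    # 39.5 hours at the 2024-era rate
extra = after - before               # 12.0 extra hours per sprint

print(f"{before:.1f} h -> {after:.1f} h (+{extra:.1f} h per sprint)")
print(f"{annual_extra_hours(extra, 48):.0f} extra hours per year at 48 sprints")
```

Swapping in your own churn rate and review speed matters more than the specific figures here; the point is that the cost is computable, so it can be tracked.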

The independent code analysis firm CodeRabbit found approximately 1.7 times more issues per pull request in AI-coauthored code compared to fully human-authored code (CodeRabbit, December 2025 analysis). The security picture is worse: multiple independent analyses through 2024 and 2025 found that between 40% and 50% of AI-generated code contains at least one security vulnerability, though figures in that range vary by study methodology and should be treated as illustrative of the order of magnitude rather than a precise benchmark.

AI doesn't fix a broken delivery system. It exposes it — at speed.

— The practical reading of DORA 2025

Platform quality is the deciding variable

The reason platform quality moderates AI outcomes is not mysterious once you examine the feedback loop. AI tools accelerate the rate at which code is written. If your platform provides fast, trustworthy feedback on whether that code is correct, the acceleration is net-positive: you find problems quickly and fix them quickly. If your platform provides slow or noisy feedback, the acceleration is net-negative: AI helps you generate a larger queue of uncertain work that takes days or weeks to resolve.

The practical implication: before expanding AI tooling licences or rolling out AI-generated infrastructure code, the questions to answer are about your platform. How long does it take for a developer to know whether their change passed? Under 10 minutes is the threshold worth targeting; over 30 minutes is a signal that AI is going to hurt more than it helps. What fraction of CI failures are genuine signal versus flaky tests? A flaky-test rate above 5% means developers have already learned to distrust their pipeline — AI will compound that distrust. When a deployment fails, how long does rollback take? If rollback is measured in hours, AI-generated changes accumulate blast radius faster than you can contain it.

amplified risk

Weak platform + AI

No automated test gates — AI-generated regressions ship silently
30-minute CI cycles mean developers stack changes without waiting for results
Flaky tests above 5% — developers tune out pipeline feedback entirely
No trace coverage — incidents take longer to diagnose despite AI triage suggestions
Change failure rate rises; rework consumes the throughput gains AI produced
Developer trust in AI erodes — tool half-adopted, used inconsistently

amplified safety

Strong platform + AI

Sub-10-minute CI catches regressions immediately; AI suggestions stay accountable
Preview environments per PR let AI-generated code prove itself before merge
Flaky test rate below 2% — pipeline signal is trustworthy, feedback is acted on
High trace and log coverage — AI triage is accurate because the signals are clean
DORA four keys stay healthy as throughput rises; lead time and failure rate move together
Developer trust in AI grows because mistakes are caught early and reversals are cheap

The same AI tooling, opposite outcomes: platform quality is the deciding variable

Source: After DORA 2025
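The thresholds in this section can be turned into a simple readiness check to run before expanding AI tooling. This is a sketch under stated assumptions: the class and field names are hypothetical, and "rollback measured in hours" is interpreted here as 60 minutes or more.

```python
from dataclasses import dataclass


@dataclass
class PlatformHealth:
    ci_minutes: float        # time until a developer knows whether a change passed
    flaky_rate_pct: float    # share of CI failures caused by flaky tests, in percent
    rollback_minutes: float  # time to roll back a failed deployment


def readiness_warnings(health: PlatformHealth) -> list[str]:
    """Return the platform gaps that AI adoption is likely to amplify."""
    warnings = []
    if health.ci_minutes > 30:
        warnings.append("CI feedback takes over 30 minutes")
    elif health.ci_minutes >= 10:
        warnings.append("CI feedback is not yet under 10 minutes")
    if health.flaky_rate_pct > 5:
        warnings.append("flaky-test rate above 5%")
    if health.rollback_minutes >= 60:  # assumption: 'hours' starts at 60 minutes
        warnings.append("rollback is measured in hours")
    return warnings


print(readiness_warnings(PlatformHealth(ci_minutes=8, flaky_rate_pct=1.5, rollback_minutes=5)))
# []
print(readiness_warnings(PlatformHealth(ci_minutes=45, flaky_rate_pct=8, rollback_minutes=180)))
```

An empty list does not prove the platform is strong. It only means none of the warning signs named above are present.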

The user-centric finding from DORA deserves a practical translation. Teams with strong user-centric focus — regular user research, outcome-based metrics, feedback loops from production usage — get more from AI because they can quickly determine whether an AI-assisted feature actually solved the user problem. Teams that measure output (commits, velocity points, PRs merged) rather than outcomes (feature adoption, error rates, user retention) see AI inflate their output metrics while hiding the fact that the outcomes are not improving. AI accelerates the cycle; the feedback mechanism determines whether the cycle converges on the right thing.

Hardening your delivery pipeline for AI-assisted teams

Platform hardening for AI-assisted teams is not fundamentally different from good platform engineering — it is the same work, with urgency raised by the fact that AI multiplies throughput, which multiplies the rate at which any gap in your safety net gets exercised.

Enforce quality gates in CI

The most immediate lever is a CI pipeline that runs fast and fails clearly. For teams on GitHub Actions, a minimal enforcement layer looks like this:

name: Quality Gate

on:
  pull_request:
    branches: [main, develop]

jobs:
  quality:
    runs-on: ubuntu-latest
    timeout-minutes: 10
    permissions:
      contents: read
      security-events: write   # lets CodeQL upload its results
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm

      # SAST setup: CodeQL must be initialised before it can analyse
      - uses: github/codeql-action/init@v3
        with:
          languages: javascript

      - run: npm ci

      # Unit + integration: must finish in under 5 minutes
      # (--coverage and --maxWorkers assume a Jest-style test runner)
      - name: Run tests with coverage
        run: npm test -- --coverage --maxWorkers=4
        timeout-minutes: 5

      # SAST: catches common AI-generated security antipatterns
      - uses: github/codeql-action/analyze@v3

      # Dependency audit: AI suggests the package versions it was trained on, not the latest safe ones
      - run: npm audit --audit-level=high

The timeout-minutes: 10 constraint on the job and timeout-minutes: 5 on the test step are not cosmetic. If your test suite cannot run in 10 minutes, that is the engineering problem to solve — not "how do we add more AI tooling." Fast CI is the foundation on which everything else runs.

IaC policy gates for AI-generated infrastructure

Infrastructure-as-code generated by AI can pass terraform validate and terraform plan and still silently violate your security posture: open ingress rules, overly broad IAM policies, missing encryption at rest. Checkov or Conftest policies in CI catch the most common classes of over-permission that AI generates. A minimal Checkov configuration for S3 enforcement:

# .checkov.yaml: policy gates for AI-generated IaC
# With `check` set, Checkov runs only the checks listed here.
check:
  - CKV_AWS_19   # S3 server-side encryption at rest
  - CKV_AWS_20   # S3 ACL does not allow public read
  - CKV_AWS_21   # S3 bucket versioning enabled
  - CKV2_AWS_6   # S3 public access block configured
  - CKV_AWS_18   # S3 access logging enabled
  - CKV_AWS_145  # S3 KMS-managed encryption
soft-fail: false  # any failed check fails the CI step

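Run this in CI with checkov -d terraform/ --config-file .checkov.yaml before every plan. It is about twenty minutes of configuration work. Because the check list above is restricted to S3 checks, it covers the storage misconfigurations that AI introduces most frequently; extend the list with IAM checks to cover overly broad policies as well.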

Track churn as a sprint metric

Churn rate is computable from your git history. A simple proxy: count the number of commits in a sprint that modify code added in the previous two weeks, expressed as a ratio of all commits. A ratio above 0.3 is worth investigating. Above 0.5 means you are in rework mode and AI throughput gains are not compounding — you are running to stand still.
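A minimal sketch of that proxy in Python. It assumes each commit has already been flagged upstream (for example from blame data) as touching code added in the previous two weeks; the Commit type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Commit:
    sha: str
    touches_recent_code: bool  # modifies lines added in the previous two weeks


def churn_ratio(commits: list[Commit]) -> float:
    """Share of a sprint's commits that rework recently added code."""
    if not commits:
        return 0.0
    churned = sum(1 for c in commits if c.touches_recent_code)
    return churned / len(commits)


def classify(ratio: float) -> str:
    """Apply the 0.3 and 0.5 thresholds described above."""
    if ratio > 0.5:
        return "rework mode"
    if ratio > 0.3:
        return "investigate"
    return "ok"


sprint = [Commit("a1", True), Commit("b2", False), Commit("c3", True),
          Commit("d4", False), Commit("e5", False)]
ratio = churn_ratio(sprint)
print(f"{ratio:.2f} -> {classify(ratio)}")  # 0.40 -> investigate
```

Counting commits rather than lines is deliberately coarse. The value comes from tracking the same number sprint over sprint, not from its absolute precision.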

Classify AI use cases by oversight level

Not all AI use cases carry the same risk. This table maps common uses to appropriate oversight:

| Use case | Trust level | Minimum gate |
|---|---|---|
| Boilerplate, scaffolding, test stubs | High | Standard PR review |
| New business logic | Medium | Reviewer explicitly checks AI-generated sections |
| Modifying existing core logic | Medium-Low | Two-reviewer approval plus regression test |
| Security-sensitive code (auth, crypto, payments) | Low | Mandatory SAST plus dedicated security review |
| Infrastructure-as-code changes | Low | Plan review, policy check, manual approval |
| Incident root-cause analysis | High | Informational only; no gate needed |
| Automated remediation in production | Very Low | Human approval required before any mutation |

The "security-sensitive code" and "infrastructure changes" rows are where teams get hurt without realising it. The fix is not to prohibit AI in those areas — it is to add the appropriate gate, which costs far less than the incident it prevents.
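If the classification is to be enforced by tooling rather than remembered by reviewers, it can live as data. The sketch below encodes the table in Python; the short keys are hypothetical labels, and falling back to the strictest row for unknown use cases is a design choice of this sketch, not something the table prescribes.

```python
# use case key -> (trust level, minimum gate), mirroring the table above
OVERSIGHT = {
    "boilerplate": ("High", "Standard PR review"),
    "new_business_logic": ("Medium", "Reviewer explicitly checks AI-generated sections"),
    "core_logic_change": ("Medium-Low", "Two-reviewer approval plus regression test"),
    "security_sensitive": ("Low", "Mandatory SAST plus dedicated security review"),
    "iac_change": ("Low", "Plan review, policy check, manual approval"),
    "incident_rca": ("High", "Informational only; no gate needed"),
    "automated_remediation": ("Very Low", "Human approval required before any mutation"),
}


def minimum_gate(use_case: str) -> str:
    """Look up the gate; unknown use cases get the strictest row (sketch assumption)."""
    _trust, gate = OVERSIGHT.get(use_case, OVERSIGHT["automated_remediation"])
    return gate


print(minimum_gate("iac_change"))
print(minimum_gate("something_new"))
```

Defaulting to the strictest gate means a new, unclassified use of AI is slowed down until someone makes a conscious decision about it.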

AI in operations: investigation yes, autonomous mutation no

The same amplifier logic applies when AI moves from development into operations. AI is genuinely useful in incident response — it can correlate signals across logs, metrics, and traces at a speed no human matches, surface recent deploys that correlate with the incident window, and propose a ranked list of likely causes. These are high-value, low-risk uses. The risk arrives when AI is given authority to act on its conclusions without a human reviewing the reasoning first.

The pattern that holds up in production draws a clear line between investigation and mutation:

AI-assisted incident response — human gate before any production mutation

01. Alert. A symptom-based page fires from SLO error-budget consumption, not just a raw threshold on a single metric. This reduces noise and ensures the AI triage context is meaningful.
02. AI triage. Correlate logs, metrics, traces, and the deploy history for the preceding 2 hours. Produce a structured summary: likely root cause, confidence level, top 3 candidate changes. No action taken yet.
03. Human gate. The on-call engineer reads the AI summary and verifies the reasoning before approving any change. A well-structured AI summary makes this review take under 3 minutes: not a bottleneck, a checkpoint.
04. Scoped action. Apply the fix with the smallest possible blast radius: a rollback, a targeted config change, a scale event. Never a broad auto-remediation. Each action is logged with the approving engineer's identity.
05. Post-incident signal. Record whether the AI root-cause was correct. Accuracy improves over time only if you track where it is wrong. Add a field to your retrospective template; it takes five minutes and gives you the data to trust or calibrate your tooling.

Source: ClimsTech Engineering practice

The last step — capturing whether the AI diagnosis was accurate — is overlooked by most teams. Confidence and accuracy are different things. An AI model that was trained on your incident history will produce confident-sounding summaries; whether those summaries are correct is an empirical question that requires measurement.
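Measuring that gap needs very little machinery. The sketch below assumes each retrospective records the confidence the tool stated and whether its root cause turned out to be right; the record type, field names, and sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Retro:
    incident_id: str
    ai_confidence: float          # confidence stated by the tool, 0.0 to 1.0
    ai_root_cause_correct: bool   # filled in by a human at the retrospective


def diagnosis_accuracy(retros: list[Retro]) -> float:
    """Fraction of incidents where the AI root cause was confirmed correct."""
    return sum(r.ai_root_cause_correct for r in retros) / len(retros)


def mean_confidence(retros: list[Retro]) -> float:
    """Average confidence the tool claimed across the same incidents."""
    return sum(r.ai_confidence for r in retros) / len(retros)


# Illustrative data only, not real incidents.
history = [
    Retro("INC-1", 0.9, True),
    Retro("INC-2", 0.8, False),
    Retro("INC-3", 0.9, True),
    Retro("INC-4", 0.8, False),
]

print(f"claimed {mean_confidence(history):.2f}, observed {diagnosis_accuracy(history):.2f}")
# claimed 0.85, observed 0.50
```

A persistent gap between the two numbers is the signal to tighten the human gate. A small one is the evidence you need before relaxing it.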

A concrete failure mode illustrates why the human gate matters: a team configured their AI operations tool to automatically restart Kubernetes pods when it detected elevated memory correlated with a known OOM pattern. The AI diagnosis was correct roughly 85% of the time. The remaining 15% involved restarts during active database migrations and write-heavy batch jobs, where pod restarts caused data inconsistencies that took six hours to resolve. The correct response was not to disable AI operations. It was to keep the AI diagnosis and require human approval before the restart action. The difference in operational overhead was minimal; the difference in blast radius was enormous.
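The gate itself can be very small. The following Python sketch shows the shape of it: investigation-only actions pass straight through, while anything that mutates production is refused unless a named human has approved it. All names here (ProposedAction, authorise, the example approver) are hypothetical and not taken from any specific operations tool.

```python
from dataclasses import dataclass
from typing import Optional


class ApprovalRequired(Exception):
    """Raised when a production mutation is attempted without a named approver."""


@dataclass
class ProposedAction:
    description: str
    mutates_production: bool


def authorise(action: ProposedAction, approved_by: Optional[str] = None) -> dict:
    """Return an audit record if the action may proceed, otherwise raise."""
    if action.mutates_production and not approved_by:
        raise ApprovalRequired(f"human approval needed for: {action.description}")
    return {"action": action.description, "approved_by": approved_by}


summary = ProposedAction("summarise OOM pattern across pods", mutates_production=False)
restart = ProposedAction("restart pods showing the OOM pattern", mutates_production=True)

print(authorise(summary))                               # investigation needs no gate
print(authorise(restart, approved_by="oncall-alice"))   # mutation carries an approver
```

Returning the approver in the audit record ties the gate to the logging requirement in step 04: every mutation is traceable to the person who approved it.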

Five real pitfalls and how to fix them

Every team adopting AI-assisted delivery runs into a predictable set of failure modes. These are the five that appear most consistently.

Pitfall 1: Treating AI output as verified because someone looked at it. A developer reviews an AI-generated PR the same way they skim a junior engineer's work: checking structure, not interrogating logic. AI-generated code has no model of the broader system context. It solves the stated problem without knowing what assumptions the surrounding code makes, what invariants the system depends on, or what edge cases are handled elsewhere. Fix: explicitly label AI-generated sections in PRs, for example with a Co-authored-by: trailer on the commit. Some agent-style tools add such a trailer themselves, but inline completions leave no marker, so make the label a requirement for any AI-assisted code. Then apply a higher scrutiny standard to those sections, not a lower one.

Pitfall 2: Expanding AI licences before stabilising CI. Teams roll out Copilot or Cursor company-wide before their CI pipeline is reliable. The result is a flood of PRs with inconsistent quality and a CI system that cannot keep up or is not trusted. Fix: measure your current pipeline health — flaky-test rate, CI wall time, average time-to-merge — before expanding AI adoption. If any of those numbers is bad, the pipeline is the right investment first.

Pitfall 3: Generating tests with the same AI that generated the code. AI that writes the implementation and then writes the tests for it tends to produce tests that verify what the code does, not what it should do. Coverage percentage looks fine; mutation score is poor. Fix: require that test cases for AI-generated logic either be written by a human or generated by a separate AI prompt with an adversarial framing ("write tests that would catch bugs in this function, including edge cases the implementation might miss"). These are different prompts and they produce meaningfully different tests.

Pitfall 4: AI-generated IaC with no policy enforcement. As described above, AI produces syntactically valid Terraform that passes terraform plan but violates your security posture. This class of error is invisible until an audit or an incident. Fix: Checkov or Conftest policies in CI, running on every IaC pull request, with hard failures for critical checks. The Checkov configuration shown earlier catches the most common S3 variants and is a starting point for the rest.

Pitfall 5: AI-generated commit messages as the audit trail. AI commit messages are fluent and empty. "Updated authentication logic to improve security" tells you nothing about what changed or why. In a post-incident review, or during an audit, this is a serious problem. Fix: require a commit message template that includes a structured change rationale and a link to the issue or decision record. AI can draft the prose; a human must supply the reasoning. The distinction matters.
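The commit message requirement in Pitfall 5 is straightforward to check mechanically, for instance from a commit-msg hook or a CI step. The sketch below assumes a hypothetical template with a "Why:" line for the rationale and a "Refs:" line for the issue or decision record; adapt the field names to whatever template your team adopts.

```python
import re

# Hypothetical template fields; the names are assumptions of this sketch.
REQUIRED_FIELDS = {
    "rationale": re.compile(r"^Why: \S.+", re.MULTILINE),
    "reference": re.compile(r"^Refs: \S+", re.MULTILINE),
}


def missing_fields(message: str) -> list[str]:
    """Return the template fields that a commit message fails to supply."""
    return [name for name, pattern in REQUIRED_FIELDS.items()
            if not pattern.search(message)]


fluent_but_empty = "Updated authentication logic to improve security"
structured = (
    "Reject expired refresh tokens at the gateway\n\n"
    "Why: expired tokens were accepted until the next cache flush\n"
    "Refs: AUTH-123\n"
)

print(missing_fields(fluent_but_empty))  # ['rationale', 'reference']
print(missing_fields(structured))        # []
```

A check like this only proves the fields exist. Whether the rationale is real reasoning rather than more fluent filler still needs a human reviewer.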

What to remember

AI adoption is near-universal (~90%); genuine trust in AI output is not — roughly 30% of developers report little or no confidence in what it generates (DORA 2025).
AI raises throughput and can raise instability simultaneously. Platform quality — fast CI, low flaky-test rate, cheap rollback, solid observability — is the variable that determines which direction the amplification runs.
Code quality signals are deteriorating under AI: code churn roughly doubled, copy-paste code up ~48%, and refactoring down by over 60% between 2020 and 2024 (GitClear, 2025, 211M lines).
Harden the pipeline before expanding AI adoption: sub-10-minute CI, flaky-test rate below 2%, policy gates on IaC, SAST on AI-assisted PRs, and a churn metric in your sprint retrospective.
Let AI investigate and summarise freely in incident response, but require a human approval gate on any action that mutates production state — and track whether the AI diagnosis was actually correct.
Platform investment and AI investment are sequential, not competing. Fix the platform first; then let AI compound on top of it.