From scripts to pipelines: a CI/CD maturity model

The most common mistake in CI/CD advice is writing for the 19% of engineering teams that already qualify as elite performers while the other 81% try to apply practices that presuppose a level of automation they do not yet have. The four DORA key metrics — deployment frequency, lead time for changes, change failure rate, and mean time to recover — are not a ceiling to sprint toward. They are a coordinate system. This article is for teams somewhere between "someone runs a deploy script" and "every push canaries to 1% of traffic." Each level is concrete, with real configuration, real pitfalls, and a specific signal for when the rung you are on has stopped being your bottleneck.

Elite performer benchmarks

| Metric | Elite benchmark | Note |
|---|---|---|
| Deploy frequency | On-demand | multiple times/day |
| Lead time | Under 1 hr | commit to production |
| Change failure rate | 5–15% | elite range |
| MTTR | Under 1 hr | from production incident |

Source: DORA State of DevOps Report, 2024

Who is actually in the room: the real performance distribution

Before walking the levels, ground the conversation in who exists. DORA's 2024 State of DevOps survey categorised respondents into four clusters: elite (19%), high (22%), medium (35%), and low (25%). Roughly 60% of working engineering teams sit at medium or low performance. Advice calibrated for elite teams — "deploy on demand," "automated rollback on metric threshold" — is, for this majority, premature optimisation. Applying it before the foundational disciplines are in place adds tooling complexity without the payoff, and complexity without foundation is how well-intentioned pipeline investments create new incident sources rather than eliminating old ones.

DORA 2024 performance cluster distribution: most engineering teams are not elite

| Cluster | Share of respondents |
|---|---|
| Low performers | 25% |
| Medium performers | 35% |
| High performers | 22% |
| Elite performers | 19% |

Source: DORA State of DevOps Report, 2024

The 2024 DORA data included one notable anomaly: for the first time in the survey's history, medium-cluster teams showed a lower change failure rate than high-cluster teams. The most coherent explanation is that medium-performing teams deploy less frequently, so each individual change is more thoroughly validated before it ships. This is a useful reminder that raw throughput and stability are not the same axis. Optimising deploy frequency without investing in automated testing and progressive delivery can increase your incident rate even as it increases velocity — the kind of local optimum that looks like progress until it hurts.

The practical rule: climb the rung that closes your current most painful gap, not the rung described by the last conference talk you watched.

Level 0 — manual scripts and the invisible cost

Level 0 looks like this: a shell script lives in someone's home directory or a dusty deploy/ folder, the actual deploy steps are a mixture of SSH commands, Docker builds, and environment-specific flags that only the senior engineer knows, and the wiki page describing the process is three major product versions out of date. It often works fine for months. The system only breaks under conditions that do not occur until a growth event: a new engineer who does not know the secret flags, a machine that gets wiped, a deploy that has to happen on a Saturday morning by someone other than the usual person.

The costs compound quietly:

Knowledge lock-in. Deploy steps live in one person's muscle memory. When they are on leave or leave the company, the team either cannot ship or ships something wrong. This is not a process risk — it is an organisational risk that is invisible until it fires.

Non-determinism. A deploy from Alice's laptop on a Tuesday and from Bob's laptop on a Friday are not the same operation. PATH differs, tool versions differ, environment variables each person has set locally but never documented differ. The deploy "works" but it is not reproducible. The next time it does not work, the failure is not traceable to any single variable.

No audit trail. When something breaks in production, there is no record of what changed, who changed it, or when. Debugging an incident without a deployment log is archaeology. The missing log is what turns an "annoying" incident into a multi-hour outage, because reconstructing the deployment timeline consumes most of the incident response window.

Pitfall: the "good enough" script that calcifies

The most dangerous property of Level 0 is that it works for long enough to accumulate years of tribal knowledge. The script starts as a convenience. Gradually, special-case flags get added for each environment, a one-liner becomes 80 lines, and exceptions get baked in as magic constants. By the time it breaks badly, the script cannot be safely replaced without weeks of reverse-engineering what it actually does.

Fix: Treat the existing script as a specification, not as the implementation. Your first CI/CD task is to transcribe exactly what the script does into a version-controlled pipeline configuration. You are not improving anything yet — you are making it reproducible and auditable. Improvement comes at Level 1.

Signal to move up: More than one person deploys, or a deploy has failed because a step was skipped, run in the wrong order, or run with a local tool version that differs from the expected one.

Level 1 — continuous integration: the highest-ROI rung on the ladder

Every push builds and runs tests automatically. You catch breakage at the pull request, not in production. This is where teams below Level 1 should concentrate the most engineering effort. Nothing else on this ladder returns value faster or more consistently.

The test pyramid matters more than the CI tool

Teams new to CI frequently underinvest in unit tests and overinvest in end-to-end tests, which produces a pipeline that runs for 40 minutes and fails intermittently due to test flakiness. The right distribution for a typical web service:

| Layer | Proportion | Characteristic |
|---|---|---|
| Unit tests | 70% | Milliseconds each, no I/O, pure functions and classes |
| Integration tests | 20% | Real adapters (database, cache) in a Docker Compose environment |
| End-to-end tests | 10% | Slow and brittle; run nightly rather than on every PR |

Running the full end-to-end suite on every PR is the single most common mistake at this level. If your pipeline takes over 10 minutes, engineers push multiple commits before getting feedback and the pipeline stops functioning as a fast feedback loop. A well-structured PR pipeline for a medium-sized service should complete in under 5 minutes on the PR-blocking path.
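
As a rough way to check a suite against the distribution above, the following sketch computes each layer's share from test counts and reports how far it drifts from the 70/20/10 target. The layer names and the example counts are hypothetical; only the target proportions come from the table.

```python
# Target shares from the test-pyramid table; everything else here is illustrative.
TARGET = {"unit": 0.70, "integration": 0.20, "e2e": 0.10}


def pyramid_drift(counts: dict[str, int]) -> dict[str, float]:
    """Return actual share minus target share per layer (positive = over-invested)."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("suite has no tests")
    return {layer: counts.get(layer, 0) / total - target for layer, target in TARGET.items()}


if __name__ == "__main__":
    # Hypothetical suite that leans too heavily on end-to-end tests.
    drift = pyramid_drift({"unit": 300, "integration": 200, "e2e": 500})
    for layer, delta in drift.items():
        print(f"{layer:12s} {delta:+.0%}")
```

A large positive drift on the end-to-end layer is usually the first thing to address, since that layer dominates pipeline duration.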

A minimal but complete GitHub Actions workflow

name: ci
 
on:
  pull_request:
    branches: [main]
 
jobs:
  build-and-test:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
 
      - name: Set up Node
        uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm
 
      - run: npm ci
 
      - name: Lint
        run: npm run lint
 
      - name: Type-check
        run: npm run typecheck
 
      - name: Unit and integration tests
        run: npm test -- --coverage --ci
 
      - name: Build artifact
        run: npm run build
 
      - name: Upload build artifact
        uses: actions/upload-artifact@v4
        with:
          name: build-${{ github.sha }}
          path: dist/
          retention-days: 7

The upload-artifact step at the end is not optional decoration — it is the foundation of the artifact immutability principle. Building the artifact in CI and uploading it means your deploy pipeline downloads the same binary that passed tests, rather than rebuilding at deploy time. A rebuild at deploy time is a second, untested build. In most toolchains this produces a different binary because build timestamps, dependency resolution order, and ambient environment variables vary. The artifact from CI is your promise that this commit passed; a second build breaks that promise silently.
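
One way to make the "same binary" promise checkable is to record a digest of the artifact in CI and verify it again just before deploy. The sketch below assumes the artifact is a single file and that the CI-recorded digest is available to the deploy job; the function names are hypothetical and not part of any CI provider's API.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: Path, expected_digest: str) -> None:
    """Raise if the artifact about to be deployed is not the one CI recorded."""
    actual = sha256_of(path)
    if actual != expected_digest:
        raise RuntimeError(
            f"artifact digest mismatch: expected {expected_digest}, got {actual}"
        )
```

In use, CI would store the output of `sha256_of` next to the artifact, and the deploy job would call `verify_artifact` before promoting anything.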

Pitfalls at Level 1

Flaky tests. A test that fails one time in five is worse than a test that never existed. It trains engineers to re-run CI and ignore red pipelines. Treat flaky tests with the same urgency as production bugs: quarantine the test on first confirmed flake, file a ticket, and re-admit it to the suite only after 50 consecutive isolated passes. This sounds severe; it is the only approach that maintains the pipeline as a reliable signal.
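
The re-admission rule is simple enough to automate. A minimal sketch, assuming you already collect a pass/fail history of isolated runs for each quarantined test (how that history is gathered is left open here):

```python
def ready_to_readmit(results: list[bool], required_streak: int = 50) -> bool:
    """True when the most recent `required_streak` isolated runs all passed.

    `results` is ordered oldest to newest; True means the run passed.
    """
    if len(results) < required_streak:
        return False
    return all(results[-required_streak:])
```

A single failure anywhere in the trailing window resets the clock, which is the point: the test has to demonstrate stability, not just a good average.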

No branch protection. CI is meaningless without branch protection rules that require it to pass before merge. GitHub's required status checks and GitLab's "Pipelines must succeed" merge check enforce this. Without branch protection, engineers will merge failing PRs under deadline pressure and the pipeline signal degrades within weeks.

Linter warnings that do not fail the build. Running lint in CI but not blocking the merge on warnings is purely decorative. Either configure your lint rules to treat warnings as errors and fail the step, or remove the step. Half-measures reduce the signal-to-noise ratio of the pipeline and teach engineers that a red check is acceptable.

Signal to move up: CI is consistently green, PRs are merging with confidence, but deploying to staging or production is still a manual, high-anxiety procedure that requires someone senior to be online.

The practical difference between Level 0 and Level 1

| Without CI (Level 0) | With CI (Level 1) |
|---|---|
| Broken code discovered in production after deploy | Breakage caught at the pull request, before merge |
| Build environment differs between each developer's laptop | Reproducible build on a standard CI runner for every push |
| Deploy steps undocumented; error-prone under pressure | Lint, type-check and tests gate the merge via branch protection |
| No record of which tests passed before any given build | Artifact pinned to the exact commit SHA that passed |

Source: DORA four keys framework; ClimsTech Engineering

Level 2 — continuous delivery: making deploys boring

Continuous delivery extends CI to produce a deployment-ready artifact on every merge to the main branch and deploys that artifact automatically, at minimum to a staging environment. Production promotion may still involve a human — a one-click approval — but the preparation for that click is entirely automated. The goal at this level is not speed; it is consistency. A deploy should feel as unremarkable as a code review. Teams at Level 1 that find themselves saying "we should deploy more often, but it's a process" are ready for Level 2.

Environments as code

The defining discipline of Level 2 is that environments are defined in version-controlled configuration, not in manually maintained console state. Whether you use Terraform for cloud infrastructure, Kubernetes manifests in a GitOps repository, or Helm charts in a monorepo, the rule is the same: if you cannot recreate a staging environment from the repository, it is a snowflake, and it will eventually drift from production in a way that causes an incident you will spend hours debugging.

A GitOps-based image promotion with Kustomize looks like this:

# ops-repo/apps/staging/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
images:
  - name: my-service
    newName: registry.example.com/my-service
    newTag: a3f9c21
patches:
  - path: replica-patch.yaml

Your CI pipeline on merge to main updates the tag with one command:

# After docker push in CI, point staging at the new image tag in the ops repo.
# The 7-character short SHA matches the newTag format shown above; it must be
# the same tag that was pushed to the registry.
cd ops-repo/apps/staging
kustomize edit set image my-service=registry.example.com/my-service:${GITHUB_SHA:0:7}
git add kustomization.yaml
git commit -m "deploy(staging): promote ${GITHUB_SHA:0:7}"
git push origin main

The GitOps controller (Flux or Argo CD) detects the changed kustomization and reconciles staging, typically within 30 to 60 seconds when it is configured with a short sync interval or a webhook trigger. The Docker image at a specific SHA is immutable; only the pointer in the ops repo moves. Rollback is re-pointing the pointer: no rebuild, and no running docker build in the middle of an incident.

A Level 2 CI/CD pipeline: build once, promote through environments

1. Commit to main. Merge to main triggers the pipeline. Direct pushes to main are blocked by branch protection, with no exceptions.
2. Build and tag artifact. Docker image built, tagged with the git commit SHA, pushed to the container registry. This is the only build; it happens exactly once.
3. Promote to staging. CI updates the image tag in the ops repo. The GitOps controller detects the change and reconciles staging within 30–60 seconds.
4. Automated smoke tests. A targeted integration suite runs against the live staging environment. Failure blocks production promotion and notifies the commit author.
5. Production gate. A human reviews staging and approves. High-frequency teams remove this gate entirely once they have sufficient test coverage and monitoring.
6. Promote to production. Same image, same tag, pointed at production. Rollback is re-pointing the tag: no rebuild, under 60 seconds.

Source: ClimsTech Engineering; GitOps best practice

Pitfalls at Level 2

Snowflake staging environments. If your staging database schema has been manually modified, if staging runs services that production does not have, or if staging is sized so differently that timing-sensitive behaviours diverge, you will have false confidence. Staging green does not mean production green. The fix is infrastructure as code for all environments, plus database migration tooling (Flyway, Liquibase, or Atlas) so that schema state is reproducible. Run terraform plan as a required CI check on every infrastructure change.

Secrets in pipeline YAML. Storing database passwords or API keys as plain-text values in a CI pipeline configuration file is a credential breach waiting to happen. Use your CI provider's encrypted secrets store — GitHub Actions Secrets, GitLab CI Variables with the masked flag — and inject at runtime. Audit all pipeline YAML for any string that matches the shape of a credential. The breach is rarely from an external attacker reading the YAML; it is from an engineer logging the environment variables for debugging and forgetting to redact the output.
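
The audit can start as a small script run over the pipeline files. The patterns below are illustrative and far from exhaustive (a dedicated secret scanner covers many more credential shapes); the sketch only shows the idea of flagging literal values while ignoring `${{ secrets.X }}` references.

```python
import re

# Illustrative, non-exhaustive patterns; treat them as a starting point only.
PATTERNS = {
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github-token": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    # Literal key/value pairs such as `DB_PASSWORD: hunter2secret`.
    # Values that start with `${{` (a secrets-store reference) are skipped.
    "inline-credential": re.compile(
        r"(?i)\b\w*(password|passwd|secret|token|api_key)\w*\s*[:=]\s*(?!\$\{\{)['\"]?[^\s'\"]{6,}"
    ),
}


def scan(text: str) -> list[tuple[int, str]]:
    """Return (line_number, pattern_name) for every suspicious line."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings
```

Running `scan` over every workflow file as a CI step turns the one-off audit into a standing check.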

Rebuild at deploy time. Running docker build again at deploy time rather than using the image produced and pushed in CI is not continuous delivery — it is running CI twice. The second build is untested and, in most toolchains, not bit-for-bit identical to the first. Build once in CI; promote that image.

Signal to move up: You are deploying multiple times per week with confidence, but each bad deploy affects 100% of production traffic and takes 15 to 45 minutes to recover. The risk per deploy — not the friction of deploying — is now your bottleneck.

Level 3 — progressive delivery: controlling the blast radius

Progressive delivery ships to production in controlled increments: a canary that routes 1% of real traffic to the new version, a blue-green switch that can be reversed in seconds, or a feature flag that decouples the deploy from the release entirely. At this level, a bad change affects a small fraction of users for a short window rather than every user for a long one.

The economics are significant. A worked example:
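
Take the following figures as illustrative assumptions, not measurements: a full rollout exposes 100% of traffic for about 30 minutes (the midpoint of the 15 to 45 minute recovery window described at the end of Level 2), while a canary held at 5% is aborted within its first 5-minute pause. Measuring exposure as traffic share multiplied by duration, a minimal sketch of the comparison:

```python
def exposure(traffic_fraction: float, minutes: float) -> float:
    """User exposure in 'full-traffic minutes': share of traffic times duration."""
    return traffic_fraction * minutes


# Assumed scenario values; substitute your own recovery times and canary weights.
full_rollout = exposure(1.00, 30)  # every user affected until recovery completes
canary = exposure(0.05, 5)         # 5% of users affected until the rollout aborts

print(full_rollout, canary, full_rollout / canary)  # prints: 30.0 0.25 120.0
```

Under these assumptions the same bad change causes roughly 120 times less user exposure when it is caught at the first canary step.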

Canary deployments with Argo Rollouts

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause:
            duration: 5m
        - setWeight: 20
        - pause:
            duration: 10m
        - setWeight: 50
        - pause:
            duration: 5m
      analysis:
        templates:
          - templateName: error-rate-check
        startingStep: 1
        args:
          - name: service-name
            value: my-service
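  selector:
    matchLabels:
      app: my-service
  # The pod template (spec.template) is omitted here for brevity;
  # a real Rollout needs one, or a workloadRef to an existing Deployment.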

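The error-rate-check AnalysisTemplate queries Prometheus for the HTTP 5xx ratio. If the ratio breaches the configured threshold, Argo Rollouts aborts the rollout and returns all traffic to the stable version automatically. Teams often set the threshold 1 to 2 percentage points above the stable baseline; the template below uses a fixed 2% ceiling, and its query matches every pod under the service's job label, so isolating the canary would require an additional label selector for the canary pods. Human action is only needed to investigate why the rollback fired, not to perform the rollback itself.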

A corresponding Prometheus analysis template:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-check
spec:
  args:
    - name: service-name
  metrics:
    - name: error-rate
      interval: 2m
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            sum(rate(http_requests_total{job="{{args.service-name}}",status=~"5.."}[2m]))
            /
            sum(rate(http_requests_total{job="{{args.service-name}}"}[2m]))
      successCondition: result[0] < 0.02
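
To make the interplay of successCondition and failureLimit concrete, here is a simplified model of the decision the template describes. It is not Argo Rollouts code: it assumes one measurement per interval and treats the metric as failed once the number of failed measurements exceeds failureLimit, which is how that field is documented.

```python
def analysis_verdict(
    error_rates: list[float], threshold: float = 0.02, failure_limit: int = 1
) -> str:
    """Return 'abort' once failed measurements exceed failure_limit, else 'continue'."""
    failures = 0
    for rate in error_rates:
        if not rate < threshold:  # mirrors successCondition: result[0] < 0.02
            failures += 1
            if failures > failure_limit:
                return "abort"
    return "continue"
```

With failureLimit set to 1, a single bad measurement is tolerated and the second one aborts the rollout, which trades a slightly slower reaction for resistance to one-off spikes.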

Feature flags: decoupling deploy from release

Feature flags change the risk model in a way that infrastructure-level canaries do not. A canary controls traffic routing; a flag controls application logic. With flags you can merge and deploy code to main days or weeks before it is visible to any user, run a genuine A/B test in production with real traffic, and roll back a bad feature by toggling a flag rather than reverting a commit and redeploying the entire service.

The operational risk of feature flags is flag debt: flags intended to be temporary accumulate in the codebase, each one adding a conditional branch that must be maintained and tested in both states. Hygiene is non-negotiable — ownership assigned at creation time, a sprint-level review calendar, and auto-expiry where your tooling supports it (LaunchDarkly stale flag detection, OpenFeature lifecycle metadata). A flag hygiene process is as important as the flagging infrastructure itself.
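
Where the flagging tool offers no expiry support, the review list can be produced from a simple registry. A minimal sketch, assuming each flag is recorded with an owner and an agreed removal date at creation time; the `Flag` record and its fields are hypothetical, not any vendor's data model.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Flag:
    name: str
    owner: str
    expires: date  # agreed removal date, set when the flag is created


def overdue_flags(flags: list[Flag], today: date) -> list[Flag]:
    """Flags whose removal date has passed, oldest first, for the sprint review."""
    return sorted((f for f in flags if f.expires < today), key=lambda f: f.expires)
```

Because every record carries an owner, the output doubles as the list of people to ask for a removal PR.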

When Level 3 does not pay off

Progressive delivery is real infrastructure cost and real operational complexity. If you are deploying once per sprint, the overhead of operating Argo Rollouts, writing Prometheus analysis templates, and maintaining a feature-flag service costs more engineer-hours than it saves in incident response. The signal for Level 3 is deploy frequency and team scale, not sophistication. Get to daily or more frequent deployments first. Then progressive delivery starts paying for itself.

Signal that you are at the right level: Deploys are on-demand, change failure rate is in the DORA elite range, and failures are contained to small traffic subsets for short windows before automatic rollback fires.

Decision framework: matching level to team reality

| Team size | Deploy frequency | Primary pain | Right investment |
|---|---|---|---|
| 1–3 engineers | Weekly or less | Manual steps forgotten or error-prone | Level 1: CI on every PR, branch protection |
| 3–8 engineers | Weekly | CI green but staging deploys are manual | Level 2: automated staging deploy, one-click production promotion |
| 8–20 engineers | Daily | Bad deploys take down 100% of production traffic | Level 2 plus canary on production writes |
| 20+ engineers | Multiple times daily | Feature releases block each other; incident rate too high | Level 3: progressive delivery plus feature flags plus flag hygiene |

This table is a heuristic, not a prescription. A 3-person team that deploys 10 times per day because the product requires it may legitimately need Level 3. A 50-person team with compliance-gated releases may run well at Level 2. The diagnostic question is always: what is the most painful thing about deploying right now? Fix that. Do not skip levels to reach a more sophisticated answer to a problem you do not yet have.

Common pitfalls that stall teams at every level

These patterns appear across teams regardless of tool selection and create a ceiling on maturity that no amount of tooling change alone fixes.

Flaky tests at scale. One flaky test is a nuisance. Thirty flaky tests in a suite of 2,000, each failing spuriously on only about 1.3% of runs, means the pipeline fails on roughly one in three runs from intermittent causes unrelated to the change being reviewed. Engineers learn to re-run rather than investigate, and the pipeline stops being a reliable signal. The fix is a quarantine process: on a confirmed flaky failure, the test is skipped and a ticket is filed. The test earns its way back into the suite by passing 50 consecutive isolated runs. Track flaky test rate as a first-class metric on the same dashboard as build duration and success rate.
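
How often the pipeline fails spuriously depends on how often each flaky test fails. Assuming the flaky tests fail independently and at the same rate (a simplification), the chance that at least one of them fails in a given run can be sketched as:

```python
def spurious_failure_rate(flaky_tests: int, per_test_flake_rate: float) -> float:
    """Probability that at least one flaky test fails in a single pipeline run."""
    return 1 - (1 - per_test_flake_rate) ** flaky_tests


if __name__ == "__main__":
    for rate in (0.005, 0.0135, 0.05, 0.20):
        share = spurious_failure_rate(30, rate)
        print(f"{rate:.2%} per test -> {share:.0%} of runs fail spuriously")
```

With 30 flaky tests, a per-test flake rate near 1.35% is already enough to fail about a third of runs, and at the one-in-five rate mentioned under Level 1 nearly every run fails.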

Long pipelines that create feedback dead zones. A PR pipeline that takes over 10 minutes means engineers push multiple commits before getting feedback. Profile your pipeline — GitHub Actions provides built-in per-step timing, GitLab has CI/CD analytics — identify the three slowest steps, parallelise what can be parallelised, and split integration tests into a post-merge job that does not block the PR. A realistic target for the PR-blocking path is under 5 minutes for a medium-sized service.
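
Once per-step timings are exported from the CI provider, picking the optimisation targets is a one-liner. The step names and durations used with this sketch are hypothetical.

```python
import heapq


def slowest_steps(durations_s: dict[str, float], n: int = 3) -> list[tuple[str, float]]:
    """Return the n longest-running steps, slowest first."""
    return heapq.nlargest(n, durations_s.items(), key=lambda item: item[1])
```

For example, `slowest_steps({"npm ci": 95, "lint": 40, "tests": 410, "build": 120})` puts the test step first, which is usually where parallelisation or a post-merge split pays off most.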

Deployment coupling in distributed systems. If deploying Service A requires simultaneously deploying Service B, you have a deployment coupling that makes each deploy riskier and more complex than it needs to be. The root cause is almost always a breaking API change deployed without a backwards-compatible transition period. The fix is API contract testing in CI — Pact is the standard tool for consumer-driven contracts — plus a policy that consumer-facing API changes must be backwards-compatible for at least one full release cycle before the old contract is retired.
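
The idea behind a consumer-driven contract can be shown without any framework. The sketch below is not Pact's API; it is a hand-rolled illustration in which the consumer declares the fields and types it relies on, and the check reports anything the provider's response no longer satisfies.

```python
def breaking_changes(consumer_expects: dict[str, type], provider_response: dict) -> list[str]:
    """List the fields a consumer relies on that the provider no longer satisfies."""
    problems = []
    for field, expected_type in consumer_expects.items():
        if field not in provider_response:
            problems.append(f"missing field: {field}")
        elif not isinstance(provider_response[field], expected_type):
            problems.append(f"wrong type for {field}: expected {expected_type.__name__}")
    return problems
```

Extra fields in the response are deliberately ignored: adding a field is backwards-compatible, while removing or retyping one is not.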

Environment drift that only surfaces in production. Staging that is described as "basically the same as production" rarely is. Configuration differences, data differences, traffic-volume-triggered timing behaviours, and dependency version skew compound until something fails in production that passed in staging. A scheduled drift-detection job — one that diffs staging and production Terraform state, or compares deployed image tags across environments — catches this before it becomes an incident rather than during one.
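
The image-tag comparison is the cheapest drift check to start with. A minimal sketch, assuming you can already export a mapping of service name to deployed tag for each environment (for instance from the ops repo); the service names used with it are hypothetical.

```python
from typing import Optional


def image_drift(
    staging: dict[str, str], production: dict[str, str]
) -> dict[str, tuple[Optional[str], Optional[str]]]:
    """Map each service whose deployed tag differs to its (staging, production) tags.

    A service present in only one environment is reported with None on the other side.
    """
    services = sorted(set(staging) | set(production))
    return {
        svc: (staging.get(svc), production.get(svc))
        for svc in services
        if staging.get(svc) != production.get(svc)
    }
```

A non-empty result is not automatically a problem, since staging is expected to run ahead of production, but a service that exists in only one environment deserves a closer look.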

Treating rollback as the primary safety net. "We can always roll back" is true but costly. A rollback is a deploy that runs under incident conditions by engineers who are stressed, often outside business hours, against a production system that may be partially degraded. The goal of progressive delivery is to make most bad deploys invisible before they reach full traffic — so that rollback is rarely needed. If you find yourself rolling back frequently, the answer is a canary with better analysis metrics, not a faster rollback process.

Building the business case for pipeline investment

Improvements to deployment pipelines rarely sell themselves to engineering leadership without a concrete framing. These are the numbers that tend to move decisions:

Incident recovery cost. If each production incident costs an average of 2 engineering hours to detect, triage, and resolve — a conservative estimate for a team without automated alerting — and you have 20 incidents per year, that is 40 engineer-hours, roughly a full engineer-week, spent on recovery work that creates no product value. Progressive delivery that auto-absorbs incidents before full traffic promotion reclaims that time directly and measurably.

Feedback cycle compounding. DORA 2024 data shows elite teams have a lead time of under 1 hour from commit to production. Medium performers average days to a week. A feature that takes 5 days to reach users instead of 2 hours delivers roughly 60 times fewer production feedback cycles in a month. Across a year of product development this is not a marginal velocity difference — it is a fundamentally different learning rate that compounds into different product outcomes.
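
The factor of 60 follows directly from the two lead times. A short sketch of the arithmetic, assuming a 30-day month counted in calendar hours and that each feedback cycle has to wait one full lead time:

```python
HOURS_PER_MONTH = 30 * 24  # a 30-day month, in calendar hours


def feedback_cycles_per_month(lead_time_hours: float) -> float:
    """Upper bound on ship-and-learn cycles if each cycle waits one full lead time."""
    return HOURS_PER_MONTH / lead_time_hours


slow = feedback_cycles_per_month(5 * 24)  # 5-day lead time
fast = feedback_cycles_per_month(2)       # 2-hour lead time
print(slow, fast, fast / slow)  # prints: 6.0 360.0 60.0
```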

Onboarding cost. A team at Level 0 cannot safely hand off a production deploy to a new engineer in under a week; the knowledge is non-transferable without a structured process. A team at Level 2 with documented, automated pipelines can have a new engineer run an autonomous production deploy on their second day. This is not a trivial benefit in an industry where the time-to-contribution curve is one of the primary hiring ROI drivers.

Elite-performing teams deploy on demand and recover from incidents in under an hour — not because they have better engineers, but because they have removed the manual steps where human error compounds under pressure.

— DORA State of DevOps Report, 2024

What to remember

DORA 2024: roughly 60% of engineering teams sit at medium or low performance. Calibrate investments to your current rung, not to what elite-cluster advice describes.
Level 1 — CI on every PR — is the single highest-ROI improvement available to any team that does not already have it. Do this before anything else.
Build once in CI, promote the same artifact through environments. A rebuild at deploy time is a second, untested build and breaks the artifact immutability guarantee.
Environments as code is the core discipline of Level 2. A staging environment that cannot be recreated from the repository is a snowflake that will eventually cause an incident.
Progressive delivery pays off only when deploy frequency is high enough to generate incidents worth auto-absorbing. Get to daily deployments first; then canaries start paying for themselves.
Flaky tests are a pipeline integrity problem, not a test quality problem. Quarantine and fix them with the same urgency as production bugs — they erode the signal your pipeline provides.
The diagnostic question at every level is the same: what is the most painful thing about deploying right now? Fix that one thing. Do not skip rungs to reach a more sophisticated answer to a problem you do not yet have.
The goal is that deploying is boring, safe and fast for your team. Sophistication is a means to that end, not the end itself.