Zero-downtime deployments: rolling, blue-green and canary

Most teams that suffer deploy-related outages picked the right strategy. They configured a rolling deploy, tested blue-green in staging, maybe even set up a canary. Then the migration ran, the old version started throwing 500s on a column that no longer existed, and the "instant rollback" they were counting on did absolutely nothing about the database. Zero-downtime deployment is not one problem — it is three simultaneous problems: traffic routing, service replacement, and schema migration. Each has different rollback properties, different latency budgets, and different blast radii. Teams that conflate them will eventually have a bad night. This article separates the three, gives you the real Kubernetes configuration for each traffic strategy, and explains why the database section is the only one that should actually keep you up at night.

What a failed deploy actually costs

Before choosing a strategy, understand what you are buying insurance against.

Gartner's widely cited baseline puts IT downtime at roughly $5,600 per minute — a figure from their 2014 study that has been quoted so many times it has taken on a life of its own. More recent survey data paints a worse picture: ITIC's 2024 Hourly Cost of Downtime Survey found that more than 90% of mid-size and large enterprises lose over $300,000 per hour of downtime, and 41% of large enterprises lose between $1 million and $5 million per hour. EMA Research, cited by BigPanda in 2024, puts the average at $14,056 per minute for unplanned outages — rising to $23,750 per minute for large enterprises.

These numbers include direct revenue loss, SLA penalties, support labor, and reputation damage. They do not include the secondary cost that engineers rarely measure: the opportunity cost of an all-hands incident pulling your team off roadmap work for two to four days afterward.

DORA 2024 elite performance benchmarks

| Metric | Elite threshold | Note |
|---|---|---|
| Deploy frequency | On demand | multiple times/day |
| Change failure rate | ~5% | elite threshold |
| Failed deploy recovery | Under 1 hr | renamed from MTTR in 2024 |
| Lead time for changes | Under 1 day | commit to production |

Source: DORA State of DevOps Report, 2024

The DORA 2024 report added a fifth metric — rework rate — and refined the recovery definition to focus specifically on failed-deployment recovery rather than any infrastructure failure. Elite performers recover from bad deploys in under an hour, partly because their deploy strategies are engineered for rapid rollback, and partly because they have the observability to detect a bad deploy within minutes of it hitting production.

The takeaway is not "deploy more." It is: your deploy process should make rollback the cheapest, fastest operation available, so that the cost of a bad deploy is bounded by how quickly you can trigger one.
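To make that bound concrete, here is a rough sketch of the arithmetic. It uses the Gartner per-minute figure quoted above; the function name, the 5-minute detection time, and the two rollback durations are illustrative assumptions, not measurements.

```python
def bad_deploy_cost(cost_per_minute: float, detect_minutes: float, rollback_minutes: float) -> float:
    """Downtime cost of one bad deploy: time to notice plus time to roll back."""
    if min(cost_per_minute, detect_minutes, rollback_minutes) < 0:
        raise ValueError("inputs must be non-negative")
    return cost_per_minute * (detect_minutes + rollback_minutes)


# Illustrative only: 5 minutes to detect, then two different rollback speeds.
GARTNER_PER_MINUTE = 5_600  # figure quoted earlier in this article
slow = bad_deploy_cost(GARTNER_PER_MINUTE, detect_minutes=5, rollback_minutes=10)
fast = bad_deploy_cost(GARTNER_PER_MINUTE, detect_minutes=5, rollback_minutes=0.5)
print(f"10-minute rollback: ${slow:,.0f}")
print(f"30-second rollback: ${fast:,.0f}")
```

Under these assumptions, detection time dominates once rollback is fast, which is why the observability half of the equation matters as much as the deploy strategy.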

DORA's throughput vs. stability lens

DORA clusters teams into four performance tiers — elite, high, medium, and low — based on a composite of deployment frequency, lead time for changes, failed-deployment recovery time, and change failure rate. In 2024, something unusual happened: the medium cluster began showing lower change failure rates than the high cluster. DORA interpreted this as high performers deliberately accepting slightly higher failure rates in exchange for faster deployment cadence — a conscious trade-off, not a measurement error.

This matters for deploy strategy selection. A team deploying several times per day cannot afford the operational overhead of maintaining two full environments for every deploy. Their risk model is different: frequent deploys are individually smaller and therefore lower-risk, making rolling deploys with good observability the economically rational choice. A team deploying weekly ships larger batches, faces higher per-deploy risk, and benefits more from a canary or blue-green gate.

High performers accept a slightly higher change failure rate as the price of deploying more frequently — it is a deliberate trade, not a measurement error.

— DORA State of DevOps, 2024

Rolling deploys: default, not fallback

Kubernetes Deployments implement rolling updates by default. The core configuration that most teams leave at default values is worth understanding precisely:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
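  replicas: 6
  selector:
    matchLabels:
      app: payments-api   # required in apps/v1; must match the pod template labels
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0     # never reduce below 6 healthy pods
      maxSurge: 2           # allow up to 8 pods during rollout
  template:
    metadata:
      labels:
        app: payments-api
    spec: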
      containers:
        - name: api
          image: payments-api:v2.4.1
          readinessProbe:
            httpGet:
              path: /healthz/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 3
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /healthz/live
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10

The pair that matters: maxUnavailable: 0 guarantees you never drop below full capacity during the rollout. maxSurge: 2 controls how many excess pods run simultaneously. On a 6-replica Deployment with maxSurge: 2, Kubernetes creates 2 new pods, waits for them to pass their readiness probe, removes 2 old pods, and repeats. With maxUnavailable: 1 and maxSurge: 0, you replace one pod at a time — slower, but no extra resource cost.
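The interaction between the two settings is easier to see in a small simulation. The sketch below is a simplified model, not the Deployment controller's actual algorithm: it assumes integer values (Kubernetes also accepts percentages), that every pod created in one wave is ready by the next, and that nothing fails. The names simulate_rolling_update and RolloutTrace are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RolloutTrace:
    waves: int          # create/remove rounds until only new pods remain
    peak_pods: int      # most pods existing at once
    min_available: int  # fewest ready pods at any point


def simulate_rolling_update(replicas: int, max_surge: int, max_unavailable: int) -> RolloutTrace:
    """Simplified model: every pod created in one wave is ready by the next."""
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    old, new = replicas, 0
    waves, peak, min_available = 0, replicas, replicas
    while old > 0 or new < replicas:
        waves += 1
        # Remove old pods, but keep at least replicas - maxUnavailable ready.
        removable = min(old, (old + new) - (replicas - max_unavailable))
        old -= removable
        min_available = min(min_available, old + new)
        # Create new pods, but never exceed replicas + maxSurge in total.
        creatable = min(replicas - new, (replicas + max_surge) - (old + new))
        new += creatable
        peak = max(peak, old + new)
    return RolloutTrace(waves, peak, min_available)


print(simulate_rolling_update(6, max_surge=2, max_unavailable=0))
print(simulate_rolling_update(6, max_surge=0, max_unavailable=1))
```

In this model the surge configuration peaks at 8 pods and never drops below 6 ready, while the one-at-a-time configuration stays at 6 pods but dips to 5 ready and needs more waves.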

The readiness probe is the mechanism that makes rolling safe. If a new pod fails its readiness probe, it never receives traffic and the rollout stalls: with maxUnavailable: 0, Kubernetes removes no old pods until their replacements are ready. It does not roll back on its own; after progressDeadlineSeconds (600 by default) the Deployment is only marked as failed to progress. The most common misconfiguration is a readiness probe that passes before the application is actually ready: returning 200 before the JVM has finished warming up its connection pools, or before the service has loaded its config from a secret store. Your probe endpoint must actually exercise the critical path, not just assert "process is alive."
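A readiness endpoint that exercises the critical path might look roughly like the following. This is a minimal sketch using only the Python standard library; the check names and the make_handler helper are hypothetical, and a real service would plug in its own database and cache probes.

```python
import json
from http.server import BaseHTTPRequestHandler
from typing import Callable, Dict, Tuple

# Hypothetical dependency checks: each returns True when the dependency is usable.
Checks = Dict[str, Callable[[], bool]]


def readiness(checks: Checks) -> Tuple[int, dict]:
    """Return (HTTP status, body): 200 only if every check passes, else 503."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False  # a crashing check counts as not ready
    status = 200 if all(results.values()) else 503
    return status, results


def make_handler(checks: Checks):
    """Build a request handler that serves /healthz/ready from the given checks."""
    class ReadyHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/healthz/ready":
                self.send_error(404)
                return
            status, body = readiness(checks)
            payload = json.dumps(body).encode()
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)
    return ReadyHandler
```

Serving it is a matter of passing make_handler({...}) to http.server.HTTPServer on the probe port. Keeping the decision in a pure function makes the 200-versus-503 logic testable without starting a server.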

Rollback under rolling

Rolling rollback — kubectl rollout undo deployment/payments-api — is itself a rolling operation. On 6 replicas, expect 4 to 12 minutes depending on pod startup time and image pull speed. This is the fundamental limitation of the strategy: if your deploy broke production, production stays broken for the full duration of the rollback. For most internal tooling and low-traffic services this is acceptable. For anything handling user-visible transactions at scale, it is not.

Blue-green deploys: instant rollback at a price

Blue-green solves the rollback latency problem by keeping the previous version running alongside the new one. Traffic is routed by a Kubernetes Service selector (or an upstream load balancer rule). Rollback is changing that selector back — which propagates to kube-proxy in well under a second.

# Active service — selector controls which version receives traffic
apiVersion: v1
kind: Service
metadata:
  name: payments-api
spec:
  selector:
    app: payments-api
    slot: blue      # change to "green" to cut over; revert to rollback
  ports:
    - port: 80
      targetPort: 8080

The green Deployment runs at full replica count before the selector is switched. You smoke-test against green directly via a separate service (payments-api-green) before touching the live selector. Cutover is a single kubectl patch:

kubectl patch service payments-api \
  -p '{"spec":{"selector":{"slot":"green"}}}'

Rollback is the same command with slot: blue. The total time from "something is wrong" to "traffic back on old version" is typically under 30 seconds — limited by kube-proxy propagation, not pod startup.
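In an automated pipeline, the cutover is usually gated on the smoke test. The sketch below only builds the kubectl argument list rather than running it; the function names and the smoke_test callback are assumptions for illustration.

```python
import json
from typing import Callable, List


def other_slot(active: str) -> str:
    """Return the idle slot for the given active slot."""
    if active not in ("blue", "green"):
        raise ValueError(f"unknown slot: {active}")
    return "green" if active == "blue" else "blue"


def cutover_command(service: str, active: str, smoke_test: Callable[[str], bool]) -> List[str]:
    """Build the kubectl patch argv that moves traffic to the idle slot.

    Refuses to build a command if the idle slot fails its smoke test.
    """
    target = other_slot(active)
    if not smoke_test(target):
        raise RuntimeError(f"smoke test failed for {target}; leaving traffic on {active}")
    patch = json.dumps({"spec": {"selector": {"slot": target}}})
    return ["kubectl", "patch", "service", service, "-p", patch]


print(cutover_command("payments-api", "blue", smoke_test=lambda slot: True))
```

The returned list can be handed to subprocess.run(cmd, check=True). Rollback is the same call with the slots swapped, which is what keeps it cheap.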

The cost is real and should not be underestimated. During the cutover window you run two full Deployments at 100% replica count each. For a service sized to handle peak load on 20 pods, you are briefly running 40. On a team deploying daily across 10 services, the cumulative cost of maintaining two environments during cutover windows adds up. The window is usually short — minutes for automated pipelines — but it requires capacity headroom that must be pre-provisioned.

Blue-green is the right default for payment processing, authentication, and any flow where a mixed rollout — users on v1 and v2 simultaneously — creates functional inconsistencies that are hard to reason about or remediate.

Canary releases: evidence-based progression

A canary is not a slow rolling deploy. It is a controlled experiment: route a small, measurable fraction of traffic to the new version, observe real metrics (not just "is it up"), and use those metrics to decide whether to advance or abort. Without automated analysis driving the decision, a canary is just a rolling update with extra steps and more ceremony.

Argo Rollouts is the de facto standard for canary deployments on Kubernetes. Its AnalysisTemplate resource lets you define success criteria before the deploy begins:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
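spec:
  args:
    - name: version          # must be declared here to be usable as {{args.version}}
  metrics:
    - name: success-rate
      interval: 60s
      count: 5               # bounded run; with an interval and no count, measurements repeat indefinitely
      successCondition: result[0] >= 0.95
      failureLimit: 3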
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{
              job="payments-api",
              status!~"5..",
              version="{{args.version}}"
            }[5m])) /
            sum(rate(http_requests_total{
              job="payments-api",
              version="{{args.version}}"
            }[5m]))

The Rollout object then references this template at each weight step:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payments-api
spec:
  strategy:
    canary:
      steps:
        - setWeight: 5
        - analysis:
            templates:
              - templateName: success-rate
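            args:
              - name: version
                valueFrom:
                  podTemplateHashValue: Latest   # canary pod-template-hash; assumes the metric's version label carries it
        - pause: {duration: 10m}
        - setWeight: 25
        - analysis:
            templates:
              - templateName: success-rate
            args:
              - name: version
                valueFrom:
                  podTemplateHashValue: Latest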
        - pause: {duration: 10m}
        - setWeight: 100

At each analysis step, Argo runs the template's measurements. If the success rate falls below 95% more times than failureLimit allows (failures are counted in total, not consecutively), the Rollout aborts automatically and routes all traffic back to the stable version. The blast radius is bounded: during the initial 5% step, roughly 1 in 20 requests touches the new code.
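The failure-counting rule can be sketched in a few lines. This models one reading of failureLimit, namely that a measurement fails when it does not meet the success condition and the run fails once the number of failed measurements exceeds the limit. Treat that reading as an assumption and check the Argo Rollouts documentation for your version; analysis_verdict is a hypothetical helper, not part of Argo.

```python
from typing import Iterable


def analysis_verdict(measurements: Iterable[float], threshold: float = 0.95, failure_limit: int = 3) -> str:
    """Return "Failed" as soon as failed measurements exceed failure_limit, else "Successful"."""
    failures = 0
    for value in measurements:
        if value < threshold:
            failures += 1
            if failures > failure_limit:
                return "Failed"
    return "Successful"


# Three scattered dips are tolerated under this model; a fourth is not.
print(analysis_verdict([0.99, 0.90, 0.99, 0.91, 0.92]))
print(analysis_verdict([0.90, 0.99, 0.91, 0.92, 0.93]))
```

The practical consequence is that a generous failureLimit lets a degraded canary keep serving traffic for several intervals before the abort fires.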

Blue-green vs canary: same goal, fundamentally different risk profile

Blue-green (instant rollback)

Routes 100% of traffic to new version after green passes smoke test
Rollback is a load balancer flip: typically under 30 seconds
Requires 2x capacity during the cutover window
All users hit the new version simultaneously
Right for: payment flows, auth, schema-sensitive endpoints

Canary (evidence-based)

Routes a small slice (1-5%) to the new version first
Automated analysis gates each progression step
Rollback triggers automatically when analysis fails
Blast radius is bounded to the canary traffic percentage
Right for: high-traffic APIs, risky logic changes, ML model updates

Source: ClimsTech Engineering; Argo Rollouts documentation, 2024

Typical rollback duration by deployment strategy (lower is better)

| Strategy | Typical rollback duration |
|---|---|
| Rolling deploy (undo) | 8-12 min |
| Blue-green (flip back) | under 1 min |
| Canary (automated abort) | 1-2 min |
| Feature flag toggle | under 15 sec |

Source: Industry consensus; actual times vary by cluster size, image pull speed, and pod startup

Canary without observability is theater

The single most common canary failure mode is having the mechanism in place but not the metrics. If your Prometheus query returns no data (wrong label names, absent metric, clock skew between scrapes), the analysis is not measuring the canary at all: depending on how the condition and limits are configured, it may error out, fail immediately, or pass without evidence. Test your analysis queries against historical data before a live canary. If the query returns nothing for the canary version during the smoke test, your safety gate has no teeth.
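One way to keep an empty result from masquerading as a verdict is to make "no data" an explicit third outcome. A minimal sketch, with hypothetical function names and request counts standing in for whatever your metrics backend returns:

```python
from typing import Optional


def success_rate(ok_requests: float, total_requests: float) -> Optional[float]:
    """Return the success ratio, or None when there is no traffic to measure."""
    if total_requests <= 0:
        return None
    return ok_requests / total_requests


def gate(ok_requests: float, total_requests: float, threshold: float = 0.95) -> str:
    """Classify a measurement as pass, fail, or no-data."""
    rate = success_rate(ok_requests, total_requests)
    if rate is None:
        return "no-data"  # treat as a broken gate, never as a pass
    return "pass" if rate >= threshold else "fail"


print(gate(0, 0))
print(gate(96, 100))
print(gate(90, 100))
```

Running the same check against a past deploy's metrics is a cheap way to confirm the labels in the query actually match what the canary pods emit.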

The database: where zero-downtime actually fails

Traffic routing is the easy part. The thing that ends zero-downtime deploys is the database.

The failure mode is predictable. You need to rename a column. Version N+1 of your application uses email_address. Version N uses email. During the rolling update, pods on both versions run simultaneously. Version N reads email and gets data. Version N+1 reads email_address and gets null or throws an error, depending on whether the migration has run yet. You have a partial outage for the entire duration of the rollout, and "rollback" restores the N pods but not the schema: anything N+1 wrote under the new column name is invisible to the old code.

The fix is expand-migrate-contract, a pattern documented by Pramod Sadalage and Martin Fowler in "Evolutionary Database Design" and by Scott Ambler and Pramod Sadalage in "Refactoring Databases." It treats database and application changes as separate deploys with a strict sequencing discipline.

Expand-migrate-contract: the only safe schema change pattern for rolling and canary deploys

1. Expand

Add the new shape alongside the old. The new column is nullable or has a default so existing rows are valid. Old application code ignores the new column and continues to work. New application code writes to both old and new columns simultaneously.

2. Migrate

Deploy the new application version with dual-write enabled. Backfill existing rows in batches selected by an indexed range condition, never with a single UPDATE across the full table. The application now reads from the new column but still writes to both for safety during the transition.

3. Contract

Ship a follow-up deploy that drops the old column and removes the dual-write path. This deploy ships only after 100% of instances run the new version and after you have confirmed via query logs that nothing reads the old column.

Source: Scott Ambler and Pramod Sadalage, 'Refactoring Databases', 2006; Pramod Sadalage and Martin Fowler, 'Evolutionary Database Design', martinfowler.com

The column rename above requires three deploys, not one. That is the price of zero-downtime migrations. Teams that resist this overhead usually pay it as incident time instead.
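The application side of the transition can be sketched with an in-memory SQLite database. Everything here is illustrative: the v1/v2 function names are hypothetical, and the COALESCE fallback in the new read path is an assumption for the period before the backfill completes, not something the pattern mandates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO users (email) VALUES ('old@example.com')")

# Deploy 1 (expand): the new nullable column appears; old code is unaffected.
conn.execute("ALTER TABLE users ADD COLUMN email_address TEXT")


def create_user_v1(email: str) -> None:
    """Old code path: only knows the old column."""
    conn.execute("INSERT INTO users (email) VALUES (?)", (email,))


def create_user_v2(email: str) -> None:
    """New code path during the transition: dual-writes both columns."""
    conn.execute("INSERT INTO users (email, email_address) VALUES (?, ?)", (email, email))


def read_email_v1(user_id: int):
    """Old read path: the old column is still populated by every writer."""
    return conn.execute("SELECT email FROM users WHERE id = ?", (user_id,)).fetchone()[0]


def read_email_v2(user_id: int):
    """New read path: falls back to the old column until the backfill has finished."""
    row = conn.execute(
        "SELECT COALESCE(email_address, email) FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return row[0]


create_user_v1("v1@example.com")  # id 2, written by an old pod
create_user_v2("v2@example.com")  # id 3, written by a new pod
```

Because both versions can read every row regardless of which version wrote it, rolling back from v2 to v1 at any point in the transition loses nothing.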

Practical SQL for expand-migrate-contract

The expand phase adds the new column without constraints:

-- Deploy 1: Expand (safe to run while old code is live)
ALTER TABLE users ADD COLUMN email_address VARCHAR(255);
-- Do NOT add NOT NULL here — existing rows have no value yet

The backfill runs in batches to avoid table locks and replication lag:

-- Batched backfill: run from a migration script or background job
UPDATE users
SET    email_address = email
WHERE  email_address IS NULL
  AND  id BETWEEN :min_id AND :max_id;
-- Repeat in chunks of 1,000-10,000 rows with a short pause between batches

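A single UPDATE users SET email_address = email without a WHERE clause on a table with tens of millions of rows runs as one long transaction: it holds locks on every row it touches until it commits (and on some engines can effectively block the table), spikes replication lag on your read replicas, and will likely hit your cloud database's statement timeout. Batch it.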
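A batching loop around that statement could look like the sketch below. It uses SQLite so it runs anywhere; the function name, batch size, and pause are assumptions to tune against your own database, and a production job would also want progress logging and resumability.

```python
import sqlite3
import time


def backfill_email_address(conn: sqlite3.Connection, batch_size: int = 1000, pause_seconds: float = 0.0) -> int:
    """Copy email into email_address in id-ordered batches; returns rows updated."""
    updated = 0
    low, max_id = conn.execute("SELECT MIN(id), MAX(id) FROM users").fetchone()
    if low is None:
        return 0  # empty table, nothing to do
    while low <= max_id:
        high = low + batch_size - 1
        cur = conn.execute(
            "UPDATE users SET email_address = email "
            "WHERE email_address IS NULL AND id BETWEEN ? AND ?",
            (low, high),
        )
        conn.commit()  # one short transaction per batch
        updated += cur.rowcount
        low = high + 1
        time.sleep(pause_seconds)  # give replicas time to catch up
    return updated


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, email_address TEXT)")
conn.executemany("INSERT INTO users (email) VALUES (?)", [(f"u{i}@example.com",) for i in range(2500)])
conn.commit()
first_run = backfill_email_address(conn, batch_size=1000)
print(first_run)
```

The IS NULL filter makes the job safe to re-run: rows already populated by dual-writes or by an earlier pass are skipped.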

The contract phase, shipped after full rollout:

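-- Deploy 3: Contract (only after every running pod is on the new version
-- and the backfill has left no NULLs in email_address)
-- PostgreSQL syntax; MySQL uses ALTER TABLE ... MODIFY COLUMN instead
ALTER TABLE users ALTER COLUMN email_address SET NOT NULL;
ALTER TABLE users DROP COLUMN email;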

Feature flags: the orthogonal layer

Feature flags are not a deployment strategy. They are a mechanism for decoupling code deployment from feature activation, and used alongside any of the three strategies above, they eliminate the riskiest class of deploys: the ones where the code change is safe but the business logic change is not.

The model is simple: deploy code with new behavior gated behind a flag. The deploy itself becomes inert — all traffic follows the old path. Once the deploy is fully rolled out and healthy, activate the flag, starting at a small user percentage and ramping. Rollback is toggling the flag off, not re-running a deploy.
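Percentage ramps are typically implemented by hashing each user into a stable bucket, so a user who sees the feature at 10% still sees it at 50%. The following is a minimal sketch of that idea using only the standard library; it is not how any particular flag service implements targeting, and the function names are hypothetical.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """True if the user falls inside the first rollout_percent buckets."""
    return bucket(flag, user_id) < rollout_percent


users = [f"user-{i}" for i in range(1000)]
at_10 = sum(is_enabled("new-checkout", u, 10) for u in users)
print(f"{at_10} of {len(users)} users enabled at 10%")
```

Including the flag name in the hash keeps the same users from landing in the early cohort of every rollout.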

Services like LaunchDarkly, Unleash (self-hosted), and Flipt provide targeting rules that control rollout by user segment, region, or random percentage. The key discipline that teams consistently underinvest in is flag lifecycle management. An activated flag that has been stable for two weeks should be removed from the codebase in the next sprint. Flag debt accumulates silently, creating conditional branches that are impossible to fully test and that make incident debugging significantly harder.

Decision framework

The right strategy depends on three variables: how risky is the change, how fast must rollback complete, and what is your available capacity budget.

| Scenario | Recommended strategy | Reasoning |
|---|---|---|
| Routine bug fix, low-traffic service | Rolling | Low blast radius; rollback latency acceptable |
| Payment or auth service, any change | Blue-green | Mixed versions are unsafe; rollback speed is critical |
| High-traffic API, risky logic change | Canary | Bounded blast radius; automated abort limits exposure |
| Schema change alongside any deploy | Expand-migrate-contract | Non-negotiable regardless of traffic strategy |
| Risky business logic, stable code | Feature flag | Decouple deploy from activation entirely |
| Daily deploys, elite DORA tier | Rolling + feature flags | Blue-green overhead is too high at this cadence |
| ML model update or algorithmic change | Canary with latency analysis | Error rate alone won't catch quality regressions |

No single answer fits all services in an organization. A realistic platform deploys the majority of services via rolling update and reserves blue-green or canary for the 20% of services where a bad deploy has the highest blast radius.

Pitfalls and fixes

1. Readiness probe returns 200 before the app is actually ready

The probe passes, Kubernetes routes traffic, the first real request hits a service that has not connected to its database yet. Fix: your readiness endpoint must verify every critical downstream dependency — database connection pool, Redis, any synchronous config store. A failing dependency returns 503. Accept the slightly longer startup time; it is far cheaper than the alternative.

2. The canary gets no real traffic weighting

Your ingress routes to the Service, the Service selector matches all pods including canary pods, but Argo's traffic splitting is not wired to the ingress controller. Argo Rollouts requires integration with a supported ingress controller (NGINX, Traefik, ALB) or a service mesh (Istio, Linkerd). Without the integration, traffic splitting is best-effort based on pod count ratios, not configured percentage weights. A 5-pod canary on a 20-pod stable version will receive approximately 20% of traffic regardless of your setWeight: 5 config. Read the Argo Rollouts traffic management documentation for your specific ingress before assuming the weight is enforced.
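The pod-ratio arithmetic is worth checking before trusting a weight. A tiny sketch, assuming the Service spreads requests evenly across ready pods (the function name is hypothetical):

```python
def pod_ratio_share(canary_pods: int, stable_pods: int) -> float:
    """Fraction of traffic the canary receives when a Service balances evenly across pods."""
    total = canary_pods + stable_pods
    if canary_pods < 0 or stable_pods < 0 or total == 0:
        raise ValueError("pod counts must be non-negative and not both zero")
    return canary_pods / total


print(f"{pod_ratio_share(5, 20):.0%}")  # the 5-canary, 20-stable example above
print(f"{pod_ratio_share(1, 19):.0%}")  # what a 5% split needs without traffic routing
```

Without a traffic router, the smallest achievable canary share is one pod out of the total, so low percentages require a correspondingly large replica count.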

3. Blue-green with long-lived in-flight connections

A microservice maintains persistent connections when the selector switches. Some proxies drain connections gracefully; others close them immediately. Old blue pods continue handling in-flight requests for up to several seconds after the selector flip. Add a preStop lifecycle hook with a sleep matching your upstream proxy's drain timeout:

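# Pod spec (spec.template.spec in the Deployment)
terminationGracePeriodSeconds: 30   # pod-level field
containers:
  - name: api
    lifecycle:                      # container-level field
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 15"]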

This gives the load balancer time to stop routing new requests to the pod before it begins shutting down. The terminationGracePeriodSeconds must be longer than the preStop sleep or Kubernetes will SIGKILL the pod before in-flight requests complete.

4. Schema migration runs after new pods start receiving traffic

CI/CD pipelines that run migrations as a post-deploy step expose a window where the new application code is live against the old schema. Always run migrations before new pods start receiving traffic. In Kubernetes, implement this as an init container or a pre-deploy Job:

initContainers:
  - name: run-migrations
    image: payments-api:v2.4.1
    command: ["./migrate", "--run"]
    env:
      - name: DATABASE_URL
        valueFrom:
          secretKeyRef:
            name: db-credentials
            key: url

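The init container runs the migration, exits 0, and only then does Kubernetes start the main container and route traffic to it. Because your migrations follow expand-migrate-contract, they are additive-only and therefore safe to run against the live database while the previous version still serves traffic. Note that an init container runs in every pod, so with several replicas the migration command is invoked several times, possibly concurrently: the migration tool must be idempotent and take a lock, or you should use a single pre-deploy Job instead.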
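Whichever mechanism triggers it, the migration command has to tolerate being invoked more than once. The sketch below shows the usual bookkeeping with a version table, using SQLite for portability; the MIGRATIONS list and run_migrations function are hypothetical, and it deliberately omits the cross-process lock (for example a database advisory lock) that concurrent invocations would need.

```python
import sqlite3

# Hypothetical, additive-only migrations keyed by version.
MIGRATIONS = [
    ("001_add_email_address", "ALTER TABLE users ADD COLUMN email_address TEXT"),
]


def run_migrations(conn: sqlite3.Connection) -> list:
    """Apply each migration at most once; returns the versions applied by this call."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_migrations (version TEXT PRIMARY KEY)")
    applied = []
    for version, statement in MIGRATIONS:
        seen = conn.execute(
            "SELECT 1 FROM schema_migrations WHERE version = ?", (version,)
        ).fetchone()
        if seen:
            continue  # an earlier invocation already applied it
        conn.execute(statement)
        conn.execute("INSERT INTO schema_migrations (version) VALUES (?)", (version,))
        conn.commit()
        applied.append(version)
    return applied


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
print(run_migrations(conn))  # first pod applies the migration
print(run_migrations(conn))  # later pods find nothing to do
```

Established migration tools provide this bookkeeping and the locking for you; the point of the sketch is only to show why a second invocation must be a no-op.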

5. PodDisruptionBudget conflicts during rolling updates

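A rolling deploy and a concurrent node drain can compound their effects. A PodDisruptionBudget only limits voluntary evictions such as kubectl drain; the Deployment controller does not consult it during a rolling update, although pods the rollout takes down do count against the budget. So if your PDB allows 1 unavailable pod and the rolling update also terminates 1 pod, you can briefly have 2 unavailable pods simultaneously, below the floor the PDB was supposed to guarantee. Define PDBs explicitly and test them with kubectl drain in staging before encountering this combination in a production incident: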

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-api-pdb
spec:
  minAvailable: "75%"
  selector:
    matchLabels:
      app: payments-api

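Setting minAvailable as a percentage scales with the replica count automatically. Kubernetes rounds the result up, however, so at very small replica counts the budget can still leave no room for evictions: 75% of 3 replicas rounds up to 3. An absolute value of minAvailable: 4 on a Deployment you later scale down to 3 replicas will prevent node drains from proceeding at all.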
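The budget arithmetic is simple enough to check by hand or in a few lines. This sketch assumes percentage values are rounded up to a whole pod count and that all existing pods are healthy; allowed_disruptions is a hypothetical helper, not a Kubernetes API.

```python
from typing import Union


def allowed_disruptions(healthy_pods: int, min_available: Union[int, str], replicas: int) -> int:
    """How many pods may be evicted right now without violating minAvailable."""
    if isinstance(min_available, str):
        percent = int(min_available.rstrip("%"))
        required = (replicas * percent + 99) // 100  # integer ceiling of replicas * percent / 100
    else:
        required = min_available
    return max(0, healthy_pods - required)


print(allowed_disruptions(6, "75%", replicas=6))  # 75% of 6 rounds up to 5 required
print(allowed_disruptions(3, 4, replicas=3))      # absolute 4 on 3 replicas
print(allowed_disruptions(3, "75%", replicas=3))  # 75% of 3 rounds up to 3 required
```

A result of 0 means a node drain will wait indefinitely on that workload, which is exactly the situation worth discovering in staging.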

6. Canary analysis only checks error rate

A new version that responds with 200 but at three times the latency degrades user experience without tripping an error-rate gate. Add a latency percentile query alongside the success-rate check:

- name: p99-latency
  interval: 60s
  successCondition: result[0] <= 0.5   # 500ms p99 ceiling
  failureLimit: 2
  provider:
    prometheus:
      address: http://prometheus.monitoring:9090
      query: |
        histogram_quantile(0.99,
          sum(rate(http_request_duration_seconds_bucket{
            job="payments-api",
            version="{{args.version}}"
          }[5m])) by (le)
        )

For ML-serving endpoints, you may also want a business-metric check — click-through rate, conversion rate, or model confidence score — alongside the infrastructure signal. Argo Rollouts supports multiple analysis metrics in parallel; a rollout advances only when all metrics pass.

What to remember

Rolling is the correct default for most services — set maxUnavailable: 0 and a non-zero maxSurge so Kubernetes never drops below full capacity during a rollout.
Blue-green gives you sub-30-second rollback at the cost of 2x capacity during cutover. Justified for payment flows, auth, and any endpoint where mixed versions are functionally unsafe.
Canary without automated Prometheus or Datadog analysis is just a slow rolling deploy with extra overhead. Wire the AnalysisTemplate before the first production canary.
The database is always the hardest part. Expand-migrate-contract is not optional — it is the only pattern that makes rolling and canary safe for schema changes.
Run migrations before new pods receive traffic. Use an init container or a pre-deploy Kubernetes Job; never rely on a post-deploy migration step.
Feature flags decouple code deployment from feature activation. They complement all three traffic strategies but replace none of them.
Define PodDisruptionBudgets explicitly and test them during staged node drain exercises — not for the first time during a production rolling update.
For canary analysis, check latency percentiles alongside error rate. A regression that returns 200 at 3x the latency will never trip an error-rate-only gate.