Most teams that suffer deploy-related outages picked the right strategy. They configured a rolling deploy, tested blue-green in staging, maybe even set up a canary. Then the migration ran, the old version started throwing 500s on a column that no longer existed, and the "instant rollback" they were counting on did absolutely nothing about the database. Zero-downtime deployment is not one problem — it is three simultaneous problems: traffic routing, service replacement, and schema migration. Each has different rollback properties, different latency budgets, and different blast radii. Teams that conflate them will eventually have a bad night. This article separates the three, gives you the real Kubernetes configuration for each traffic strategy, and explains why the database section is the only one that should actually keep you up at night.
What a failed deploy actually costs
Before choosing a strategy, understand what you are buying insurance against.
Gartner's widely cited baseline puts IT downtime at roughly $5,600 per minute — a figure from their 2014 study that has been quoted so many times it has taken on a life of its own. More recent survey data paints a worse picture: ITIC's 2024 Hourly Cost of Downtime Survey found that more than 90% of mid-size and large enterprises lose over $300,000 per hour of downtime, and 41% of large enterprises lose between $1 million and $5 million per hour. EMA Research, cited by BigPanda in 2024, puts the average at $14,056 per minute for unplanned outages — rising to $23,750 per minute for large enterprises.
These numbers include direct revenue loss, SLA penalties, support labor, and reputation damage. They do not include the secondary cost that engineers rarely measure: the opportunity cost of an all-hands incident pulling your team off roadmap work for two to four days afterward.
| Metric | Value | Note |
|---|---|---|
| Deploy frequency | On-demand | multiple times/day |
| Change failure rate | ~5% | elite threshold |
| Failed deploy recovery | Under 1 hr | renamed from MTTR in 2024 |
| Lead time for changes | Under 1 hr | commit to production |

Source: DORA State of DevOps Report, 2024
The DORA 2024 report added a fifth metric — rework rate — and refined the recovery definition to focus specifically on failed-deployment recovery rather than any infrastructure failure. Elite performers recover from bad deploys in under an hour, partly because their deploy strategies are engineered for rapid rollback, and partly because they have the observability to detect a bad deploy within minutes of it hitting production.
The takeaway is not "deploy more." It is: your deploy process should make rollback the cheapest, fastest operation available, so that the cost of a bad deploy is bounded by how quickly you can trigger one.
DORA's throughput vs. stability lens
DORA clusters teams into four performance tiers — elite, high, medium, and low — based on a composite of deployment frequency, lead time for changes, failed-deployment recovery time, and change failure rate. In 2024, something unusual happened: the medium cluster began showing lower change failure rates than the high cluster. DORA interpreted this as high performers deliberately accepting slightly higher failure rates in exchange for faster deployment cadence — a conscious trade-off, not a measurement error.
This matters for deploy strategy selection. A team deploying several times per day cannot afford the operational overhead of maintaining two full environments for every deploy. Their risk model is different: frequent deploys are individually smaller and therefore lower-risk, making rolling deploys with good observability the economically rational choice. A team deploying weekly ships larger batches, faces higher per-deploy risk, and benefits more from a canary or blue-green gate.
Rolling deploys: default, not fallback
Kubernetes Deployments implement rolling updates by default. The core configuration that most teams leave at default values is worth understanding precisely:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  replicas: 6
  selector:                 # required; must match the pod template labels
    matchLabels:
      app: payments-api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0     # never reduce below 6 healthy pods
      maxSurge: 2           # allow up to 8 pods during rollout
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      containers:
        - name: api
          image: payments-api:v2.4.1
          readinessProbe:
            httpGet:
              path: /healthz/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 3
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /healthz/live
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
```

The pair that matters: maxUnavailable: 0 guarantees you never drop below full capacity during the rollout, and maxSurge: 2 controls how many excess pods run simultaneously. On a 6-replica Deployment with maxSurge: 2, Kubernetes creates 2 new pods, waits for them to pass their readiness probe, removes 2 old pods, and repeats. With maxUnavailable: 1 and maxSurge: 0, you replace one pod at a time, which is slower but needs no extra capacity.
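To see what a given pair of settings allows before applying it, the bounds can be computed by hand. The sketch below is illustrative only: `resolve` and `rollout_bounds` are hypothetical helper names, and the rounding follows the documented Kubernetes behavior for percentage values (maxSurge rounds up, maxUnavailable rounds down).

```python
import math


def resolve(value, replicas, round_up):
    """Resolve an absolute count or a percentage string such as "25%" to a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        exact = replicas * int(value[:-1]) / 100
        return math.ceil(exact) if round_up else math.floor(exact)
    return int(value)


def rollout_bounds(replicas, max_surge, max_unavailable):
    """Return (minimum available pods, maximum total pods) during a rolling update."""
    surge = resolve(max_surge, replicas, round_up=True)               # maxSurge rounds up
    unavailable = resolve(max_unavailable, replicas, round_up=False)  # maxUnavailable rounds down
    return replicas - unavailable, replicas + surge


print(rollout_bounds(6, 2, 0))          # the configuration above -> (6, 8)
print(rollout_bounds(6, "25%", "25%"))  # the Deployment defaults -> (5, 8)
```

On 6 replicas the default 25%/25% settings let availability dip to 5 pods, which is why the explicit maxUnavailable: 0 matters for a service sized tightly to its load.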
The readiness probe is the mechanism that makes rolling safe. If a new pod fails its readiness probe, it never receives traffic, and with maxUnavailable: 0 the rollout stalls with the existing pods left in place. Kubernetes does not roll back on its own; the Deployment is marked as failed to progress once progressDeadlineSeconds elapses, and reverting is still your call. The most common misconfiguration is a readiness probe that passes before the application is actually ready: returning 200 before the JVM has finished warming up its connection pools, or before the service has loaded its config from a secret store. Your probe endpoint must actually exercise the critical path, not just assert "process is alive."
Rollback under rolling
Rolling rollback — kubectl rollout undo deployment/payments-api — is itself a rolling operation. On 6 replicas, expect 4 to 12 minutes depending on pod startup time and image pull speed. This is the fundamental limitation of the strategy: if your deploy broke production, production stays broken for the full duration of the rollback. For most internal tooling and low-traffic services this is acceptable. For anything handling user-visible transactions at scale, it is not.
Blue-green deploys: instant rollback at a price
Blue-green solves the rollback latency problem by keeping the previous version running alongside the new one. Traffic is routed by a Kubernetes Service selector (or an upstream load balancer rule). Rollback is changing that selector back — which propagates to kube-proxy in well under a second.
```yaml
# Active service: selector controls which version receives traffic
apiVersion: v1
kind: Service
metadata:
  name: payments-api
spec:
  selector:
    app: payments-api
    slot: blue   # change to "green" to cut over; revert to roll back
  ports:
    - port: 80
      targetPort: 8080
```

The green Deployment runs at full replica count before the selector is switched. You smoke-test against green directly via a separate service (payments-api-green) before touching the live selector. Cutover is a single kubectl patch:

```shell
kubectl patch service payments-api \
  -p '{"spec":{"selector":{"slot":"green"}}}'
```

Rollback is the same command with slot: blue. The total time from "something is wrong" to "traffic back on old version" is typically under 30 seconds, limited by kube-proxy propagation rather than pod startup.
The cost is real and should not be underestimated. During the cutover window you run two full Deployments at 100% replica count each. For a service sized to handle peak load on 20 pods, you are briefly running 40. On a team deploying daily across 10 services, the cumulative cost of maintaining two environments during cutover windows adds up. The window is usually short — minutes for automated pipelines — but it requires capacity headroom that must be pre-provisioned.
Blue-green is the right default for payment processing, authentication, and any flow where a mixed rollout — users on v1 and v2 simultaneously — creates functional inconsistencies that are hard to reason about or remediate.
Canary releases: evidence-based progression
A canary is not a slow rolling deploy. It is a controlled experiment: route a small, measurable fraction of traffic to the new version, observe real metrics (not just "is it up"), and use those metrics to decide whether to advance or abort. Without automated analysis driving the decision, a canary is just a rolling update with extra steps and more ceremony.
Argo Rollouts is the de facto standard for canary deployments on Kubernetes. Its AnalysisTemplate resource lets you define success criteria before the deploy begins:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: version     # supplied by the Rollout at each analysis step
  metrics:
    - name: success-rate
      interval: 60s
      count: 5          # bounded run; with interval alone the metric runs indefinitely
      successCondition: result[0] >= 0.95
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{
              job="payments-api",
              status!~"5..",
              version="{{args.version}}"
            }[5m])) /
            sum(rate(http_requests_total{
              job="payments-api",
              version="{{args.version}}"
            }[5m]))
```

The Rollout object then references this template at each weight step:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payments-api
spec:
  strategy:
    canary:
      steps:
        - setWeight: 5
        - analysis:
            templates:
              - templateName: success-rate
            args:
              - name: version
                # Placeholder rendered by the deploy pipeline before apply;
                # Argo Rollouts does not expand this expression itself.
                value: "{{inputs.parameters.revision}}"
        - pause: {duration: 10m}
        - setWeight: 25
        - analysis:
            templates:
              - templateName: success-rate
            args:                # the template needs the same argument here
              - name: version
                value: "{{inputs.parameters.revision}}"
        - pause: {duration: 10m}
        - setWeight: 100
```

At each analysis step, Argo runs the template and blocks until it completes. failureLimit: 3 tolerates three failed measurements, so a fourth measurement below 95% fails the analysis; the Rollout then aborts automatically and routes all traffic back to the stable version. The blast radius is bounded: during the initial 5% step, roughly 1 in 20 requests reaches the new code.
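The gating behavior is easier to reason about with a toy model. The sketch below is a simplified illustration of how a failure budget interacts with a series of measurements, assuming failureLimit counts total tolerated failures rather than consecutive ones; `evaluate_metric` is a hypothetical name, and this is not Argo's implementation.

```python
def evaluate_metric(measurements, threshold=0.95, failure_limit=3):
    """Return "Failed" once more than failure_limit measurements fall below threshold."""
    failures = 0
    for value in measurements:
        if value < threshold:
            failures += 1
        if failures > failure_limit:
            return "Failed"  # budget exhausted: the rollout would abort here
    return "Successful"


# Three dips are tolerated, the fourth is not.
print(evaluate_metric([0.99, 0.90, 0.99, 0.91, 0.93]))  # Successful
print(evaluate_metric([0.90, 0.92, 0.91, 0.93]))        # Failed
```

A generous failure budget on a short analysis window can let a clearly degraded canary advance, so it is worth choosing the limit and the number of measurements together.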
Blue-green
- Routes 100% of traffic to new version after green passes smoke test
- Rollback is a load balancer flip: typically under 30 seconds
- Requires 2x capacity during the cutover window
- All users hit the new version simultaneously
- Right for: payment flows, auth, schema-sensitive endpoints
Canary
- Routes a small slice (1-5%) to the new version first
- Automated analysis gates each progression step
- Rollback triggers automatically when analysis fails
- Blast radius is bounded to the canary traffic percentage
- Right for: high-traffic APIs, risky logic changes, ML model updates
Canary without observability is theater
The single most common canary failure mode is having the mechanism in place but not the metrics. If your Prometheus query returns no data — wrong label names, absent metric, clock skew between scrapes — the AnalysisTemplate will either silently pass or immediately fail depending on your failureLimit config. Test your analysis queries against historical data before a live canary. If the query returns zero for the canary version during the smoke test, your safety gate has no teeth.
The database: where zero-downtime actually fails
Traffic routing is the easy part. The thing that ends zero-downtime deploys is the database.
The failure mode is predictable. You need to rename a column: version N of your application reads email, and version N+1 reads email_address. During the rolling update, pods on both versions run simultaneously against a single schema. If the rename has already run, version N throws errors on a column that no longer exists; if it has not, version N+1 reads a column that does not exist yet. Either way you have a partial outage for the entire duration of the rollout, and "rollback" restores the N pods without restoring the schema they expect.
The fix is expand-migrate-contract, a pattern documented by Pramod Sadalage and Martin Fowler in "Evolutionary Database Design" and by Scott Ambler and Pramod Sadalage in "Refactoring Databases." It treats database and application changes as separate deploys with a strict sequencing discipline.
1. Expand: Add the new shape alongside the old. The new column is nullable or has a default so existing rows are valid. Old application code ignores the new column and continues to work. New application code writes to both old and new columns simultaneously.
2. Migrate: Deploy the new application version with dual-write enabled. Backfill existing rows in batches bounded by an indexed range condition, never in a single UPDATE across the full table. The application now reads from the new column but still writes to both for safety during the transition.
3. Contract: Ship a follow-up deploy that drops the old column and removes the dual-write path. This deploy ships only after 100% of instances run the new version and after you have confirmed via query logs that nothing reads the old column.
Source: Scott W. Ambler and Pramod J. Sadalage, 'Refactoring Databases', 2006; Pramod Sadalage and Martin Fowler, 'Evolutionary Database Design', martinfowler.com
The column rename above requires three deploys, not one. That is the price of zero-downtime migrations. Teams that resist this overhead usually pay it as incident time instead.
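On the application side, the transition period comes down to two small rules: write both columns, and read the new one with a fallback. The following is a minimal sketch using an in-memory SQLite table as a stand-in for the real database; `save_email` and `load_email` are hypothetical helper names, and the column names match the rename example.

```python
import sqlite3


def save_email(conn, user_id, email):
    """Dual-write: keep the old and new columns in step during the transition."""
    conn.execute(
        "UPDATE users SET email = ?, email_address = ? WHERE id = ?",
        (email, email, user_id),
    )


def load_email(conn, user_id):
    """Read the new column, falling back to the old one for rows not yet backfilled."""
    row = conn.execute(
        "SELECT COALESCE(email_address, email) FROM users WHERE id = ?",
        (user_id,),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, email_address TEXT)")
conn.execute("INSERT INTO users (id, email) VALUES (1, 'old@example.com')")  # written by version N

print(load_email(conn, 1))  # readable before the backfill reaches this row
save_email(conn, 1, "new@example.com")
print(conn.execute("SELECT email, email_address FROM users WHERE id = 1").fetchone())
```

Because the old column stays current throughout, rolling the application back to version N during this window loses nothing.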
Practical SQL for expand-migrate-contract
The expand phase adds the new column without constraints:
```sql
-- Deploy 1: Expand (safe to run while old code is live)
ALTER TABLE users ADD COLUMN email_address VARCHAR(255);
-- Do NOT add NOT NULL here: existing rows have no value yet
```

The backfill runs in batches to avoid table locks and replication lag:
```sql
-- Batched backfill: run from a migration script or background job
UPDATE users
SET email_address = email
WHERE email_address IS NULL
  AND id BETWEEN :min_id AND :max_id;
-- Repeat in chunks of 1,000-10,000 rows with a short pause between batches
```

A single UPDATE users SET email_address = email without a WHERE clause on a table with tens of millions of rows will lock the table, spike replication lag on your read replicas, and likely hit your cloud database's statement timeout. Batch it.
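A driver loop for the batched statement might look like the sketch below. It uses an in-memory SQLite table so it runs anywhere; the `backfill` function, batch size, and pause are illustrative assumptions, and a production version would use your real database driver and tune both values against observed replication lag.

```python
import sqlite3
import time


def backfill(conn, batch_size=1000, pause_seconds=0.0):
    """Copy email into email_address one id range at a time; return rows updated."""
    max_id = conn.execute("SELECT COALESCE(MAX(id), 0) FROM users").fetchone()[0]
    updated = 0
    for start in range(1, max_id + 1, batch_size):
        cur = conn.execute(
            "UPDATE users SET email_address = email "
            "WHERE email_address IS NULL AND id BETWEEN ? AND ?",
            (start, start + batch_size - 1),
        )
        conn.commit()              # one short transaction per batch
        updated += cur.rowcount
        time.sleep(pause_seconds)  # give replicas time to catch up
    return updated


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, email_address TEXT)")
conn.executemany(
    "INSERT INTO users (id, email) VALUES (?, ?)",
    [(i, f"user{i}@example.com") for i in range(1, 2501)],
)
conn.commit()

total = backfill(conn, batch_size=1000)
remaining = conn.execute(
    "SELECT COUNT(*) FROM users WHERE email_address IS NULL"
).fetchone()[0]
print(total, remaining)  # 2500 0
```

The IS NULL guard makes the job safe to re-run after an interruption: rows already copied are skipped.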
The contract phase, shipped after full rollout:
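```sql
-- Deploy 3: Contract (only after every running pod is on the new version)
-- Enforce NOT NULL first: if any row was missed by the backfill, this fails
-- while the old column still exists.
ALTER TABLE users ALTER COLUMN email_address SET NOT NULL;
ALTER TABLE users DROP COLUMN email;
```

Feature flags: the orthogonal layer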
Feature flags are not a deployment strategy. They are a mechanism for decoupling code deployment from feature activation, and used alongside any of the three strategies above, they eliminate the riskiest class of deploys: the ones where the code change is safe but the business logic change is not.
The model is simple: deploy code with new behavior gated behind a flag. The deploy itself becomes inert — all traffic follows the old path. Once the deploy is fully rolled out and healthy, activate the flag, starting at a small user percentage and ramping. Rollback is toggling the flag off, not re-running a deploy.
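Percentage ramps are usually implemented by hashing a stable identifier into a bucket, so a given user gets the same answer on every request and stays enabled as the percentage grows. The sketch below illustrates the idea; `in_rollout` is a hypothetical function and is not the API of any particular flag service.

```python
import hashlib


def in_rollout(flag_name, user_id, percentage):
    """Return True if this user falls inside the flag's rollout percentage."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return bucket < percentage


# Ramping from 5% to 25% only ever adds users; nobody flips back to the old path.
users = [f"user-{i}" for i in range(1000)]
at_5 = {u for u in users if in_rollout("new-checkout", u, 5)}
at_25 = {u for u in users if in_rollout("new-checkout", u, 25)}
print(at_5 <= at_25)  # True
```

Including the flag name in the hash input keeps different flags from always selecting the same group of users.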
Services like LaunchDarkly, Unleash (self-hosted), and Flipt provide targeting rules that control rollout by user segment, region, or random percentage. The key discipline that teams consistently underinvest in is flag lifecycle management. An activated flag that has been stable for two weeks should be removed from the codebase in the next sprint. Flag debt accumulates silently, creating conditional branches that are impossible to fully test and make incident debugging significantly harder.
Decision framework
The right strategy depends on three variables: how risky is the change, how fast must rollback complete, and what is your available capacity budget.
| Scenario | Recommended strategy | Reasoning |
|---|---|---|
| Routine bug fix, low-traffic service | Rolling | Low blast radius; rollback latency acceptable |
| Payment or auth service, any change | Blue-green | Mixed versions are unsafe; rollback speed is critical |
| High-traffic API, risky logic change | Canary | Bounded blast radius; automated abort limits exposure |
| Schema change alongside any deploy | Expand-migrate-contract | Non-negotiable regardless of traffic strategy |
| Risky business logic, stable code | Feature flag | Decouple deploy from activation entirely |
| Daily deploys, elite DORA tier | Rolling + feature flags | Blue-green overhead is too high at this cadence |
| ML model update or algorithmic change | Canary with latency analysis | Error rate alone won't catch quality regressions |
No single answer fits all services in an organization. A realistic platform deploys the majority of services via rolling update and reserves blue-green or canary for the 20% of services where a bad deploy has the highest blast radius.
Pitfalls and fixes
1. Readiness probe returns 200 before the app is actually ready
The probe passes, Kubernetes routes traffic, the first real request hits a service that has not connected to its database yet. Fix: your readiness endpoint must verify every critical downstream dependency — database connection pool, Redis, any synchronous config store. A failing dependency returns 503. Accept the slightly longer startup time; it is far cheaper than the alternative.
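A readiness endpoint along these lines can be sketched with the standard library alone. The check functions here are placeholders (assumptions for illustration); in a real service each one would ping the actual dependency with a short timeout.

```python
from http.server import BaseHTTPRequestHandler


def readiness(checks):
    """Run each dependency check; return (HTTP status, names of failed checks)."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that raises counts as a failed dependency
        if not ok:
            failed.append(name)
    return (200 if not failed else 503), failed


# Hypothetical checks: replace the lambdas with real pings.
CHECKS = {
    "database": lambda: True,
    "redis": lambda: True,
}


class ReadyHandler(BaseHTTPRequestHandler):
    """Serves /healthz/ready; pass this class to http.server.HTTPServer to run it."""

    def do_GET(self):
        if self.path != "/healthz/ready":
            self.send_error(404)
            return
        status, failed = readiness(CHECKS)
        body = (",".join(failed) or "ok").encode()
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

Returning the names of the failed dependencies in the body makes a stalled rollout much quicker to diagnose from the pod's probe events and logs.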
2. The canary gets no real traffic weighting
Your ingress routes to the Service, the Service selector matches all pods including canary pods, but Argo's traffic splitting is not wired to the ingress controller. Argo Rollouts requires integration with a supported ingress controller (NGINX, Traefik, ALB) or a service mesh (Istio, Linkerd). Without the integration, traffic splitting is best-effort based on pod count ratios, not configured percentage weights. A 5-pod canary on a 20-pod stable version will receive approximately 20% of traffic regardless of your setWeight: 5 config. Read the Argo Rollouts traffic management documentation for your specific ingress before assuming the weight is enforced.
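The arithmetic behind that 20% figure is worth checking for your own replica counts. This sketch assumes the Service spreads requests evenly across ready pods, which is an approximation; `pod_ratio_share` is a hypothetical helper.

```python
def pod_ratio_share(canary_pods, stable_pods):
    """Approximate canary traffic share when requests spread evenly across pods."""
    return canary_pods / (canary_pods + stable_pods)


for canary, stable in [(5, 20), (1, 19)]:
    share = pod_ratio_share(canary, stable)
    print(f"{canary} canary + {stable} stable pods -> {share:.0%} of traffic")
# 5 canary + 20 stable pods -> 20% of traffic
# 1 canary + 19 stable pods -> 5% of traffic
```

Without a traffic router, a 5% weight is only reachable with a pod ratio of about 1 to 19, so small Deployments cannot express small weights at all.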
3. Blue-green with long-lived in-flight connections
A microservice maintains persistent connections when the selector switches. Some proxies drain connections gracefully; others close them immediately. Old blue pods continue handling in-flight requests for up to several seconds after the selector flip. Add a preStop lifecycle hook with a sleep matching your upstream proxy's drain timeout:
```yaml
spec:
  terminationGracePeriodSeconds: 30   # pod-level; must exceed the preStop sleep
  containers:
    - name: api
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 15"]
```

This gives the load balancer time to stop routing new requests to the pod before it begins shutting down. The terminationGracePeriodSeconds value, which is set on the pod spec rather than on the container, must be longer than the preStop sleep or Kubernetes will SIGKILL the pod before in-flight requests complete.
4. Schema migration runs after new pods start receiving traffic
CI/CD pipelines that run migrations as a post-deploy step expose a window where the new application code is live against the old schema. Always run migrations before new pods start receiving traffic. In Kubernetes, implement this as an init container or a pre-deploy Job:
```yaml
initContainers:
  - name: run-migrations
    image: payments-api:v2.4.1
    command: ["./migrate", "--run"]
    env:
      - name: DATABASE_URL
        valueFrom:
          secretKeyRef:
            name: db-credentials
            key: url
```

The init container runs the migration, exits 0, and only then does Kubernetes start the main container and route traffic to it. Every new pod runs its own init containers, so the migration command must be idempotent and safe when several pods start at once. Because each deploy carries only one phase of expand-migrate-contract, the migrations that ship alongside new code are additive and therefore safe to run against the live database while the previous version still serves traffic.
5. PodDisruptionBudget conflicts during rolling updates
A rolling deploy and a concurrent node drain can compound their effects. A PodDisruptionBudget limits only voluntary evictions such as kubectl drain; the Deployment controller is not bound by it during a rolling update. If your PDB allows 1 unavailable pod and a drain evicts one just as the rolling update terminates another, you can briefly have 2 unavailable pods, below the floor the PDB was meant to guarantee. Define PDBs explicitly and test them with kubectl drain in staging before encountering this combination in a production incident:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-api-pdb
spec:
  minAvailable: "75%"
  selector:
    matchLabels:
      app: payments-api
```

Setting minAvailable as a percentage handles replica count changes automatically. An absolute value of minAvailable: 4 on a Deployment you later scale down to 3 replicas will prevent node drains from proceeding at all.
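How many evictions a budget actually permits can be worked out ahead of time. The sketch below assumes the documented rounding for percentage values (Kubernetes rounds minAvailable percentages up to the nearest whole pod); the function names are hypothetical.

```python
import math


def required_available(min_available, expected_pods):
    """Resolve minAvailable (absolute count or percentage string) to a pod count."""
    if isinstance(min_available, str) and min_available.endswith("%"):
        return math.ceil(expected_pods * int(min_available[:-1]) / 100)  # rounds up
    return int(min_available)


def allowed_disruptions(healthy_pods, expected_pods, min_available):
    """Number of pods a drain may evict right now without violating the budget."""
    return max(0, healthy_pods - required_available(min_available, expected_pods))


print(allowed_disruptions(6, 6, "75%"))  # 1: 75% of 6 rounds up to 5 required
print(allowed_disruptions(3, 3, 4))      # 0: absolute value blocks every drain
print(allowed_disruptions(3, 3, "75%"))  # 0: 75% of 3 rounds up to 3 required
```

The last case suggests that a percentage is not a complete safeguard either: at small replica counts the round-up can still leave no room for a drain, so it is worth checking the budget against the lowest replica count you expect to run.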
6. Canary analysis only checks error rate
A new version that responds with 200 but at three times the latency degrades user experience without tripping an error-rate gate. Add a latency percentile query alongside the success-rate check:
```yaml
- name: p99-latency
  interval: 60s
  successCondition: result[0] <= 0.5   # 500ms p99 ceiling
  failureLimit: 2
  provider:
    prometheus:
      address: http://prometheus.monitoring:9090
      query: |
        histogram_quantile(0.99,
          sum(rate(http_request_duration_seconds_bucket{
            job="payments-api",
            version="{{args.version}}"
          }[5m])) by (le)
        )
```

For ML-serving endpoints, you may also want a business-metric check (click-through rate, conversion rate, or model confidence score) alongside the infrastructure signal. Argo Rollouts supports multiple analysis metrics in parallel; a rollout advances only when all metrics pass.