Cell-based architecture: buying down blast radius

The scariest outages aren't the ones where a server dies — those are easy, health checks catch them and traffic routes away. The scary ones are the gray failures: a slow dependency, a poisoned cache, a bad config that makes a system mostly-work in a way that fools every health check while it quietly corrupts or stalls everything behind it. Cell-based architecture is a direct answer to that class of failure. The principle is deliberately simple: stop building one big shared system that can fail all at once, and start building many small ones that can only fail a fraction at a time. The operational payoff — and the reason Slack, Amazon, and others have staked serious engineering years on it — is that "drain one cell and route around it" is a runbook step, while "debug a production system under full load while it's partially failing" is a multi-hour incident.

What a cell actually is

A cell is a complete, independent deployment of a service stack — its own compute, its own data store, its own instances of every shared dependency — able to serve a defined subset of traffic without touching any other cell. "Completely independent" carries most of the weight here. A cell that shares a database with its neighbors is not a cell; it is a replica. A cell that calls a shared authentication service on every request has a shared dependency that can fail all cells simultaneously.

The canonical partition key is user identity or tenant ID. Each user is hashed to a cell, and their requests land there for the life of their session or until a deliberate migration. Other partition schemes are common:

Geographic shards: cells are AZ-local or region-local slices; traffic is routed by client geography. Common in global consumer apps with data-residency requirements.
Tenant tiers: enterprise tenants on dedicated cells, self-serve users on shared multi-tenant cells. A bad enterprise tenant deploy is contained to their cell.
Traffic-type partitions: read path in one cell, write path in another, with async replication bridging them. Useful when read and write scaling characteristics diverge significantly.

Cell count is a design decision with measurable tradeoffs. AWS's re:Invent 2024 presentation ARC312 on cell-based resilience frames it directly: in an 8-cell system, a failure that fully degrades two cells produces 25% customer impact. Double the cell count to 16, and the same incident touches 12.5%. The returns diminish as cells multiply — the operational overhead of provisioning, observing, and deploying to each cell grows linearly, while the marginal reliability gain from adding the 17th cell is small. Most production systems land between 8 and 16 cells as the practical sweet spot.
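The arithmetic behind those percentages can be sketched in a few lines of Go. This is a minimal illustration that assumes equally sized cells and fully degraded failures; the function name `blastRadius` is hypothetical and not taken from the AWS talk.

```go
package main

import "fmt"

// blastRadius returns the fraction of customers affected when failedCells
// out of cellCount equally sized cells are fully degraded.
func blastRadius(cellCount, failedCells int) float64 {
	if cellCount <= 0 || failedCells >= cellCount {
		return 1.0 // no surviving cell boundary: everyone is affected
	}
	if failedCells < 0 {
		failedCells = 0
	}
	return float64(failedCells) / float64(cellCount)
}

func main() {
	// Compare the same two-cell incident across different cell counts.
	for _, n := range []int{8, 16, 17} {
		fmt.Printf("%2d cells, 2 failed: %.2f%% of customers affected\n", n, 100*blastRadius(n, 2))
	}
}
```

Running it for 16 and 17 cells shows how little the extra cell changes the result, which is the diminishing-returns point made above.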

Architecture

A cell-aware router fans traffic to independent cells. Cell 3 is degraded and draining, so the failure is contained. The other cells keep serving normally.

Source: AWS re:Invent 2024, ARC312

Shared fate vs contained fate

The shift is less about technology and more about what a single failure is allowed to touch. In a shared system, every piece of infrastructure is load-bearing for every user simultaneously. In a cell-aware system, the blast radius is a known, bounded fraction.

shared fate

One large shared system

A bad deploy or gray failure can affect every user simultaneously
Recovery means fixing it live, under full load, with every team watching
Blast radius is the entire service — 100% impact ceiling
Gray failures spread silently across all AZs through shared dependencies
Rollback is a high-stakes, all-or-nothing operation with no safe partial path

contained fate

Many independent cells

A failure is contained to one cell's traffic — a known fraction of users
Recovery is draining the cell and routing around it, not fixing production under load
Blast radius ceiling is the reciprocal of the cell count
Gray failures plateau at one cell boundary — no shared dependency to propagate through
Bad releases are caught in the first cell of a ring deploy before reaching the fleet

The same incident, two architectures: the difference is how far a fault can travel.

Source: ClimsTech Engineering

How the router and drain cycle work

The cell router is the load-bearing piece of the architecture. It holds a partition table that maps incoming requests to cells: given a user ID or tenant ID, return the cell identifier. In practice this is a hash function over a routing table, and the table is cached inside the router process so that a routing-store outage does not interrupt live traffic.

A minimal implementation in Go:

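package router

import (
    "hash/fnv"
    "sync"
)

// PartitionTable maps hash buckets to cell identifiers.
// Loaded at startup and refreshed every 30s from the config store.
type PartitionTable struct {
    Buckets [16]string // e.g. ["cell-1","cell-1","cell-2","cell-2",...,"cell-8"]
    mu      sync.RWMutex
}

// Route hashes the user ID into a bucket and returns the cell that owns it.
func (t *PartitionTable) Route(userID string) string {
    t.mu.RLock()
    defer t.mu.RUnlock()
    h := fnv.New32a()
    h.Write([]byte(userID))
    bucket := h.Sum32() % uint32(len(t.Buckets))
    return t.Buckets[bucket]
}

// DrainCell reassigns every bucket owned by cell to another cell.
// New requests stop routing to it; in-flight requests complete normally.
func (t *PartitionTable) DrainCell(cell string) {
    t.mu.Lock()
    defer t.mu.Unlock()
    for i, b := range t.Buckets {
        if b == cell {
            t.Buckets[i] = t.pickAlternate(cell, i)
        }
    }
}

// pickAlternate is an illustrative assumption: it spreads drained buckets
// across the other cells by bucket index. The caller must hold t.mu.
func (t *PartitionTable) pickAlternate(cell string, i int) string {
    var others []string
    seen := map[string]bool{cell: true}
    for _, b := range t.Buckets {
        if !seen[b] {
            seen[b] = true
            others = append(others, b)
        }
    }
    if len(others) == 0 {
        return cell // no other cell can take the traffic
    }
    return others[i%len(others)]
}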

When a cell degrades, the drain cycle follows a fixed sequence:

Cell drain procedure — should be automated as a runbook, not manual

1. Detect: An SLO breach fires for a specific cell: error rate, latency p99, or crash rate. Per-cell dashboards surface this before global aggregates wash it out. The alert should name the cell.
2. Mark draining: The cell is flagged in the partition table as draining. The router stops directing new sessions to it. In-flight requests complete on the existing connections.
3. Drain: Over a configurable window (typically 2 to 10 minutes, depending on connection hold time) remaining sessions complete or time out. The cell's traffic share approaches zero.
4. Isolate: The cell is removed from the partition table entirely. Operators can now debug it safely: replay traffic, restart services, patch config. None of this touches live traffic.
5. Restore: Once the cell passes canary traffic and smoke tests, it re-enters the partition table at reduced weight, then ramps back to full share over 10 to 20 minutes.

Source: ClimsTech Engineering
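One way to keep an automated runbook from skipping a step is to model the drain cycle as a small state machine. The sketch below is an assumption about how that could look; the state names and the `Advance` helper are hypothetical and simply mirror the five steps above.

```go
package main

import (
	"errors"
	"fmt"
)

// CellState is a hypothetical name for the lifecycle stages of a cell.
type CellState int

const (
	Active CellState = iota
	Draining
	Isolated
	Restoring
)

// next lists the only transition allowed out of each state.
var next = map[CellState]CellState{
	Active:    Draining,  // detect and mark draining
	Draining:  Isolated,  // sessions finished, cell leaves the table
	Isolated:  Restoring, // canary and smoke tests passed
	Restoring: Active,    // ramp complete, full traffic share
}

// ErrBadTransition is returned when a step of the cycle would be skipped.
var ErrBadTransition = errors.New("transition not allowed")

// Advance moves a cell to the requested state only if it is the next step.
func Advance(from, to CellState) (CellState, error) {
	want, ok := next[from]
	if !ok || want != to {
		return from, ErrBadTransition
	}
	return to, nil
}

func main() {
	state := Active
	for _, to := range []CellState{Draining, Isolated, Restoring, Active} {
		var err error
		state, err = Advance(state, to)
		if err != nil {
			fmt.Println("halt:", err)
			return
		}
		fmt.Println("cell moved to state", state)
	}
}
```

Rejecting out-of-order transitions means a cell cannot go straight from active to isolated while sessions are still attached.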

The critical design constraint: the partition table must survive independently of any individual cell. If the routing store is co-located in one cell, draining that cell tears down the router for every other cell simultaneously. The partition table belongs in a separate, highly available store — a replicated key-value service external to any cell — with the router holding a known-good snapshot in local memory as fallback. Test the fallback path in staging before you need it in production.
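The known-good snapshot can be sketched as a refresh routine that only swaps the in-memory table when the fetch succeeds. This is a minimal illustration under stated assumptions: `SnapshotRouter` and the `fetch` callback are hypothetical names, and a real router would also persist the snapshot and record how old it is.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errStoreDown simulates a routing-store outage in the demo below.
var errStoreDown = errors.New("routing store timeout")

// SnapshotRouter keeps the last partition table it fetched successfully.
type SnapshotRouter struct {
	mu      sync.RWMutex
	buckets []string
}

// Refresh replaces the snapshot only when fetch succeeds and returns a
// non-empty table; otherwise the previous snapshot stays in service.
func (r *SnapshotRouter) Refresh(fetch func() ([]string, error)) error {
	table, err := fetch()
	if err != nil {
		return fmt.Errorf("keeping previous snapshot: %w", err)
	}
	if len(table) == 0 {
		return errors.New("keeping previous snapshot: store returned an empty table")
	}
	r.mu.Lock()
	r.buckets = append([]string(nil), table...) // copy so callers cannot mutate it
	r.mu.Unlock()
	return nil
}

// Snapshot returns a copy of the table currently used for routing.
func (r *SnapshotRouter) Snapshot() []string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return append([]string(nil), r.buckets...)
}

func main() {
	r := &SnapshotRouter{}
	_ = r.Refresh(func() ([]string, error) { return []string{"cell-1", "cell-2"}, nil })
	err := r.Refresh(func() ([]string, error) { return nil, errStoreDown })
	fmt.Println("refresh error:", err)
	fmt.Println("still routing with:", r.Snapshot())
}
```

Treating an empty table as a failure is a deliberate choice here: a store that answers with nothing should not be able to erase a working routing table.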

For Kubernetes-based deployments, an OpenResty cell-aware ingress reads drain state from a Redis config store with a local fallback:

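# OpenResty (nginx + Lua): routes by X-User-ID header, reads the user's cell from Redis.
# Assumes `lua_shared_dict cell_buckets 10m;` in the http block and a `resolver`
# directive, which nginx needs to resolve a proxy_pass target built from a variable.
set $target_cell "cell-1";   # default cell if every lookup fails
access_by_lua_block {
  -- Cosockets (and so resty.redis) are not available in set_by_lua*,
  -- so the lookup runs in the access phase and writes the nginx variable.
  local user_id = ngx.req.get_headers()["X-User-ID"] or ""
  local redis   = require "resty.redis"
  local red     = redis:new()
  red:set_timeout(5)   -- 5ms; fall back to local table if routing store is slow
  local cell
  local ok = red:connect("cell-routing-store.infra.svc", 6379)
  if ok then
    local res, err = red:hget("cell_routes", user_id)
    if res and res ~= ngx.null then cell = res end   -- a missing key comes back as ngx.null
    if not err then red:set_keepalive(10000, 32) end -- return the connection to the pool
  end
  -- last-known-good snapshot baked into shared dict at startup
  cell = cell or ngx.shared.cell_buckets:get(user_id)
  if cell then ngx.var.target_cell = cell end
}
proxy_pass http://$target_cell.svc.cluster.local:8080;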

Real implementations add weighted routing for gradual traffic shifts during drains, circuit-breaker logic if the routing store is slow rather than fully unavailable, and an audit log of every partition-table mutation. The above is the conceptual skeleton.

What failure containment looks like in numbers

A worked example makes the math concrete. Suppose your service runs 8 cells serving 400,000 active users. Each cell owns roughly 50,000 users.

A bad config deploy reaches cell 4 before an alert fires:

Without cells: 400,000 users are degraded. Recovery requires a full rollback under load; MTTR is typically 30 to 90 minutes depending on pipeline speed and incident chaos.
With 8 cells: 50,000 users (12.5%) are degraded. Cell 4 is drained in approximately 5 minutes. The remaining 350,000 users never see the failure. Operators debug the bad config on an isolated cell with no production pressure.

The difference is not only the fraction affected — it is the character of the incident. Draining a cell is a mechanical runbook step. Fixing a production system under full load while a fraction of calls are globally failing is a high-pressure operation that produces follow-on mistakes.

Maximum blast radius by cell count: impact ceiling when exactly one cell fully fails

No cells (monolithic): 100%
4 cells: 25%
8 cells: 12.5%
16 cells: 6.25%
32 cells: ~3%

Source: AWS re:Invent 2024, ARC312 (ClimsTech derivation)

The diminishing-returns pattern is clear. Going from 1 cell to 4 drops the ceiling from 100% to 25%. Going from 16 to 32 moves it from 6.25% to 3.1%. At some point the operational cost of running and observing each additional cell exceeds the reliability gain. Most teams find the inflection point somewhere between 8 and 16 cells, where the blast radius has dropped below a threshold that aligns with SLA obligations.

DORA's 2024 State of DevOps report puts the baseline in perspective: approximately 19% of surveyed organizations reached elite status, characterized by on-demand deployment frequency, a change failure rate around 5%, and MTTR under one hour. Cell architecture both enables those numbers — ring deploys catch bad releases in cell 1 before the fleet is exposed — and demands them. You need a sub-hour MTTR to drain and restore cells with operational confidence.

DORA 2024: elite performer benchmarks

Deploy frequency: on-demand (elite tier)
Change failure rate: ~5% (elite tier)
MTTR: under 1 hour (elite tier)
Teams at elite: ~19% of surveyed orgs

Source: DORA State of DevOps, 2024

Slack's 1.5-year migration: what actually happened

Slack's move to cell-based architecture on AWS is one of the best-documented production adoptions. Paul Rapa, Senior Staff Software Engineer at Slack, presented it at AWS re:Invent 2024 (ARC335); Slack's original engineering write-up had already been covered in detail by InfoQ in January 2024. The migration took approximately 1.5 years and covered most critical user-facing services.

The trigger was precise: a networking outage in a single AWS availability zone caused service degradation that spread across all AZs through shared dependencies. The failure was not contained because there was nothing to contain it — the shared architecture had no blast radius boundary. A partial fault in one AZ became a global incident.

A networking outage in a single availability zone spread to all AZs through shared dependencies. Cell isolation gives a blast radius boundary that stops the propagation.

— AWS re:Invent 2024 ARC335 — Slack cell-based architecture (Paul Rapa)

After migration, Slack can drain traffic away from a degraded availability zone within approximately 5 minutes. That figure is the concrete engineering output of the decision. It is not about code throughput or deploy cadence — it is about how quickly the blast radius can be closed once an incident is detected.

Three aspects of the Slack migration are worth studying closely:

Incremental rollout, not a big-bang rewrite. Critical user-facing services moved first; less sensitive workloads followed. The first cell-aware service proved the drain mechanism worked under real traffic before the rest of the fleet depended on it. You cannot confidently migrate everything to cells if you have never actually drained one in production.

The routing table became a first-class artifact. Every service that onboarded to cells had to be wired to read the partition table and respect drain state. This is a real engineering cost — not a configuration toggle. Services that assume they can call any upstream freely must be audited for hidden cross-cell dependencies before they participate safely in a drain.

Gray failures were the explicit target. The stated goal was handling partial faults, not improving throughput or reducing infrastructure cost. This matches the pattern we observe broadly: teams that invest in cell architecture are almost always responding to an incident where health checks missed a spreading partial fault and the blast radius ended up larger than it should have been.

What cells actually cost you — and when to build them

Cells are a reliability investment with real operational overhead. The question is whether the cost of an outage at your scale justifies what you are paying to run and maintain the cell layer.

| Criterion | Lean toward cells | Skip cells for now |
|---|---|---|
| Cost of a full-service outage | Directly measured and significant: revenue, SLA penalties, trust damage | Modest; engineering time to remediate is the primary cost |
| User base | Tens of thousands or more; partition keys are meaningful | Small enough that a single deployment unit is manageable |
| Infra team capacity | Can own a routing layer, cell provisioning automation, per-cell observability | Thin team; cells add more toil than they remove |
| Deploy pipeline maturity | CI/CD is fully automated; ring deploys are feasible | Manual or semi-manual deploys; cells multiply the manual steps |
| Dependency isolation | Auth, DB, cache can be replicated per-cell without prohibitive cost | Tight coupling to a single global database makes per-cell isolation impractical |
| Data residency requirements | User data must stay in a specific region or AZ; cells enforce the boundary naturally | No residency requirements; a global data model is acceptable |

The decision framework is blunt: if your outage cost is real and measurable, and your team can own the routing layer and automation, cells pay. If you are a small team shipping fast, the overhead is pure tax on your velocity and will be resented — rightly. The pattern earns its cost at the scale where a one-hour outage carries a dollar figure, not before.

Six pitfalls teams hit in production

1. The routing table becomes a single point of failure

The partition table that maps users to cells is now the one component whose outage takes down every cell simultaneously. Teams frequently build the routing store as an afterthought — a single Redis instance, or a database row the router reads on every request.

Fix: give the routing store at least the same availability SLO as your busiest cell. Read-through caching in the router process, with a known-good local snapshot as fallback, prevents a routing-store outage from cascading. Test the fallback path in staging before you need it. A routing store that has never been failed in a drill is a routing store you cannot trust.

2. Shared dependencies defeat the isolation

If every cell calls the same authentication service, the same secrets manager, the same DNS record backed by a single host — you have cells that share fate through those dependencies. A fault in the shared component affects all cells simultaneously, exactly as it would in a non-cell architecture.

Fix: audit every external call a cell makes. Classify each as cell-local (per-cell replica or sidecar, fully isolated) or genuinely global shared (unavoidable). For global shared dependencies, define the degraded behavior: can cells fall back to cached credentials for a short window? Can they continue serving from a stale config? Auth services in particular should carry per-cell local caches that survive a short external outage.
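The "cached credentials for a short window" fallback could look roughly like the sketch below. Everything here is illustrative: `AuthCache`, its two durations, and the `verify` callback standing in for the shared auth service are assumptions, not a description of any particular auth system.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errAuthDown = errors.New("auth service unreachable")

// start is an arbitrary reference time used by the demo below.
var start = time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)

type cachedDecision struct {
	allowed   bool
	fetchedAt time.Time
}

// AuthCache is a hypothetical per-cell cache in front of a shared auth service.
// It is not safe for concurrent use; a real one would add locking.
type AuthCache struct {
	entries  map[string]cachedDecision
	fresh    time.Duration // age below which the cache answers directly
	staleFor time.Duration // extra grace used only while auth is unreachable
}

func NewAuthCache(fresh, staleFor time.Duration) *AuthCache {
	return &AuthCache{entries: map[string]cachedDecision{}, fresh: fresh, staleFor: staleFor}
}

// Check returns the decision for token. verify stands in for the remote call.
func (c *AuthCache) Check(token string, now time.Time, verify func(string) (bool, error)) (bool, error) {
	e, ok := c.entries[token]
	if ok && now.Sub(e.fetchedAt) <= c.fresh {
		return e.allowed, nil
	}
	allowed, err := verify(token)
	if err == nil {
		c.entries[token] = cachedDecision{allowed: allowed, fetchedAt: now}
		return allowed, nil
	}
	if ok && now.Sub(e.fetchedAt) <= c.fresh+c.staleFor {
		return e.allowed, nil // auth is down: reuse the stale decision
	}
	return false, fmt.Errorf("no usable cached decision: %w", err)
}

func main() {
	c := NewAuthCache(time.Minute, 5*time.Minute)
	up := func(string) (bool, error) { return true, nil }
	down := func(string) (bool, error) { return false, errAuthDown }
	fmt.Println(c.Check("tok", start, up))                       // verified and cached
	fmt.Println(c.Check("tok", start.Add(3*time.Minute), down))  // stale but inside the grace window
	fmt.Println(c.Check("tok", start.Add(10*time.Minute), down)) // grace window expired
}
```

The grace window is a security tradeoff as much as an availability one: a revoked token stays usable for up to that long during an auth outage, so it should be kept short.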

3. Hot cells from uneven partition keys

Hashing user IDs to cells assumes users generate roughly equal load. In practice, some users generate orders of magnitude more traffic than others. A single high-volume user can saturate one cell while others run at low utilization.

Fix: instrument per-cell load from day one: RPS, CPU utilization, error rate, memory pressure. AWS's shuffle sharding technique, described by Colm MacCarthaigh for Route 53, assigns each customer a random subset of backend instances drawn from a larger pool rather than a fixed dedicated set, which makes it unlikely that two high-traffic users share the same full set of instances. For multi-tenant platforms, give your largest tenants dedicated cells before they become hot enough to affect the shared cells.
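A minimal sketch of the shuffle-sharding idea follows, assuming a pool of numbered backend instances and a tenant ID as the seed. The function names are hypothetical, and seeding `math/rand` from an FNV hash is just one simple way to make the assignment deterministic; it is not how any AWS service is documented to do it.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// shardFor returns shardSize distinct instance indexes out of poolSize,
// chosen pseudo-randomly but deterministically from the tenant ID.
func shardFor(tenantID string, poolSize, shardSize int) []int {
	if poolSize <= 0 || shardSize <= 0 {
		return nil
	}
	if shardSize > poolSize {
		shardSize = poolSize
	}
	h := fnv.New64a()
	h.Write([]byte(tenantID))
	rng := rand.New(rand.NewSource(int64(h.Sum64())))
	return rng.Perm(poolSize)[:shardSize]
}

// overlap counts the instances two shards have in common.
func overlap(a, b []int) int {
	seen := make(map[int]bool, len(a))
	for _, i := range a {
		seen[i] = true
	}
	n := 0
	for _, i := range b {
		if seen[i] {
			n++
		}
	}
	return n
}

func main() {
	a := shardFor("tenant-a", 8, 2)
	b := shardFor("tenant-b", 8, 2)
	fmt.Println(a, b, "shared instances:", overlap(a, b))
}
```

With 8 instances and shards of 2 there are 28 possible shards, so two tenants rarely share both instances; a noisy tenant then degrades at most part of another tenant's capacity.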

4. Cross-cell reads are deferred until they force a rework

Some operations are inherently cross-cell: global search, analytics dashboards, user-timeline aggregations that span data regardless of which cell it originated in. Teams frequently defer the cross-cell read problem until a product requirement forces a retrofit under pressure.

Fix: decide the cross-cell read story before building the first cell. The common options are an async replication pipeline (data from every cell flows to a read-optimized global store), a fan-out query layer (query all cells in parallel, merge at the application layer), or accepting that certain queries are cell-local only. None of these is cost-free. The wrong choice is having no decision.
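Of the three options, the fan-out query layer is the easiest to sketch. The version below is an illustration under assumptions: `queryFn` is a hypothetical stand-in for a per-cell search call, and a production version would add per-cell timeouts and pagination.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"sync"
)

// errCellDown simulates an unreachable cell in the demo below.
var errCellDown = errors.New("cell unreachable")

// queryFn stands in for a per-cell query call; the signature is hypothetical.
type queryFn func(cell, term string) ([]string, error)

// fanOut queries every cell in parallel and merges whatever came back.
// Cells that fail are listed in failed so callers can flag partial results.
func fanOut(cells []string, term string, q queryFn) (results, failed []string) {
	var (
		mu sync.Mutex
		wg sync.WaitGroup
	)
	for _, cell := range cells {
		wg.Add(1)
		go func(cell string) {
			defer wg.Done()
			hits, err := q(cell, term)
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				failed = append(failed, cell)
				return
			}
			results = append(results, hits...)
		}(cell)
	}
	wg.Wait()
	sort.Strings(results) // goroutines finish in any order, so sort for stable output
	sort.Strings(failed)
	return results, failed
}

func main() {
	q := func(cell, term string) ([]string, error) {
		if cell == "cell-2" {
			return nil, errCellDown
		}
		return []string{cell + ":" + term}, nil
	}
	res, failed := fanOut([]string{"cell-1", "cell-2", "cell-3"}, "invoice", q)
	fmt.Println("results:", res, "failed cells:", failed)
}
```

Returning the failed cells instead of an error keeps the query useful while a cell is draining, at the cost of callers having to handle partial results explicitly.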

5. Ring deploys get skipped under deadline pressure

Ring deploys are the main delivery benefit of cells. But they only work if the deployment pipeline actually enforces cell boundaries — deploying one cell, soaking, checking SLOs, then progressing. If the team shortcuts to "deploy all cells" under deadline, the blast-radius protection exists in theory only.

Fix: make the ring deploy the only permitted deployment path for production. Gate progression between cells on automated SLO checks — error rate, latency p99, crash rate — with a mandatory soak window. Manual gates get bypassed under pressure. Automated gates do not.

6. Per-cell observability is bolted on late

Aggregated metrics hide the cell that is quietly failing. With 8 evenly loaded cells at a 0.1% baseline error rate, one cell climbing to 4% lifts the global rate only to about 0.6%, which can sit below a global alert threshold. You miss the early drain window and discover the problem only after the impact has grown.

Fix: tag all metrics with the cell identifier from the first day cells exist in production. Create per-cell SLO dashboards and alert on individual cell health, not just global aggregates. The alert "cell-4 error rate is 3x baseline" is what turns a potential incident into a five-minute drain. Without it, you are flying without instruments.
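The masking effect is easy to reproduce with made-up numbers. The sketch below assumes 8 cells with 10,000 requests each, seven at 0.1% errors and one at 4%; the types and function names are hypothetical.

```go
package main

import "fmt"

// CellStats holds request and error counts for one cell over a window.
type CellStats struct {
	Cell     string
	Requests int
	Errors   int
}

// globalRate is the error rate an aggregate dashboard would show.
func globalRate(cells []CellStats) float64 {
	var req, errs int
	for _, c := range cells {
		req += c.Requests
		errs += c.Errors
	}
	if req == 0 {
		return 0
	}
	return float64(errs) / float64(req)
}

// outliers returns cells whose own error rate exceeds factor times baseline.
func outliers(cells []CellStats, baseline, factor float64) []string {
	var out []string
	for _, c := range cells {
		if c.Requests == 0 {
			continue
		}
		if float64(c.Errors)/float64(c.Requests) > baseline*factor {
			out = append(out, c.Cell)
		}
	}
	return out
}

// demoCells builds 8 cells at 0.1% errors, then pushes cell-4 to 4%.
func demoCells() []CellStats {
	cells := make([]CellStats, 0, 8)
	for i := 1; i <= 8; i++ {
		cells = append(cells, CellStats{Cell: fmt.Sprintf("cell-%d", i), Requests: 10000, Errors: 10})
	}
	cells[3].Errors = 400
	return cells
}

func main() {
	cells := demoCells()
	fmt.Printf("global error rate: %.3f%%\n", 100*globalRate(cells))
	fmt.Println("cells above 3x baseline:", outliers(cells, 0.001, 3))
}
```

The aggregate stays under 1% while one cell is at 40 times its baseline, which is why the alert has to be evaluated per cell.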

Ring releases: deploying safely cell by cell

The ring deploy converts cells from a reliability mechanism into a delivery safety net. Rather than deploying a new version to all cells simultaneously, you progress sequentially: cell 1 first, observe it against production SLOs for a soak window, then proceed to cell 2, and so on. A bad release is caught in cell 1 before it has ever touched cells 2 through 8.

Ring deploy across 8 cells — a bad release is caught in cell 1, not cell 8

1. Deploy cell 1: New version rolled out to cell 1 only. All other cells continue serving the previous version and handle the full remaining traffic without interruption.
2. Soak window: Automated check of error rate, p99 latency, and crash rate in cell 1 against the pre-deploy baseline. Typically 10 to 30 minutes, depending on the traffic volume needed for statistical confidence.
3. Gate check: If metrics are within SLO thresholds, proceed automatically to cell 2. If any SLO fires, halt the ring and roll cell 1 back to the previous version. At most 1 in N users were ever exposed to the bad release.
4. Progress the ring: Repeat for cells 2 through N, with identical gate checks at each boundary. Soak windows can be shortened for later cells once earlier cells have validated the release under real traffic patterns.
5. Complete: All cells are on the new version. The last cell received the deploy after N-1 cells had already validated it against real production behavior. A release that trips the gate in cell 1 exposes at most 1 in N users; a bug that passes every gate is now in every cell, which is why the soak window and gate metrics matter.

Source: ClimsTech Engineering
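The ring procedure above reduces to a loop that stops at the first failed gate. The sketch below is an assumption about how an orchestrator could be shaped; `deploy`, `gate`, and `rollback` are placeholder callbacks for real pipeline steps, and the soak window is assumed to happen inside `gate`.

```go
package main

import "fmt"

// ringDeploy deploys to cells in order and stops at the first failed gate.
// It returns the cells left on the new version and the cell that halted
// the ring, or "" if the ring completed.
func ringDeploy(cells []string, deploy func(string), gate func(string) bool, rollback func(string)) (done []string, haltedAt string) {
	for _, cell := range cells {
		deploy(cell)
		if !gate(cell) { // soak window and SLO comparison happen inside gate
			rollback(cell)
			return done, cell
		}
		done = append(done, cell)
	}
	return done, ""
}

func main() {
	cells := []string{"cell-1", "cell-2", "cell-3", "cell-4"}
	done, halted := ringDeploy(
		cells,
		func(c string) { fmt.Println("deploying to", c) },
		func(c string) bool { return c != "cell-3" }, // simulate an SLO breach in cell-3
		func(c string) { fmt.Println("rolling back", c) },
	)
	fmt.Println("on new version:", done, "halted at:", halted)
}
```

This sketch rolls back only the cell that failed its gate. Whether the cells that already passed should also be rolled back is a policy decision the procedure above leaves open.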

The gate check belongs in your CI pipeline as an automated step, not a human watching a dashboard. A shell-script gate using Prometheus:

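#!/usr/bin/env bash
# gate-check.sh: exits non-zero if the error rate exceeds the threshold, halting the ring deploy
set -euo pipefail

CELL="${1:?usage: gate-check.sh CELL_NAME}"
PROM_URL="${PROMETHEUS_URL:?PROMETHEUS_URL env var required}"
THRESHOLD="${ERROR_THRESHOLD:-0.02}"  # 2% default; override per service baseline

# sum() collapses per-instance and per-status series so the ratio is a single
# number; "or vector(0)" makes a cell with no 5xx series report 0 errors.
QUERY="(sum(rate(http_requests_total{cell=\"${CELL}\",status=~\"5..\"}[5m])) or vector(0)) \
  / sum(rate(http_requests_total{cell=\"${CELL}\"}[5m]))"

# An empty result means the cell has no request data; fail closed.
RATE=$(curl -sf "${PROM_URL}/api/v1/query" \
  --data-urlencode "query=${QUERY}" \
  | jq -r '.data.result[0].value[1] // empty')
if [ -z "${RATE}" ]; then
  echo "GATE FAIL: no request data for ${CELL}" >&2
  exit 1
fi

# Values are passed as arguments rather than interpolated into the Python source.
python3 -c '
import sys
cell, rate, threshold = sys.argv[1], float(sys.argv[2]), float(sys.argv[3])
if not rate <= threshold:  # also fails on NaN (zero traffic in the window)
    print(f"GATE FAIL: {cell} error rate {rate:.2%} exceeds threshold {threshold:.0%}")
    sys.exit(1)
print(f"GATE PASS: {cell} error rate {rate:.2%}")
' "${CELL}" "${RATE}" "${THRESHOLD}"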

The threshold (2% here) should be derived from your actual production baseline plus a margin — not chosen from intuition. A service that normally runs at 1.8% errors will false-positive on a 2% gate constantly, leading teams to raise or disable the gate. Measure your baseline for a week before you set the threshold.
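One way to turn a week of baseline measurements into a threshold is mean plus a few standard deviations. That rule is an assumption chosen for illustration, not something the gate script prescribes; percentile-based margins are an equally reasonable choice. The function name is hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// gateThreshold derives an error-rate threshold from baseline samples as
// mean plus k population standard deviations. With no samples it returns 0,
// which makes any non-zero error rate fail the gate.
func gateThreshold(samples []float64, k float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	var sum float64
	for _, s := range samples {
		sum += s
	}
	mean := sum / float64(len(samples))
	var sq float64
	for _, s := range samples {
		sq += (s - mean) * (s - mean)
	}
	std := math.Sqrt(sq / float64(len(samples)))
	return mean + k*std
}

func main() {
	// Hypothetical daily error rates for a service that normally runs near 1.8%.
	week := []float64{0.017, 0.018, 0.019, 0.018, 0.020, 0.016, 0.018}
	fmt.Printf("suggested gate threshold: %.4f\n", gateThreshold(week, 3))
}
```

For a service like the one in the demo, the derived threshold lands above the fixed 2% default, which avoids the constant false positives described above.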

Cells without ring deploys are reliability infrastructure that does not protect your delivery pipeline. Ring deploys without per-cell observability cannot detect the problems they are checking for. Cell isolation, per-cell metrics, and automated ring deploy gates compose into the system. Any two without the third leaves a gap that will matter in the worst moment.

The first time a ring deploy halts in cell 1 and prevents a bad release from reaching the rest of the fleet, the architecture has paid for itself. Build toward that first catch. Start with one workload, one routing layer, one automated drain procedure. Prove it works under real traffic. Expand from there.

What to remember

Cells buy down blast radius to a known fraction of traffic — specifically, the reciprocal of the cell count. 8 cells means a single-cell failure touches at most 12.5% of users.
Gray failures spread through shared components. Cells stop the spread at a cell boundary — the failure plateaus rather than propagating across availability zones.
The routing table and drain automation are the load-bearing pieces. A routing table that is itself a single point of failure negates the isolation. Automate the drain or the pattern does not work under operational pressure.
Ring deploys turn cells into a delivery safety net: a bad release is caught in cell 1, exposing 1/N of users, before reaching the fleet. Gate checks must be automated — manual gates get skipped.
Slack migrated most critical user-facing services over approximately 1.5 years and achieved a 5-minute AZ drain window. The trigger was a gray failure that spread through shared dependencies across availability zones (AWS re:Invent 2024, ARC335 — Paul Rapa).
Six production failure modes to watch: routing table SPOF, shared dependencies that defeat isolation, hot cells from uneven traffic distributions, unplanned cross-cell reads, ring deploys that get bypassed under deadline pressure, and per-cell observability added too late.
Cells earn their cost when an outage carries a measurable dollar figure and the team can own the routing layer and automation. They are not a default pattern for every service.
Start with one cell-aware workload. Prove the drain works under real traffic. Let the first contained incident teach you what you missed before expanding the architecture.
