SLOs and error budgets: turning reliability into a number

"Should we ship the feature or fix the flakiness?" Without numbers, that debate is decided by whoever holds the most political capital in the room. SLOs and error budgets replace the politics with a rule everyone agreed to in advance — and the rule has teeth precisely because it was written when nobody was under pressure.

Here is the sharper claim this article defends: SLOs are not a monitoring configuration. They are a decision framework. The Prometheus rules and Grafana dashboards are incidental; the real value is that they shift reliability conversations from opinion to policy, from escalation to pre-agreed consequence. If you install SLOs without the policy, you have expensive dashboards and nothing else.

DORA 2024 elite-tier benchmarks

| Metric | Elite tier |
|--------|------------|
| Deploy frequency | On-demand |
| Lead time for changes | Under 1 hour |
| Time to restore service | Under 1 hour |
| Change failure rate | ~5% |

Source: DORA Accelerate State of DevOps Report, 2024

Only 19% of teams reached elite tier in the 2024 DORA survey, and the high-performance cluster shrank from 31% to 22% year-over-year while the low-performance cluster grew from 17% to 25%. The gap between elite and average is not primarily tooling. It is process discipline — and SLOs are one of the few process interventions with a clear, auditable feedback loop.

SLI, SLO, SLA: the precise hierarchy

These three terms get conflated constantly. The distinctions matter operationally.

An SLI (Service Level Indicator) is a quantitative measurement of a service property. It is always expressed as a ratio: a count of good events divided by a count of total events, over a time window. That ratio is the only form that composes cleanly into an error budget. Averages and raw percentiles do not.

An SLO (Service Level Objective) is the internal target range for an SLI. "99.9% of HTTP requests will return a 2xx or 3xx response in under 300 ms, measured over a rolling 28-day window." It is internal — no contractual obligation, no financial penalty. That matters: internal targets can be honest. If you tie SLOs to bonuses or penalties before you know what your service actually delivers, you get sandbagged targets driven by risk aversion rather than measurement.

An SLA (Service Level Agreement) is a contractual commitment to a customer, typically covering availability only, and almost always set deliberately looser than your actual SLO so you have a buffer (for illustration, a 99.5% SLA backed by a 99.9% internal SLO). Breaching an SLA triggers penalties: credits, escalations, audit rights. Breaching an SLO triggers internal action. They are different levers.

| Term | Audience | Enforced by | Typical tightness |
|------|----------|-------------|-------------------|
| SLI | Engineering | Metric pipeline | None (measurement only) |
| SLO | Eng + Product | Error budget policy | Honest internal target |
| SLA | Customers | Contract + legal | Looser than SLO by design |

The common mistake is inverting this — writing an SLA first and then setting the SLO to match. You end up with targets you cannot relax without a contract amendment, targets driven by sales conversations rather than measurement data.

Choose SLIs the user actually feels

The canonical SLI types from the Google SRE Workbook cover most services:

Availability: proportion of requests that succeed (not returning 5xx or timing out).
Latency: proportion of requests served within a latency threshold (e.g., under 300 ms). This is a ratio SLI, not an average — averages hide long-tail misery for the worst-served users.
Throughput: proportion of time the service sustains a minimum request rate without shedding load.
Correctness: proportion of responses containing valid data. Harder to measure, critical for data pipelines and APIs where a 200 response can still be wrong.
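
The availability and latency definitions above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the `Request` type, the 300 ms threshold default, and the sample data are all assumptions made up for the example, and timeouts are not modelled separately from 5xx responses.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code
    latency_ms: float  # time taken to serve the request


def availability_sli(requests: list[Request]) -> float:
    """Good events (non-5xx responses) divided by total events."""
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float = 300.0) -> float:
    """Proportion of requests served within the latency threshold."""
    good = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return good / len(requests)


# Hypothetical sample: 95 fast requests and 5 very slow ones, one of which failed.
sample = (
    [Request(200, 100.0)] * 95
    + [Request(200, 3000.0)] * 4
    + [Request(503, 3000.0)]
)

print(f"availability SLI: {availability_sli(sample):.2%}")
print(f"latency SLI:      {latency_sli(sample):.2%}")
print(f"mean latency:     {mean(r.latency_ms for r in sample):.0f} ms")
```

In this sample the mean latency sits under the 300 ms threshold even though 5% of requests took three seconds, which is the long-tail effect the ratio form is meant to expose.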

The practical rule: if a metric going bad would not cause a user to notice or complain, it probably is not a good SLI. CPU utilisation, pod count, and GC pause time are all useful for debugging, but they are internal proxies, not SLIs. Expose only the SLIs that directly represent user experience; track the proxies separately as diagnostic signals.

For request/response services, availability and latency together cover most cases. For asynchronous pipelines (message queues, batch jobs), correctness and freshness — the proportion of data processed within a defined lag — matter more. For storage systems, durability and read/write latency are the primary SLIs.

Resist the urge to have more than four or five SLOs per service. A wall of SLOs means every incident fires multiple simultaneous breaches, nobody knows which one to prioritise, and the whole framework gets quietly ignored. One to three SLOs per service is the practical operating range for teams that actually act on them.

The error budget: worked numeric example

If your SLO is 99.9% over a 28-day window, your error budget is the 0.1% you are explicitly allowed to fail. At 99.9%, on a service receiving 1 million requests in that window, the math is:

Error budget (requests) = 1,000,000 × 0.001 = 1,000 errors per window
Error budget (time)     = 28 days × 24 h × 60 min × 0.001 ≈ 40 minutes per window

That is not a lot of margin. A single 45-minute outage blows through the entire window's budget in one event. Here is how the numbers shift as you tighten the SLO (the table uses a 30-day month, which is why 99.9% shows ~43 minutes rather than ~40):

| SLO | Monthly error budget (30 days) | Weekly error budget (7 days) |
|-----|--------------------------------|------------------------------|
| 99.0% | ~7 h 12 min | ~1 h 41 min |
| 99.5% | ~3 h 36 min | ~50 min |
| 99.9% | ~43 min | ~10 min |
| 99.95% | ~22 min | ~5 min |
| 99.99% | ~4 min | ~1 min |
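
The budget arithmetic is simple enough to check by hand, but a short script makes it easy to rerun for other targets and window lengths. This is a sketch; the function names are illustrative, and the time budget assumes every bad minute is a total outage.

```python
def error_budget_minutes(slo: float, window_days: float) -> float:
    """Minutes of full downtime the SLO tolerates over the window."""
    return window_days * 24 * 60 * (1 - slo)


def error_budget_requests(slo: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates out of total_requests."""
    return total_requests * (1 - slo)


for slo in (0.99, 0.995, 0.999, 0.9995, 0.9999):
    monthly = error_budget_minutes(slo, 30)
    weekly = error_budget_minutes(slo, 7)
    print(f"{slo:.2%}  30-day: {monthly:7.1f} min   7-day: {weekly:6.1f} min")
```

Calling `error_budget_minutes(0.999, 28)` gives the 28-day figure used in the worked example, and `error_budget_requests(0.999, 1_000_000)` gives the request-count form.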

This table is the single most useful artefact in an SLO conversation with product management. When a product manager asks for "four nines," they are asking for four minutes of monthly tolerance. That means deployments, restarts, database migrations, and dependency upgrades all count against those four minutes. Most services are not ready for that constraint, and the table makes the tradeoff legible without requiring a reliability lecture.

One more property of the budget worth stating explicitly: it does not roll over. Burn it in week one and you spend weeks two through four in freeze mode regardless of how well the service ran afterward. That asymmetry is intentional — it incentivises front-loading reliability work rather than coasting on an early-month green dashboard.

An error budget is not a target to hit — it is a resource to spend wisely.

— Google SRE Workbook, Chapter 2

Burn rate: the signal that actually fires alerts

Raw monthly SLO compliance is a lagging indicator. If you alert only when the 28-day compliance number dips below 99.9%, you find out too late — the budget is already gone. Burn rate is the ratio at which you are currently consuming the budget relative to the rate that would exhaust it exactly at the window boundary.

A burn rate of 1 means you will exhaust the budget in exactly 28 days, which is acceptable. A burn rate of 6 means you will exhaust the budget in roughly five days: open a ticket. A burn rate of 14.4 means you will exhaust the budget in roughly two days, having spent about 2% of it in a single hour: page immediately.
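
The relationship between burn rate and time to exhaustion is a single division. The sketch below assumes a constant burn rate and a full budget at the start; the function names are illustrative.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    return error_ratio / (1 - slo)


def hours_to_exhaustion(rate: float, window_days: float = 28) -> float:
    """Hours until a full budget is gone if the burn rate stays constant."""
    return window_days * 24 / rate


for rate in (1, 6, 14.4):
    hours = hours_to_exhaustion(rate)
    print(f"burn rate {rate:>4}: {hours:6.1f} h ({hours / 24:.1f} days)")
```

For example, a service with a 99.9% SLO that is currently failing 1.44% of requests has `burn_rate(0.0144, 0.999)` of about 14.4.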

The Google SRE Workbook formalises this as a multi-window, multi-burn-rate alerting pattern. Two lookback windows per alert rule reduce false positives: the short window catches sustained fast burns; the long window provides confirmation that the burn is not a transient spike.

groups:
  - name: slo.burn_rate
    rules:
      - alert: HighBurnRate
        expr: |
          (
            (1 - avg_over_time(sli:http_availability:rate5m[1h]))
            / (1 - 0.999)
          ) > 14.4
          and
          (
            (1 - avg_over_time(sli:http_availability:rate5m[6h]))
            / (1 - 0.999)
          ) > 6
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "SLO burn rate critical — {{ $labels.job }}"
          description: >
            1h burn at {{ $value | humanize }}x. At this rate the 28-day budget
            exhausts in roughly 2 days.
 
      - alert: MediumBurnRate
        expr: |
          (
            (1 - avg_over_time(sli:http_availability:rate5m[6h]))
            / (1 - 0.999)
          ) > 6
          and
          (
            (1 - avg_over_time(sli:http_availability:rate5m[3d]))
            / (1 - 0.999)
          ) > 1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "SLO burn rate elevated — {{ $labels.job }}"
          description: >
            6h burn at {{ $value | humanize }}x. Budget exhaustion in roughly
            5 days if unchanged.

The multiplier thresholds (14.4 and 6) come from Google SRE Workbook Table 5.2 for a 99.9% SLO on a 30-day window. Adjust the denominator (1 - 0.999) to match your actual target.
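
The multipliers are not arbitrary: each one is the burn rate that consumes a chosen fraction of the budget within the alert window. The sketch below assumes the commonly cited starting points of 2% in 1 hour, 5% in 6 hours, and 10% in 3 days; treat those fractions as assumptions to tune rather than fixed constants.

```python
def burn_rate_threshold(
    budget_fraction: float,
    alert_window_hours: float,
    slo_window_days: float = 30,
) -> float:
    """Burn rate that consumes budget_fraction of the budget within the alert window."""
    return budget_fraction * slo_window_days * 24 / alert_window_hours


# (budget fraction, alert window in hours)
for fraction, window_hours in ((0.02, 1), (0.05, 6), (0.10, 72)):
    threshold = burn_rate_threshold(fraction, window_hours)
    print(f"{fraction:.0%} of budget in {window_hours} h -> burn rate {threshold:.2f}")
```

Passing `slo_window_days=28` shows how the thresholds shift slightly for a 28-day window (the 1-hour threshold becomes about 13.4 instead of 14.4).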

Burn rate severity thresholds for a 99.9% SLO on a 30-day window

| Burn rate | Severity | Time to exhaust the budget |
|-----------|----------|----------------------------|
| 1x | Neutral | 30 days |
| 6x | Warning threshold | ~5 days |
| 14.4x | Critical threshold | ~50 hours (about 2 days) |

Source: Google SRE Workbook, Chapter 5 (multi-window burn rate alerting)

Burn rate is the concept that separates teams that actually use SLOs from teams that just have them. The monthly compliance number sits in a dashboard and generates no action. A burn-rate alert fires at 2 AM on a Tuesday and puts an engineer in front of a terminal while there is still budget left to protect.

The error budget policy: where the decision framework lives

The alert fires. Now what? Without an error budget policy, the answer depends on who is awake and how loud they are willing to be. With a policy, the answer is written down, pre-agreed, and triggered automatically by the number.

A minimal policy covers four states:

Budget consumption    Action required
─────────────────────────────────────────────────────────────────────────
0% – 50%             Normal operations. Features ship at normal cadence.
50% – 75%            Engineering team notified. Next sprint gains one
                     reliability task selected from the open backlog.
75% – 99%            Release freeze for non-critical changes.
                     A reliability sprint is mandatory.
100% (exhausted)     Full freeze except P0/security fixes until budget
                     recovers. SRE leadership review within 48 hours.

Google's own error budget policy template (SRE Workbook, Appendix B) adds a fifth state for escalation to senior engineering leadership and a review of whether the SLO itself is miscalibrated. That calibration review path is important: exhausting the budget repeatedly is sometimes a signal that the SLO is set too aggressively, not that the service is poorly run.

The policy needs two owners: the SRE or platform team who measures compliance, and the product owner who controls the release calendar. Both must agree to it in writing before the first incident, not during one. A policy drafted during an outage is not a policy — it is a hostage negotiation.

Without an error budget policy:

Release freeze debated on every incident call with no prior agreement
Reliability priority set by whoever escalates loudest
SLO targets quietly abandoned when they are inconvenient to honour
Engineering and product in recurring conflict with no shared resolution mechanism

With an error budget policy:

Freeze threshold is pre-agreed: 75% consumed triggers an automatic freeze
Priority is determined by the number, not by seniority or pressure
SLO targets are enforced; calibration reviews are scheduled quarterly
Product and engineering share the same incentive: stay within budget

The policy is the mechanism that makes SLOs more than a dashboard.

Source: Google SRE Workbook, Chapter 2

Configuring SLOs: a concrete end-to-end example

This section walks through setting up an SLO for a canonical HTTP API service. The SLI is availability: 2xx/3xx responses divided by total non-OPTIONS requests, measured over a 28-day rolling window.

Step 1: Define the SLI as a Prometheus recording rule.

groups:
  - name: sli.availability
    interval: 60s
    rules:
      - record: sli:http_availability:rate5m
        expr: |
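          # Assumes http_requests_total carries a `method` label. OPTIONS is
          # excluded on both sides so the ratio cannot exceed 1; grouping by
          # job keeps the label that the alert annotations reference.
          sum by (job) (rate(http_requests_total{status=~"2..|3..", method!="OPTIONS"}[5m]))
          /
          sum by (job) (rate(http_requests_total{method!="OPTIONS"}[5m]))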

Step 2: Compute current 28-day compliance.

      - record: slo:http_availability:compliance28d
        expr: |
          avg_over_time(sli:http_availability:rate5m[28d])

Target: 0.999 (99.9%). If this drops below 0.999, the SLO is breached.

Step 3: Compute burn rate at multiple windows.

      - record: slo:http_availability:burn1h
        expr: |
          (1 - avg_over_time(sli:http_availability:rate5m[1h]))
          / (1 - 0.999)
 
      - record: slo:http_availability:burn6h
        expr: |
          (1 - avg_over_time(sli:http_availability:rate5m[6h]))
          / (1 - 0.999)

Step 4: Compute remaining budget as a fraction.

      - record: slo:http_availability:budget_remaining
        expr: |
          1 - (
            (1 - avg_over_time(sli:http_availability:rate5m[28d]))
            / (1 - 0.999)
          )

A value of 0.5 means 50% of the budget remains. A value of 0 means the budget is exhausted. Feed this metric into your policy automation — PagerDuty status page updates, Jira ticket creation, Slack webhooks — to trigger the actions defined in the policy document automatically.
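
As a rough sketch of that automation step, the function below maps the remaining-budget fraction onto the four policy states described earlier. The state names are hypothetical labels, and how the value is fetched from Prometheus or delivered to a ticketing or chat system is left out on purpose.

```python
def policy_action(budget_remaining: float) -> str:
    """Map remaining budget (1.0 = untouched, 0.0 = exhausted) to a policy state."""
    consumed = 1.0 - budget_remaining
    if consumed >= 1.0:
        return "full-freeze"      # P0/security fixes only, leadership review
    if consumed >= 0.75:
        return "release-freeze"   # non-critical changes frozen, reliability sprint
    if consumed >= 0.50:
        return "notify"           # add one reliability task to the next sprint
    return "normal"               # features ship at normal cadence


for remaining in (0.9, 0.5, 0.2, 0.0, -0.3):
    print(f"budget_remaining={remaining:+.1f} -> {policy_action(remaining)}")
```

A negative input means the budget is overspent, which the function treats the same as exhausted.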

End-to-end SLO setup: from SLI definition to policy trigger

1. Define the SLI. Write a ratio metric: good events divided by total events over a rolling window. Use counters, not gauges; counters compose correctly into rate calculations.
2. Set the SLO target. Pick a target you can actually honour based on recent error rates. Start at your last-90-days baseline minus 10% headroom. Do not start with an aspirational number.
3. Compute the budget. Error budget = 1 minus the SLO. Express in both time and request counts so product management can reason about tradeoffs without a reliability lecture.
4. Configure burn rate alerts. Multi-window alerting: fast burn (1h + 6h) for pages, medium burn (6h + 3d) for warnings. Thresholds from Google SRE Workbook Table 5.2.
5. Write and sign the policy. Pre-agree on freeze thresholds (50%, 75%, 100%) and named owners. Both engineering and product sign off before the first incident. Automate the triggers where possible.
6. Review and recalibrate. After 90 days, check whether the SLO was breached and whether the breaches correlated with user complaints. If yes, fix the service. If no, relax the SLO. Repeat quarterly.

Source: Google SRE Workbook, Chapters 2–5

Common pitfalls and how to avoid them

These are the failure modes that appear repeatedly in production SLO rollouts, not in theory.

Pitfall 1: SLOs measured with maintenance windows excluded.

If you exclude planned maintenance from your error budget calculation, you are not measuring availability — you are measuring a fiction. Users do not care whether downtime was planned. Build maintenance into your budget. This forces you to invest in zero-downtime deploys and rolling upgrades rather than relying on scheduled maintenance windows as a workaround for poor upgrade practices.

Pitfall 2: SLO targets set by gut feel, not by data.

"We should be 99.99% reliable" sounds like a responsible aspiration. Run the table above (four minutes of monthly budget), then look at your actual deployment frequency and incident history. If you deploy ten times a week and each rolling deploy takes 30 seconds to complete across all instances, that alone is up to 300 seconds of exposure per week, or roughly 21 minutes per month. That is about five times the 99.99% budget, spent through normal operations before a single incident occurs. Set targets from measurement, not aspiration.

Pitfall 3: Alerting on the monthly number instead of burn rate.

Alerting when monthly compliance drops below 99.9% is equivalent to noticing a fire when the building has already burned down. By the time the monthly number moves, the damage is done and the budget is spent. Use burn rate alerting at the 1h and 6h windows. Monthly compliance is a lagging indicator for post-incident review. Burn rate is the operational instrument that fires while there is still time to act.

Pitfall 4: One SLO for the entire service regardless of user journey severity.

A payment confirmation endpoint and a reporting dashboard in the same service probably warrant different SLOs. Payment requests failing during checkout is a P0. A report generating slowly for an admin at 11 PM is not. Mixing them into one SLO produces a target that is either too tight (the admin case pushes you into freeze constantly) or too loose (the payment case never fires actionable alerts). Segment by user journey criticality.

Pitfall 5: Ignoring the cost asymmetry of reliability investment.

A 2024 ITIC survey found that more than 90% of mid-sized and large enterprises now lose over $300,000 per hour of downtime, with 41% reporting losses between $1M and $5M per hour. Those figures make the business case for reliability investment. But they also cut the other way: if your service is an internal developer tooling dashboard with 12 users, the cost of 30 minutes of downtime is not $300,000. Match the SLO to the actual blast radius of a failure. Overengineering low-criticality services pulls investment away from the high-criticality ones that actually need it.

Pitfall 6: No SLO on dependencies.

Your SLO is only as good as the weakest SLO in your dependency chain. If you have a 99.9% SLO but your upstream database has no SLO and historically delivers 99.7%, your SLO is aspirational. A service that depends on three others, each at 99.9%, has a theoretical ceiling of 99.7% (compounded). Either obtain SLOs from your critical dependencies, model their historical error rate into your own target, or build circuit breakers and graceful degradation so dependency failures do not propagate one-for-one into user-visible errors.
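
The compounding is just a product of availabilities. The sketch below assumes the dependencies fail independently and that every one of them must succeed for a request to succeed; real systems with retries, caching, or correlated failures will deviate from it.

```python
from math import prod


def serial_availability(availabilities: list[float]) -> float:
    """Availability ceiling when every dependency in the chain must succeed."""
    return prod(availabilities)


three_deps = serial_availability([0.999, 0.999, 0.999])
with_weak_db = serial_availability([0.999, 0.999, 0.997])

print(f"three dependencies at 99.9%: {three_deps:.4%}")
print(f"one dependency at 99.7%:     {with_weak_db:.4%}")
```

The same function shows how quickly a single weaker dependency drags the ceiling below a 99.9% target.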

Pitfall 7: Treating SLO exhaustion as an engineering failure.

Sometimes a budget exhausts because the service had a bad month. Sometimes it exhausts because the SLO was calibrated too tightly for the service's current maturity level. The error budget policy should include a calibration review trigger: if the budget is exhausted two quarters in a row, the first question to answer is whether the SLO reflects what users actually need or whether it reflects what the team hoped to achieve. Adjusting a miscalibrated SLO is not failure — it is the feedback loop working as designed.

Reliability as a shared language

The most durable outcome from deploying SLOs is not better dashboards — it is a shared language between engineering and product that makes reliability tradeoffs explicit and automatic. The 2024 ITIC figure above ($300,000+ per hour for 90% of enterprises) quantifies the downtime cost that always existed. The prior cost that went unmeasured was subtler: the endless reliability meetings, the blame cycles after incidents, the product backlog that grew while engineers argued about priorities without a shared frame.

The error budget policy eliminates the argument. When the budget is healthy, features ship. When it is exhausted, reliability work takes priority. Both engineering and product agreed to this in advance, so the freeze is not a surprise or a power play — it is the rule that both teams chose under no pressure, when the service was green.

DORA's 2024 data shows that elite teams deploy on demand, restore service in under an hour, and keep change failure rates around 5%. Those outcomes are not the result of better infrastructure alone. They are the result of teams that have instrumented their services honestly, agreed on what "reliable enough" means, and built the feedback loops to enforce it — automatically, not by escalation.

Netflix's SRE practice (documented in multiple public engineering blog posts) applies exactly this model at scale: services own their SLOs, burn rate alerts trigger automated runbooks, and the error budget policy determines whether that service can accept new feature deployments that week. The SLO is not a constraint imposed by an ops team — it is a contract the service team negotiated with itself, and the budget is the resource that lets them move fast without apologising for it afterward. That distinction in ownership is what makes SLOs work at organisational scale: the team that sets the SLO is the team that lives with the consequences, so the incentive to calibrate honestly is built in.

What to remember

SLOs are a decision framework, not a monitoring configuration — the error budget policy that triggers on consumption is where the value actually lives.
Express SLIs as ratios (good events divided by total events), not as averages or percentiles — only ratios compose cleanly into an error budget.
Use multi-window burn rate alerting (1h plus 6h windows) rather than monthly compliance numbers; burn rate fires when there is still time to act.
Start SLO targets at your actual last-90-days baseline minus 10% headroom, then tighten quarterly as practices mature — never start with an aspirational number.
Write the error budget policy before the first incident and get both engineering and product to agree to it in writing — the policy only works if it was agreed under no pressure.
Model dependency error rates into your own SLO target or build graceful degradation; a service with unchecked dependency failures cannot honestly claim the SLO it publishes.
