ClimsTech
SRE & reliability · 21 May 2025

On-call that doesn't burn people out

Bad on-call is a quiet attrition machine. Good on-call is mostly quiet — and getting quieter. The difference is almost entirely about which alerts you allow.

ClimsTech Engineering · 19 min read

On-call is the reliability forcing function nobody talks about honestly. The dashboards get built, the runbooks get written, the postmortem template gets filled in — and then at 2:47 a.m. an alert fires for the fourteenth time that month for something nobody has ever fixed, and your best engineer starts mentally drafting their resignation letter. Bad on-call isn't an inevitable tax on operating software. It is a discipline failure that compounds quietly until the people carrying it leave.

The engineers who have lived through genuinely well-run on-call rotations describe the same experience: it is mostly quiet, incidents are rare and well-understood, and the pager is measurably getting quieter every quarter. That outcome is achievable on any stack. It requires treating every unnecessary page as a production bug — not weather, not the cost of doing business, but a defect that should embarrass the team until it is fixed. The teams that do this have better retention, faster MTTR, and a reliability posture that compounds over time rather than eroding under operational debt.

DORA elite reliability benchmarks

  • Deploy frequency: on-demand
  • Time to restore: <1 hour
  • Change failure rate: ~5%
  • Lead time for changes: <1 day

Source: DORA State of DevOps, 2024

The economics of a noisy pager

Before fixing alerting, it helps to understand what noisy alerting actually costs — because teams routinely undercount it and then wonder why their senior engineers are leaving.

The direct cost is MTTR degradation. When responders are trained by experience that most pages are noise, they slow down. The first instinct becomes "probably the flap again" rather than "this is real, act now." The night that instinct is wrong is the night a P0 runs for two hours while someone waits for it to self-resolve. Research cited across the observability industry puts enterprise alert volumes in the range of thousands per week for mid-size platforms, with a commonly reported estimate that only around 3% of those alerts require immediate human action (incident.io, 2024). Even treating that figure conservatively, the ratio is the problem: the time a responder spends processing noise vastly outweighs the time spent acting on signal.

The indirect cost is attrition. On-call is one of the highest-leverage factors in senior engineer retention. The 2024 DORA State of DevOps report links burnout directly to operational stressors that are resistant to mitigation even in environments with strong leadership. Pager noise and poorly scoped incidents are among the most consistent contributors. PagerDuty's 2024 State of Digital Operations study found a 16% year-over-year increase in enterprise incidents, driven in part by accelerated system complexity from AI adoption. More incidents mean more pages. If those pages are not being systematically made more actionable, the cognitive load compounds regardless of how good the tooling is.

The compounding cost is alert desensitisation. This is the most insidious form of damage. An engineering culture that has normalised noisy alerts does not just respond slowly — it stops noticing. When an alert has fired every day for three weeks without consequence, the human brain correctly learns it can be ignored. The problem is that classification happens before the brain consciously inspects the content. A real incident that superficially resembles a known noisy alert gets the same treatment as the noise. Desensitisation is how incidents that start as minor become major: the signal was there, the responder believed it was noise, the incident ran.

The cardinal rule: actionability is binary

The single most important principle in on-call design is also the easiest to state and the hardest to enforce consistently: if something pages a human, there must be a specific action that human can take right now. Not eventually. Not "keep an eye on it." Right now.

This is binary. An alert is either actionable or it is not. If it is not actionable, it has no business existing as a page. It might belong on a dashboard. It might belong in a weekly operations digest. It might belong in a ticket system. It does not belong in a pager that wakes someone up.

Two tests determine actionability before an alert reaches production:

The runbook test: Does a link to a concrete runbook exist that tells the responder exactly what to do? If you have to write "investigate" anywhere in that runbook, you have not finished it. "Check the logs" is not an action. "Check kubectl logs -n checkout -l app=checkout --since=5m for connection errors, then follow the Redis section of this runbook" is an action.

The user impact test: Is a user experiencing degraded service right now, or is this a leading indicator that will cause impact within minutes? If the answer is "we are not sure yet," it is a dashboard metric — not a page.

# BAD: no prescribed action, no user impact established
- alert: HighCpuUtilisation
  # idle share of CPU time below 20%, i.e. utilisation above 80%
  expr: avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) < 0.2
  for: 5m
  annotations:
    summary: "CPU is high on {{ $labels.instance }}"
 
# GOOD: user impact defined, action prescribed, runbook linked
- alert: CheckoutErrorRateHigh
  expr: |
    sum(rate(http_requests_total{job="checkout", status=~"5.."}[5m]))
    / sum(rate(http_requests_total{job="checkout"}[5m])) > 0.02
  for: 2m
  labels:
    severity: page
  annotations:
    summary: "Checkout 5xx above 2% (current: {{ $value | humanizePercentage }})"
    runbook_url: "https://runbooks.internal/checkout-error-rate"
    description: "Users failing at checkout. Check pod health and upstream deps per runbook. Rollback procedure in section 3."

The difference is not cosmetic. The second rule encodes a user-visible symptom (checkout errors above 2% for two minutes), a meaningful waiting period to filter transient spikes, and a direct link to the exact procedure a responder follows. The first rule encodes a measurement. Measurements belong on dashboards.
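To make the role of the waiting period concrete, here is a minimal Python sketch of the same idea: compute the error ratio at each evaluation and page only when the breach is sustained. The `Sample` type, the function names, and the choice of four consecutive evaluations are assumptions made for illustration; this is a rough analogue of `for:`, not how Prometheus implements it.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One rule evaluation: request counts over the lookback window."""
    errors: int  # requests that returned 5xx
    total: int   # all requests


def error_ratio(sample: Sample) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return sample.errors / sample.total if sample.total else 0.0


def should_page(samples: list[Sample], threshold: float = 0.02,
                required_consecutive: int = 4) -> bool:
    """Page only if every one of the most recent evaluations breached the
    threshold (hypothetical stand-in for a `for:` duration)."""
    if len(samples) < required_consecutive:
        return False
    recent = samples[-required_consecutive:]
    return all(error_ratio(s) > threshold for s in recent)


# Hypothetical data: one transient spike versus a sustained breach.
spike = [Sample(1, 1000), Sample(90, 1000), Sample(2, 1000), Sample(1, 1000)]
sustained = [Sample(30, 1000), Sample(45, 1000), Sample(50, 1000), Sample(41, 1000)]

print(should_page(spike))      # False
print(should_page(sustained))  # True
```

The spike reaches 9% on one evaluation but never pages, which is the filtering behaviour the waiting period is there to provide.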

Before: cause-based alert

  • 'CPU above 80%' — fires daily, sometimes hourly
  • Responder checks dashboard; CPU has settled; dismisses
  • No action taken — happens again tomorrow
  • Teaches the responder that pages are noise
  • Real incident buried in the same visual noise

After: symptom-based alert

  • 'Checkout error rate above 2% for 2m' — fires rarely
  • Responder has a runbook: pods, upstream deps, rollback
  • Every page corresponds to verified user pain
  • False-positive rate drops; the responder trusts the signal
  • MTTR falls because the responder acts without hesitation

Cause-based vs. symptom-based alert design
Source: SRE principles; Google Site Reliability Engineering, O'Reilly, 2016

Symptom-based alerting in practice

Alerting on what the user experiences — rather than what the system is doing internally — is the central prescription of Google's Site Reliability Engineering book and every serious observability practitioner since. It is also routinely violated in production because cause-based alerts feel safer: "if I watch every resource metric, nothing can surprise me."

The problem is that the space of internal causes is effectively unbounded, and most causes produce no user impact at all. The space of user-visible symptoms is small and everything in it matters. Here is a practical mapping:

| User symptom | Alert on | Use dashboard for (not a page) |
|---|---|---|
| Requests failing | HTTP 5xx rate, gRPC error rate | Individual pod restarts, OOM kills |
| Requests slow | p99 latency against SLO burn rate | CPU%, memory%, JVM GC pause time |
| Feature unavailable | Synthetic probe failure, health endpoint returning non-2xx | Number of replicas below desired |
| Data loss risk | Replication lag exceeding RPO threshold | Disk utilisation below 85% |
| Downstream cascade | Circuit breaker open, queue depth at consumer limit | Thread pool size, connection pool size |

The right column is full of legitimate monitoring signals. They belong on dashboards reviewed during business hours. None of them should wake anyone up unless you have empirically — not theoretically — correlated them to user impact in your specific system.

SLO-based burn rate alerting

The most principled modern approach is burn rate alerting against defined SLOs. Rather than picking an arbitrary threshold, you define a target availability (say 99.9% over 30 days), calculate the error budget that represents, and alert when that budget is burning at a rate that will exhaust it faster than acceptable.

A fast-burn alert in Prometheus recording rules and alert rules:

# Recording rule: 1-hour error rate ratio
- record: job:http_error_ratio:rate1h
  expr: |
    1 - (
      sum by (job) (rate(http_requests_total{status!~"5.."}[1h])) /
      sum by (job) (rate(http_requests_total[1h]))
    )
 
# Alert: burn rate exceeding 14.4x normal for this SLO
# At 14.4x, the 30-day budget burns at roughly 2% per hour
# Source: Google SRE Workbook, Chapter 5 — Alerting on SLOs
- alert: ErrorBudgetFastBurn
  expr: job:http_error_ratio:rate1h / (1 - 0.999) > 14.4
  for: 2m
  labels:
    severity: page
  annotations:
    summary: "SLO fast-burn: error budget burning at {{ $value | humanize }}x normal rate"
    runbook_url: "https://runbooks.internal/slo-fast-burn"

This approach has two structural advantages. First, it directly encodes user impact: error budget is defined in terms of the SLO you have committed to users. Second, it self-adjusts: as reliability improves, the budget burns slower, and pages become rarer without anyone manually adjusting thresholds.
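The arithmetic behind the 14.4x figure is easy to check by hand. The sketch below works it through in Python for the 99.9% over 30 days example; the function names are made up for illustration and the numbers follow directly from the SLO definition.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of total unavailability the SLO allows over the window."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return observed_error_ratio / (1 - slo)


def budget_consumed_per_hour(rate: float, window_days: int = 30) -> float:
    """Fraction of the whole window's budget consumed in one hour."""
    return rate / (window_days * 24)


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until the budget is gone if the burn rate holds steady."""
    return (window_days * 24) / rate


slo = 0.999
print(round(error_budget_minutes(slo), 1))       # 43.2
print(round(burn_rate(0.0144, slo), 1))          # 14.4
print(round(budget_consumed_per_hour(14.4), 3))  # 0.02
print(round(hours_to_exhaustion(14.4), 1))       # 50.0
```

In other words, a sustained 1.44% error ratio against a 99.9% target uses about 2% of the monthly budget every hour and would exhaust it in roughly 50 hours.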

Typical alert distribution in teams without triage discipline

  • Actionable pages (real user impact): ~15%
  • Valid leading indicators (ticket or log tier, not page): ~30%
  • Noise (flapping, non-actionable, or redundant): ~55%

Source: Representative pattern; cf. incident.io research, 2024; PagerDuty State of Digital Operations, 2024

The alert taxonomy: page, ticket, log

A useful organising principle from Google's SRE practice is a three-tier classification: every monitoring trigger should be explicitly assigned to one of three tiers before it touches production alerting.

Page: Requires immediate human action. User impact is occurring or imminent. Fires the pager.

Ticket: Something is degrading or will degrade, but not imminently. Goes into an issue tracker. Reviewed the next business day at latest.

Log: Informational. Emitted to observability for human review when they choose to look. Never sends a notification.

Most teams have all three tiers but never make the assignment explicit, which means the classification degrades over time. Tickets drift into pages, informational signals drift into tickets, and the pager ends up carrying traffic it was never designed to handle. Codifying the tier in alert rule labels forces the discipline:

labels:
  severity: page    # requires immediate action; wakes on-call
  # severity: ticket  # create issue; review next business day
  # severity: log     # emit to observability; no notification

The tier label should be code-reviewed. If a PR adds severity: page, the reviewer should be able to answer without hesitation: "What does the responder do in the first five minutes?" If they cannot, the tier is wrong and the PR should not merge.
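Part of that review can be automated. The following is a minimal sketch of a lint check that could run in CI, assuming the alert rules have already been parsed into Python dictionaries. The required-annotation list and the `lint_rule` helper are hypothetical conventions for this example, not something Prometheus enforces.

```python
REQUIRED_PAGE_ANNOTATIONS = ("summary", "runbook_url", "description")
VALID_TIERS = {"page", "ticket", "log"}


def lint_rule(rule: dict) -> list[str]:
    """Return human-readable problems; an empty list means the rule passes."""
    problems = []
    name = rule.get("alert", "<unnamed>")
    tier = rule.get("labels", {}).get("severity")
    if tier not in VALID_TIERS:
        problems.append(f"{name}: severity must be one of {sorted(VALID_TIERS)}, got {tier!r}")
        return problems
    if tier == "page":
        annotations = rule.get("annotations", {})
        for key in REQUIRED_PAGE_ANNOTATIONS:
            if not annotations.get(key):
                problems.append(f"{name}: page-tier rule is missing annotation '{key}'")
    return problems


# The two rules from earlier in the article, reduced to the fields being checked.
cpu_rule = {
    "alert": "HighCpuUtilisation",
    "annotations": {"summary": "CPU is high"},
}
checkout_rule = {
    "alert": "CheckoutErrorRateHigh",
    "labels": {"severity": "page"},
    "annotations": {
        "summary": "Checkout 5xx above 2%",
        "runbook_url": "https://runbooks.internal/checkout-error-rate",
        "description": "Users failing at checkout.",
    },
}

for rule in (cpu_rule, checkout_rule):
    print(rule["alert"], lint_rule(rule))
```

A check like this only proves that a runbook link exists. Whether the runbook answers the five-minute question still needs a human reviewer.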

The weekly alert triage ritual

Fixing alerting is not a one-time project. It is a weekly practice. Teams that treat it as a project fix alerting once and watch it rot back to noise over eighteen months as the system changes and new alerts accumulate without review.

The practice is simple: every week, review what paged. For each alert that fired, ask three questions:

  1. Was it real — did it represent actual user impact?
  2. Was it actionable — did the responder have a clear, prescribed action?
  3. Was the action taken — did the responder do anything, or did they dismiss and go back to sleep?

Anything that fails any of these questions is a defect to file immediately. Not a backlog item. A defect, with the same urgency classification as a production bug affecting users. Assign it. Fix it or delete it before the next review.
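The three questions translate directly into a small data structure. Below is one possible sketch, assuming each page from the week has been annotated by the responder; the `PageRecord` fields and the alert names in the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PageRecord:
    alert: str
    real: bool        # did it represent actual user impact?
    actionable: bool  # was there a clear, prescribed action?
    acted_on: bool    # did the responder actually do something?


def failed_questions(page: PageRecord) -> list[str]:
    """Names of the triage questions this page failed."""
    checks = {"real": page.real, "actionable": page.actionable, "acted_on": page.acted_on}
    return [question for question, passed in checks.items() if not passed]


def defects(pages: list[PageRecord]) -> dict[str, set[str]]:
    """Map each alert name to the set of questions it failed this week."""
    found: dict[str, set[str]] = {}
    for page in pages:
        failed = failed_questions(page)
        if failed:
            found.setdefault(page.alert, set()).update(failed)
    return found


# Hypothetical week of pages.
week = [
    PageRecord("CheckoutErrorRateHigh", real=True, actionable=True, acted_on=True),
    PageRecord("HighCpuUtilisation", real=False, actionable=False, acted_on=False),
    PageRecord("HighCpuUtilisation", real=False, actionable=False, acted_on=False),
    PageRecord("QueueDepthHigh", real=True, actionable=False, acted_on=True),
]

for alert, failed in sorted(defects(week).items()):
    print(alert, sorted(failed))
```

Every key in the result is a defect to file before the review ends; alerts that passed all three questions do not appear at all.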

Weekly on-call review cadence

  1. Pull the week's pages. Export from PagerDuty, Grafana OnCall, or equivalent. Record total pages, unique alert names, pages with no action taken (noise), and any alerts that fired more than three times.
  2. Triage each firing alert. For each unique alert: was it real, actionable, and acted upon? Any alert that fails one criterion gets a defect ticket filed before the meeting ends. The three questions take under a minute per alert.
  3. Classify each defect. Every defect is one of: (a) threshold needs tuning, (b) alert is redundant with a downstream symptom alert, (c) root cause is fixable and the alert should be deleted after the fix lands, or (d) this belongs in ticket or log tier instead.
  4. Fix or delete this sprint. Alert defects from the weekly review are first-class engineering work. Teams that defer them to 'tech debt' quarters are accepting a compounding noise problem and the attrition risk that comes with it.
  5. Track the trend line. One metric matters above all others: total pages per week, trended over a rolling quarter. It should be going down. If it is not, the triage process is not working: diagnose whether defects are being filed but not fixed, or not being filed at all.

Source: Google SRE Book; ClimsTech operational practice

A numeric illustration of what this looks like in practice: a team beginning a triage programme typically sees something like 80 pages in the first week, with perhaps 12 genuinely actionable and 68 noise. After 8 weeks of disciplined triage — fixing thresholds, deleting redundant cause-based alerts, tuning for: durations to eliminate flapping, migrating valid signals to the ticket tier — the same system typically produces 20–30 pages per week with a much higher actionable ratio. The signal-to-noise improvement matters more than the raw count reduction, because the process forces precision about what "actionable" actually means in your environment.
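Both numbers in that illustration, the actionable ratio and the direction of the weekly page count, are cheap to compute. Here is a small sketch; the weekly series is invented to resemble the pattern described above and the function names are assumptions for this example.

```python
def actionable_ratio(actionable: int, total: int) -> float:
    """Share of pages that required a real, prescribed action."""
    if total <= 0:
        raise ValueError("total pages must be positive")
    return actionable / total


def weekly_trend(pages_per_week: list[int]) -> float:
    """Least-squares slope in pages per week; negative means getting quieter."""
    n = len(pages_per_week)
    if n < 2:
        raise ValueError("need at least two weeks of data")
    mean_x = (n - 1) / 2
    mean_y = sum(pages_per_week) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(pages_per_week))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var


# First week from the illustration: 12 actionable pages out of 80.
print(actionable_ratio(12, 80))  # 0.15

# Hypothetical eight-week series during a triage programme.
pages = [80, 74, 66, 57, 49, 41, 33, 27]
print(weekly_trend(pages) < 0)   # True
```

A fitted slope is less jumpy than comparing this week with last week, which matters because a single bad incident can spike one week without changing the underlying direction.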

Making the rotation humane

Alert quality is necessary but not sufficient. A rotation that asks too much of too few people burns them out even with excellent alerting, because the cognitive cost of being on call — sleeping near the phone, fragmenting personal time, the background hum of availability — accumulates independent of how often the pager actually fires.

Google's SRE book articulates an operational target that remains the best-known concrete bound: on-call engineers should handle no more than two significant incidents per 12-hour shift. Beyond that, response quality degrades — runbooks get skipped, shortcuts get taken, post-incident work gets deferred until it is too deep to bother with. This is not about engineer resilience; it is about the cognitive requirements of high-quality incident response. A tired responder taking shortcuts creates the conditions for the next incident.

Shift length and frequency: Weekly rotations are common but often too long for systems with even moderate incident rates. With four or more people in a rotation, handing off twice a week (half-week shifts) keeps each stretch on call short without making any individual's turn come round too often. Any rotation structure that puts someone on call more than one week in three is overloading the people in it.
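The two bounds above, at most two incidents per 12-hour shift and no more than one week in three on call, can be expressed as simple checks. This sketch assumes a single-tier rotation shared evenly between engineers; the function names are hypothetical.

```python
def on_call_fraction(engineers: int) -> float:
    """Share of time each person carries the pager in an evenly shared,
    single-tier rotation."""
    if engineers < 1:
        raise ValueError("a rotation needs at least one engineer")
    return 1 / engineers


def rotation_overloaded(engineers: int, limit: float = 1 / 3) -> bool:
    """True when each person is on call more than the limit (one week in three)."""
    return on_call_fraction(engineers) > limit


def shift_overloaded(incidents_in_shift: int, bound: int = 2) -> bool:
    """True when a 12-hour shift exceeded the two-incident bound."""
    return incidents_in_shift > bound


print(rotation_overloaded(2))  # True
print(rotation_overloaded(3))  # False
print(shift_overloaded(3))     # True
```

By this measure three engineers is the smallest rotation that stays within the one-in-three bound, and only if nobody is on leave.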

Follow-the-sun: If your team spans time zones, use it deliberately. A UK engineer on call during UK business hours and a US engineer covering the US equivalent means nobody is routinely woken up at 3 a.m. The handoff requires written discipline — a status update at shift change covering open investigations, any systems in a degraded-but-stable state, and anything the incoming responder should watch — not a verbal "nothing going on." Verbal handoffs lose context. Written handoffs create a record.

Compensation and recognition: On-call is work, including nights and weekends. Teams that explicitly compensate for on-call burden — whether through additional pay, time in lieu, or a clearly stated and consistently applied policy — have materially lower attrition from the rotation than teams where it is absorbed into salary and never acknowledged. The specific mechanism matters less than the acknowledgement that carrying the pager has a real cost.

Runbook quality as an entry requirement: A runbook that says "investigate the logs" is not a runbook. Before any alert reaches the production pager, require a runbook that a junior engineer who has never touched the service could follow at 3 a.m. after being woken up. Test this literally: hand it to someone who did not write it, have them read it cold, and ask if they know what to do. Confusion in the review means the runbook is wrong — not the engineer.

The postmortem loop that actually quiets the pager

The purpose of a postmortem is not to produce a document. It is to make the next incident less likely and the one after that easier to handle. Teams that run postmortems as a process ritual without feeding outputs into engineering work pay the time cost of the process with none of the reliability dividend.

A postmortem that actually reduces future pages produces up to three concrete outputs: ideally all three, at minimum one.

A systemic fix: A code, configuration, or architecture change that removes or substantially mitigates the root cause. This is the highest-value output and the most frequently skipped one. If a postmortem produces only documentation and monitoring changes, the underlying cause is still present and will produce another incident.

A runbook improvement: The responding engineer writes down what they did not know at 3 a.m. that they wish they had. Minimum viable output: add a "what this incident looked like and what we did" entry to the relevant runbook, so the next responder is faster and more confident.

An alert modification or deletion: If an alert fired but added no value — it fired after user-visible impact was already caught by another alert, or the prescribed action turned out to be wrong — it gets modified or deleted. This is the direct mechanism by which postmortems make the pager quieter over time.

# Illustrative postmortem workflow for a checkout service incident
# Root cause: Redis connection pool exhausted under burst load
 
# 1. Systemic fix: increase pool size, add connection timeout
kubectl set env deployment/checkout-service \
  REDIS_POOL_MAX=50 \
  REDIS_CONNECT_TIMEOUT_MS=500
 
# 2. Runbook update: add the Redis pool pattern to the checkout runbook
# runbooks/checkout-error-rate.md — append to "Common causes":
#
# ### Redis connection pool exhaustion
# Symptom: 5xx spike with "connection pool exhausted" in logs
# Check: kubectl exec -it <pod> -- redis-cli -h redis info clients
# Action: kubectl set env deployment/checkout-service REDIS_POOL_MAX=<current+20>
# Escalate if: pool is at max and Redis latency is also elevated (separate issue)
 
# 3. Alert audit: the 'RedisConnectedClientsHigh' alert fired 8 minutes
# after the checkout error rate alert caught the incident.
# It added no signal. Delete it — the symptom alert caught this first.
# git rm monitoring/alerts/redis-connected-clients-high.yaml

The three-output requirement prevents postmortems from becoming a writing exercise. Every postmortem in this format has a direct link to engineering work, a direct link to responder knowledge, and a direct link to the pager getting quieter.
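One way to keep that requirement honest is to record the outputs as structured fields rather than free text. A minimal sketch follows, assuming each output is stored as a link to the corresponding change; the `Postmortem` class, its field names, and the example identifiers are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Postmortem:
    incident_id: str
    systemic_fix: Optional[str] = None    # link to the code or config change
    runbook_update: Optional[str] = None  # link to the runbook diff
    alert_change: Optional[str] = None    # link to the alert edit or deletion

    def outputs(self) -> list[str]:
        """Names of the outputs that have actually been linked."""
        names = ("systemic_fix", "runbook_update", "alert_change")
        return [name for name in names if getattr(self, name)]

    def meets_minimum(self) -> bool:
        """At least one concrete output is linked."""
        return len(self.outputs()) >= 1


# Hypothetical records.
writing_exercise = Postmortem("INC-A")
useful = Postmortem("INC-B", systemic_fix="change-1", alert_change="change-2")

print(writing_exercise.meets_minimum())  # False
print(useful.outputs())                  # ['systemic_fix', 'alert_change']
```

A postmortem with no linked outputs is easy to spot in a report, which makes the "document only" failure mode visible.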

The only metric that proves reliability is genuinely improving — not just better-watched — is on-call getting quieter, quarter over quarter.
Principle articulated in Google Site Reliability Engineering, Beyer et al., O'Reilly, 2016

Measuring on-call health

You cannot improve what you do not measure, and most teams measure on-call incorrectly. They track uptime and MTTR — useful but lagging indicators — while ignoring the health of the rotation itself. The metrics that matter for on-call health are distinct from, though related to, incident metrics.

| Metric | What it measures | Target direction |
|---|---|---|
| Pages per engineer per week | Rotation load and noise level | Trending down |
| Actionable page ratio | Signal quality of the alerting system | Trending toward 100% |
| Time to first meaningful action | Runbook clarity combined with responder availability | Stable or declining |
| MTTR (mean time to restore) | Incident resolution effectiveness | Trending down |
| Rotation equity index | Whether on-call load is evenly distributed | Variance narrowing |
| Postmortem completion rate | Whether learning is being captured and acted on | 100% for P0 and P1 |
| Alerts modified or deleted per quarter | Active triage practice versus passive acceptance | Non-zero; sustained |

The leading indicator that matters most is pages per engineer per week, trended over a rolling quarter. Elite on-call rotations are measurably getting quieter — not because incidents stop happening, but because the team is continuously converting incident learnings into reliability improvements. If that number is flat or rising, something in the postmortem-to-engineering-work loop is broken: either defects are being filed but not fixed, or the weekly review is not happening, or systemic fixes are being deferred.
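Two of the metrics in the table fall straight out of a per-engineer page count. The sketch below computes them; the article does not define the rotation equity index, so the coefficient of variation used here is an assumption, and the names and sample counts are illustrative only.

```python
from statistics import mean, pstdev


def pages_per_engineer(pages_by_engineer: dict[str, int]) -> float:
    """Average pages carried by each engineer over the period."""
    return mean(list(pages_by_engineer.values()))


def equity_index(pages_by_engineer: dict[str, int]) -> float:
    """Coefficient of variation of page counts (assumed definition):
    0.0 means perfectly even load, larger means more uneven."""
    counts = list(pages_by_engineer.values())
    average = mean(counts)
    return pstdev(counts) / average if average else 0.0


# Hypothetical week.
even = {"ana": 6, "bo": 6, "chen": 6}
uneven = {"ana": 0, "bo": 20}

print(pages_per_engineer(even))  # 6
print(equity_index(even))        # 0.0
print(equity_index(uneven))      # 1.0
```

Tracked week over week, a falling average alongside a falling index would suggest the pager is getting quieter for everyone rather than for a lucky few.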

DORA's elite performers (change failure rates around 5%, time to restore under one hour) do not produce those outcomes through better tooling alone. The outcomes come from teams that treat every incident as debt to be paid, not cost to be absorbed. The compounding effect is real: a team that systematically reduces on-call load builds the reliability muscle over time, while a team that accepts noise as permanent slowly loses the engineers who care most about doing the work properly.

On-call health targets worth tracking

  • Max incidents per 12h shift: 2 (quality threshold)
  • Actionable page ratio: >80% (healthy signal floor)
  • Pages per week: trending down (the proof of improvement)
  • P0/P1 postmortem rate: 100% (learning discipline)

Source: Google SRE Book; DORA State of DevOps, 2024; ClimsTech operational practice