Observability at scale: when telemetry becomes a deletion problem

At small scale, observability is a data-collection problem: you don't have enough signal to answer "why is checkout slow right now?" At large scale it inverts completely. You have metrics, logs and traces pouring in from every pod, a monitoring bill that has quietly become a top-three line item, and still no fast answer. Collection is no longer the bottleneck — modern agents collect almost everything by default. The bottleneck is deletion. Cost scales with traffic; value scales with engineering discipline — and most teams have automated the former while neglecting the latter.

The canonical cautionary tale is Coinbase. A roughly $65M-per-year observability bill surfaced publicly only when a JPMorgan analyst reverse-engineered the figure from Datadog's Q1-2023 earnings call. The nuance matters: that was 2021 usage, settled in Q1 2022, during crypto hypergrowth, when the user base roughly quadrupled year-on-year and cost control was explicitly not the priority. Datadog later restructured the contract to retain the account; Coinbase explored a self-hosted Grafana/Prometheus/ClickHouse stack for control. The lesson is not "Datadog is expensive." It's that observability cost is a first-class architectural concern that compounds silently with growth, and that nobody notices until a shocking invoice arrives long after the usage that caused it.

The shape of the problem

| Figure | What it measures | Context |
|---|---|---|
| ~$65M | One year's observability bill | Coinbase / Datadog, 2021 usage |
| 8 | Observability tools per company | Down from 9 in 2024 |
| 17% | Share of infra spend on observability | Median 10% |
| 36% | Enterprises spending $1M+/yr on it | ~4% exceed $10M/yr (Gartner) |

Source: Datadog earnings call via The Pragmatic Engineer (2023); Grafana Labs Observability Survey (2025); Gartner (2025)

The Grafana Labs 2025 Observability Survey (1,255 respondents) tells the same story from the inside: companies run an average of eight observability technologies each, drawn from 101 distinct tools across the field — sprawl, not consolidation. Cost is now a primary tool-selection factor for 74% of teams, and the top three pain points are complexity (39%), signal-to-noise (38%) and cost (37%). None of those is "we don't have enough data."

The teams with the calmest on-call rotations are not the ones collecting the most. They are the ones who decided, on purpose, what not to collect.

— The senior position on observability cost

Treat the pipeline like a distributed system: backpressure and a budget

The default mental model is wrong. Most teams treat telemetry as a bucket — when it overflows, you buy a bigger bucket. The senior model treats it as a flow that, like any other distributed system, needs backpressure and a budget. The vendor is not infinitely elastic, and your bill is the proof. Adding a high-cardinality metric must stop being a frictionless one-line PR with an unbounded cost tail, and start being a reviewed trade.

Default: telemetry as a bucket

- Every service ships its raw firehose straight to the SaaS
- A new high-cardinality metric is a zero-friction PR
- Cost tracks traffic, so growth produces shocking, lagging invoices
- The only lever left is buying a bigger plan

Senior: telemetry as a flow with backpressure

- One Collector chokepoint: sample, filter, drop tags, route
- Hard caps at emission that degrade gracefully under load
- A per-team telemetry budget, attributed like a compute budget
- New high-cardinality series is a reviewed trade, not a freebie

The mental-model shift the rest of this article turns on

Backpressure needs a physical place to live, and that place is the collection layer — the OpenTelemetry Collector for traces/logs/metrics, or your Prometheus scrape config for metrics. This is the single egress chokepoint where you reduce cardinality, tail-sample traces, filter logs and route to multiple backends before data ever reaches a priced ingest tier. Crucially, you can do all of that without re-instrumenting application code.

Control point

Instrument once with OpenTelemetry; the Collector becomes the one place to apply backpressure (sampling, filtering, tag-dropping, routing) before data hits a priced backend. The backend becomes swappable, and the chokepoint becomes portable across vendors.

Source: Pattern; OpenTelemetry reached CNCF Graduated status, 11 May 2026

This portability is no longer speculative. OpenTelemetry reached Graduated status in the CNCF on 11 May 2026 and is now the foundation's second-highest-velocity project behind Kubernetes itself, with more than 12,000 contributors from over 2,800 companies. Standardising on the Collector as your egress point is a bet the whole industry has already made. The common failure is adopting OTel as a vendor swap — instrumenting with it but still shipping the raw firehose straight through — which misses the entire point. The Collector is the backpressure valve, not a passthrough.

The budget half is organisational, not technical. Give each team a telemetry budget the way you'd give them a compute budget, and attribute spend back to the team that emits it. The moment a squad sees its own line item, "log everything just in case" turns into "what do we actually open during an incident?" — which is the only question that matters.

Cardinality is where the money goes

For metrics, cost is dominated by one thing: the number of active time series, which is bounded above by the product of every label's distinct values. Put a user_id or a raw URL path on a metric and cardinality detonates combinatorially. This is the number-one cause of both Prometheus degradation and surprise SaaS bills, and the mechanics are unforgiving. Each active series costs roughly 3–4KB in the Prometheus head block, so a million active series is 3–4GB of RAM just for series overhead, scaling roughly linearly with unique label combinations. Worse, a series scraped once stays resident in memory for over 2.5 hours, so churned, short-lived labels (a pod name, a deploy hash) carry wildly disproportionate cost.
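To make the churn effect concrete, here is a minimal back-of-the-envelope sketch. It assumes the midpoint of the 3–4KB range quoted above and a deliberately simplified model in which every full rollout inside the residency window leaves one more copy of each pod-labelled series in memory. The function names and the example fleet (200 pods, 500 series each) are hypothetical.

```python
BYTES_PER_SERIES = 3.5 * 1024  # assumption: midpoint of the 3-4 KB per-series range


def head_memory_gib(series: int) -> float:
    """Rough head-block RAM for series overhead alone, in GiB."""
    return series * BYTES_PER_SERIES / 1024**3


def resident_series(series_per_pod: int, pods: int, rollouts_in_window: int = 0) -> int:
    """Series held in memory when each rollout replaces every pod.

    Simplified model: series from replaced pods stay resident until the head
    block is truncated, so each rollout inside that window adds another copy.
    """
    return series_per_pod * pods * (1 + rollouts_in_window)


steady = resident_series(500, 200)
after_three_rollouts = resident_series(500, 200, rollouts_in_window=3)
print(steady, round(head_memory_gib(steady), 2))
print(after_three_rollouts, round(head_memory_gib(after_three_rollouts), 2))
```

Under these assumptions, three rollouts in quick succession quadruple the resident series even though the number of live pods never changed, which is why pod names and deploy hashes are such expensive labels.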

The reference example of doing this at extreme scale with discipline is Cloudflare, which runs Prometheus at around 4.9 billion active series across 916 instances; its largest instances hold roughly 30M series each and ingest 550,000 samples per second. They survive that not by buying elasticity but by engineering hard guardrails into every scrape: a ceiling of 64 labels per series, label names capped at 128 characters and values at 512, and a default sample_limit of 200 series per application. Exceed a limit and the excess series are dropped gracefully, with the scrape still succeeding, rather than OOM-ing the server. That behaviour comes from Cloudflare's own patches: stock Prometheus treats the entire scrape as failed when sample_limit is exceeded. The graceful degradation is what lets non-expert teams ship safely.

You bound cardinality at the point of emission with relabeling plus a hard cap:

scrape_configs:
  - job_name: api
    sample_limit: 5000               # hard cap: stock Prometheus fails the whole scrape when exceeded
    metric_relabel_configs:
      - regex: 'user_id|request_id'   # nuke unbounded labels before ingestion
        action: labeldrop
      - source_labels: [__name__]
        regex: 'go_gc_duration_seconds.*'
        action: drop                  # drop an entire noisy metric family
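Relabeling removes labels after the fact; the cheaper fix is to never emit the unbounded value. A minimal sketch of bounding two common offenders at the point of emission follows. The helper names and the ID patterns are hypothetical and would need adjusting to the route shapes a real service uses.

```python
import re

# Hypothetical patterns: adjust to the ID formats your routes actually contain.
_UUID = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)
_NUMERIC = re.compile(r"\d+")


def route_template(path: str) -> str:
    """Collapse ID-like path segments so /users/42 and /users/43 share one label value."""
    segments = []
    for segment in path.split("?", 1)[0].split("/"):  # query string is discarded
        if _UUID.fullmatch(segment):
            segments.append(":uuid")
        elif _NUMERIC.fullmatch(segment):
            segments.append(":id")
        else:
            segments.append(segment)
    return "/".join(segments)


def status_class(code: int) -> str:
    """Map an HTTP status code onto its class, e.g. 503 -> '5xx'."""
    return f"{code // 100}xx"


print(route_template("/users/42/orders/7?page=2"), status_class(503))
```

Most web frameworks already know the matched route template, so reading it from the router is preferable to regex guessing where that is available; the sketch only illustrates the shape of the label you want.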

A worked example: how one label becomes a five-figure mistake

Datadog bills custom metrics by cardinality at roughly $5 per 100 custom metrics per month — about $0.05 per series — where a "custom metric" is one unique combination of metric name, tag values and host. Watch what a single well-meaning tag does to a hot metric:

http.request.duration
  tags: service(30) × route_template(40) × status_class(5)
      = 6,000 series  →  6,000 × $0.05/mo  ≈  $300/mo
 
  + one new label: user_id (≈ 50,000 distinct values)
      = 6,000 × 50,000  =  300,000,000 series   (theoretical)
      a 50,000× cardinality multiplier from a single line of code

In practice you'd hit a cap or your wallet long before 300M series, which is exactly the point of having a cap. Volume discounts soften the absolute dollars at scale, but they do nothing to the multiplier: one tag turned a $300 line item into a runaway one. When you genuinely need high-cardinality tags occasionally, the mitigation is to split ingestion from indexing. Datadog's "Metrics without Limits" (and equivalents) lets you ingest the full tag set while indexing, and paying indexed rates for, only the tags you choose to keep queryable, so cardinality you don't query stops driving the bill.
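The arithmetic above is easy to reproduce, which makes it a useful check to run before a new tag is approved. This is a minimal sketch that assumes the roughly $0.05 per series list price quoted above and ignores volume discounts; the 20,000-series cap is a hypothetical value chosen only to show the effect of having one.

```python
PRICE_PER_SERIES_USD = 0.05  # assumption: ~$5 per 100 custom metrics per month, list price


def series_count(*label_cardinalities: int) -> int:
    """Worst-case series for one metric: the product of each tag's distinct values."""
    total = 1
    for distinct_values in label_cardinalities:
        total *= distinct_values
    return total


def monthly_cost_usd(series: int, cap=None) -> float:
    """Monthly list-price cost, optionally after a hard cap on emitted series."""
    billed = series if cap is None else min(series, cap)
    return billed * PRICE_PER_SERIES_USD


base = series_count(30, 40, 5)              # service x route_template x status_class
exploded = series_count(30, 40, 5, 50_000)  # the same metric with user_id added
print(base, monthly_cost_usd(base))
print(exploded, exploded // base)           # theoretical series and the multiplier
print(monthly_cost_usd(exploded, cap=20_000))  # hypothetical cap of 20,000 series
```

The cap turns an unbounded bill into a bounded one, at the price of losing the series beyond it, which is the trade the review gate exists to make explicit.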

The signal pyramid: each fact in its cheapest correct home

The three pillars are not interchangeable, and most teams have the balance wrong. The 2025 Grafana survey shows metrics and logs at near-universal adoption but traces lagging — and traces are the one signal that actually answers "where did the time go?"

What teams actually collect — and the gap

| Signal | Adoption |
|---|---|
| Metrics | 95% |
| Logs | 87% |
| Traces | 57% |
| Profiles | 16% |

Source: Grafana Labs Observability Survey, 2025 (1,255 respondents)

The discipline is to put each fact in its cheapest correct home. Low-cardinality metrics are the cheap, always-on base your alerts fire on. Tail-sampled traces are the middle — sampled, medium-cost, and the right place for the question "which service, which span?" High-cardinality logs are the apex: expensive forensic detail you keep narrow and short.

| Signal | Cardinality it tolerates | Cost driver | Answers | Typical retention |
|---|---|---|---|---|
| Metrics | Low (bounded labels) | Active series count | "Is it broken? How much, how fast?" | 13+ months, downsampled |
| Traces (tail-sampled) | Medium (sampled) | Spans ingested / indexed | "Where did the time go across services?" | Days to weeks |
| Logs | High (forensic) | Bytes ingested | "Exactly what happened on this request?" | Days hot, then cold archive |

These map onto a single incident path. You get paged on a symptom (a metric), open a dashboard of causes, narrow to a trace, and finish in the logs for the one request that matters. You move down the pyramid — and down the cost curve — only as far as each incident requires.

The incident path — the order you actually move through signals

1. SLO symptom breach. An error-budget burn-rate rule fires on what the user feels: latency or error rate.
2. Page. A human is woken only when both a long and a short window agree it's real.
3. Dashboard. Open the causes view: saturation, queue depth, recent deploys. Never paged on directly.
4. Trace. Tail-sampled traces show where the time or the error actually went across services.
5. Log. High-cardinality forensic detail, scoped to the one request that matters.

Source: Google SRE Workbook, 'Alerting on SLOs'

Tail-based sampling: keep the errors and the slow, drop the boring

Storing 100% of traces at scale means paying to keep millions of identical happy-path traces you will never open. The naive fix — head-based probabilistic sampling — is worse than it looks, because it makes the keep/drop decision at the start of a trace, before you know whether it errored or ran slow. It will cheerfully discard the one trace you needed.

Tail-based sampling makes the decision after the trace completes: keep all errors, keep everything above a latency threshold, and keep a small probabilistic baseline of the healthy majority for a representative picture. In the OpenTelemetry Collector that's a composed policy:

processors:
  tail_sampling:
    decision_wait: 10s           # default 30s; lower = less memory, less time to catch late spans
    num_traces: 100000           # in-memory traces held per collector — this drives RAM
    policies:
      - name: keep-errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: keep-slow
        type: latency
        latency: { threshold_ms: 1000 }
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
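The three policies combine as a logical OR: a trace is kept if any one of them says keep. The sketch below illustrates that decision in plain Python. It is not the Collector's implementation; the `Trace` type and the hash-based baseline are assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Trace:
    trace_id: str
    has_error: bool
    duration_ms: float


def keep(trace: Trace, threshold_ms: float = 1000.0, baseline_pct: float = 5.0) -> bool:
    """Decide after the trace has completed: errors, slow traces, then a small baseline."""
    if trace.has_error:
        return True
    if trace.duration_ms > threshold_ms:
        return True
    # Deterministic baseline: hash the trace ID into one of 10,000 buckets so the
    # same trace always gets the same answer, whichever process evaluates it.
    digest = hashlib.sha256(trace.trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < baseline_pct * 100


print(keep(Trace("a1", has_error=True, duration_ms=12.0)))
print(keep(Trace("b2", has_error=False, duration_ms=2500.0)))
```

Because the error and latency checks need the finished trace, the sampler has to buffer spans until the decision is made, which is where the memory cost comes from.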

Two operational facts decide whether this works. First, the defaults drive memory: decision_wait is 30s and num_traces is 50,000, and both hold spans in RAM while the Collector waits to see the whole trace. Second — and this is the one people miss — every span of a trace must reach the same Collector instance, or the sampler sees a fragment and decides wrong. That requires a trace-ID-aware load balancer upstream:

# upstream tier — pin every span of a trace to one sampling collector
exporters:
  loadbalancing:
    routing_key: traceID
    resolver:
      dns:
        hostname: otel-collector-sampling.observability.svc.cluster.local
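The routing requirement boils down to one property: the backend chosen for a span must depend only on its trace ID. A minimal sketch of that property follows, using a plain hash-and-modulo choice. This is an illustrative assumption rather than how the exporter works internally; in particular, simple modulo routing reshuffles most traces whenever the backend list changes, which a production load balancer has to handle more carefully.

```python
import hashlib
from typing import List


def pick_collector(trace_id: str, collectors: List[str]) -> str:
    """Choose a sampling collector from the trace ID alone, so every span agrees."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    return collectors[int.from_bytes(digest[:8], "big") % len(collectors)]


# Hypothetical backend names for the example.
backends = ["sampler-0", "sampler-1", "sampler-2"]

# Spans of the same trace arrive from different services at different times,
# but all of them resolve to the same sampler.
spans = [("4bf92f35", "gateway"), ("4bf92f35", "checkout"), ("4bf92f35", "payments")]
print({pick_collector(trace_id, backends) for trace_id, _service in spans})
```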

Zendesk's published Datadog optimisation (begun February 2024) is the proof this holds spend flat through growth. They switched their monolith to single-root-span ingestion with enriched metadata instead of full trace trees, deduplicated logs, and curated 1,435 log exclusion filters — cutting one high-volume service's log volume 4x, holding APM ingested bytes flat while monolith traffic doubled, and roughly halving core-database telemetry TCO. The thesis underneath it is Pareto: a small slice of telemetry drives most of both the cost and the value, so you delete aggressively without losing troubleshooting fidelity.

Log with intent: structured, sampled, tiered

Logs are usually the single largest line item, and the dominant cost is ingestion, not query or retention. Cost analyses through 2025 found CloudWatch Logs ingestion reaching as much as ~90% of a team's observability spend — frequently traced to a single debug flag left on or a missing retention policy. Meanwhile a large share of stored logs is never queried after the first week. You are paying premium hot-tier rates to store noise.

Four habits fix most of it. Emit structured JSON so logs are queryable, not greppable. Sample high-volume success paths — you rarely need every 200 OK. Tier retention: a few days hot, then archive to cheap object storage. And push filtering into the pipeline so it happens before the priced ingest tier, not after:

processors:
  filter/logs:
    error_mode: ignore
    logs:
      log_record:
        - 'severity_number != SEVERITY_NUMBER_UNSPECIFIED and severity_number < SEVERITY_NUMBER_INFO'   # drop debug/trace chatter, keep records with no severity set
        - 'IsMatch(body, ".*healthcheck.*")'         # drop health-check spam
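The first two habits, structured output and sampling of success paths, can also be applied inside the application. The following is a minimal sketch using Python's standard logging module; `JsonFormatter`, `SuccessSampler` and `build_logger` are hypothetical names, and the 1% keep ratio is an arbitrary example value.

```python
import io
import json
import logging
import random
from typing import Optional


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


class SuccessSampler(logging.Filter):
    """Keep every WARNING-and-above record; keep only a fraction of the rest."""

    def __init__(self, keep_ratio: float, rng: Optional[random.Random] = None) -> None:
        super().__init__()
        self.keep_ratio = keep_ratio
        self.rng = rng or random.Random()

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True
        return self.rng.random() < self.keep_ratio


def build_logger(name: str, stream, keep_ratio: float) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(SuccessSampler(keep_ratio))
    logger = logging.getLogger(name)
    logger.handlers.clear()   # avoid stacking handlers if called twice
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


stream = io.StringIO()
log = build_logger("checkout", stream, keep_ratio=0.01)
for _ in range(1000):
    log.info("order placed")           # high-volume success path, mostly dropped
log.error("payment provider timeout")  # always kept
print(len(stream.getvalue().splitlines()))
```

Sampling in the application saves serialisation and network cost as well as ingest, but it puts the policy in every service; doing it in the pipeline keeps the policy in one place. Which side to favour is a design choice rather than a rule.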

The single most valuable alert you don't have is one on log-ingestion volume per service, so a debug flag that silently doubles your bill pages someone the same day — not at month-end.
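The check itself is simple: compare each service's ingest today against its own recent baseline. A minimal sketch follows, assuming you can already export per-service byte counts from your pipeline or billing data. The function name, the 2x ratio and the 1GB floor are hypothetical starting points, not recommendations.

```python
from typing import Dict, List


def ingest_spikes(
    today_bytes: Dict[str, int],
    baseline_bytes: Dict[str, int],
    ratio: float = 2.0,
    floor_bytes: int = 1_000_000_000,
) -> List[str]:
    """Return services whose log ingest today is at least `ratio` times their baseline."""
    flagged = []
    for service, volume in sorted(today_bytes.items()):
        if volume < floor_bytes:
            continue  # too small to matter for the bill
        usual = baseline_bytes.get(service, 0)
        if usual == 0 or volume >= ratio * usual:
            flagged.append(service)  # a service with no baseline is flagged for review
    return flagged


today = {"checkout": 40_000_000_000, "search": 11_000_000_000,
         "cron": 5_000_000, "new-svc": 3_000_000_000}
baseline = {"checkout": 18_000_000_000, "search": 10_000_000_000, "cron": 1_000_000}
print(ingest_spikes(today, baseline))
```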

Alert on symptoms, not causes

The fastest route to alert fatigue is paging on every cause. And fatigue is now a reliability risk, not an annoyance: in a 2026 survey of over a thousand SRE/DevOps practitioners, 77% of on-call teams get ten or more alerts a day, 57% say fewer than 30% are actionable, and 83% of engineers admit to ignoring or dismissing alerts at least occasionally. An alert nobody trusts is worse than no alert, because the real incident arrives wearing the same uniform as the noise.

The stakes on the other side are why this matters. ITIC's 2024 study put a single hour of downtime above $300k for more than 90% of mid and large enterprises, with 41% citing $1M–$5M or more per hour; Splunk and Oxford Economics estimated Global-2000 downtime at roughly $400B a year, around 9% of profits. You cut noise precisely so the page that means "checkout is down" actually gets answered.

So page only on what the user feels, latency and error rate, tied to an SLO, using Google's multi-window, multi-burn-rate method. You require both a long and a short window to breach (the short being roughly a twelfth of the long), so a fast spike pages quickly, sustained drift is confirmed, and transient blips are filtered. A burn rate of 1 would consume a whole 30-day budget in exactly 30 days; the alert tiers scale from there:

| Burn rate | Long / short window | Budget consumed | Action |
|---|---|---|---|
| 14.4× | 1h and 5m | 2% in an hour | Page |
| 6× | 6h and 30m | 5% in six hours | Page |
| 1× | 3d and 6h | 10% in three days | Ticket |
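The burn rates in the table are not magic numbers; each follows from the budget fraction and the long window. A minimal sketch of that arithmetic, assuming the 30-day SLO period used above (the function names are hypothetical):

```python
from datetime import timedelta

SLO_PERIOD = timedelta(days=30)


def burn_rate_for(budget_fraction: float, window: timedelta,
                  period: timedelta = SLO_PERIOD) -> float:
    """Burn rate that spends `budget_fraction` of the error budget within `window`."""
    return budget_fraction * (period / window)


def error_ratio_threshold(burn_rate: float, slo: float) -> float:
    """Error ratio at which that burn rate is reached, e.g. 14.4 * 0.001 for 99.9%."""
    return burn_rate * (1.0 - slo)


tiers = [
    (0.02, timedelta(hours=1)),   # 2% of the budget in one hour
    (0.05, timedelta(hours=6)),   # 5% in six hours
    (0.10, timedelta(days=3)),    # 10% in three days
]
for fraction, window in tiers:
    rate = burn_rate_for(fraction, window)
    print(f"{fraction:.0%} in {window}: burn rate {rate:.1f}, "
          f"error-ratio threshold {error_ratio_threshold(rate, 0.999):.4f}")
```

The same two functions give the thresholds for any other SLO target or period, which is useful when a service's objective is not 99.9%.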

For a 99.9% SLO (a 0.1% error budget), the fast-burn page is a few lines of PromQL on pre-recorded ratios:

groups:
  - name: slo-burn
    rules:
      - alert: ErrorBudgetFastBurn
        expr: |
          (job:slo_errors:ratio_rate1h{job="checkout"}  > (14.4 * 0.001)
           and
           job:slo_errors:ratio_rate5m{job="checkout"}  > (14.4 * 0.001))
          or
          (job:slo_errors:ratio_rate6h{job="checkout"}  > (6 * 0.001)
           and
           job:slo_errors:ratio_rate30m{job="checkout"} > (6 * 0.001))
        for: 2m
        labels: { severity: page }
        annotations:
          summary: "Checkout error budget burning fast"
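The reason for requiring both windows is easiest to see with numbers. The sketch below mirrors the rule's boolean logic in Python and feeds it four hypothetical situations; the error ratios are invented for illustration.

```python
from typing import Dict


def should_page(ratios: Dict[str, float], slo: float = 0.999) -> bool:
    """Mirror of the fast-burn rule: a long and a short window must both breach."""
    budget = 1.0 - slo
    fast = ratios["1h"] > 14.4 * budget and ratios["5m"] > 14.4 * budget
    slow = ratios["6h"] > 6 * budget and ratios["30m"] > 6 * budget
    return fast or slow


scenarios = {
    # A brief spike: the 5m window is hot, but the 1h window has barely moved.
    "transient blip": {"5m": 0.05, "30m": 0.01, "1h": 0.004, "6h": 0.001},
    # A real outage: short and long windows agree.
    "outage": {"5m": 0.05, "30m": 0.05, "1h": 0.02, "6h": 0.004},
    # Sustained drift below the fast threshold but above the slow one.
    "slow burn": {"5m": 0.007, "30m": 0.007, "1h": 0.007, "6h": 0.007},
    # Already fixed: long windows still remember the errors, short ones are clean.
    "recovered": {"5m": 0.0, "30m": 0.001, "1h": 0.02, "6h": 0.007},
}
for name, ratios in scenarios.items():
    print(name, should_page(ratios))
```

The last scenario shows what the short window buys you: the page clears soon after the fix instead of lingering until the long window drains.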

Causes — CPU, disk at 80%, pod restarts, queue depth — go on the dashboard you open after the page, never in the page itself. They are how you diagnose, not how you get woken.

The pitfalls, and the fix for each

These are the failure modes we see most often, each with the specific countermeasure.

Unbounded labels on metrics. user_id, raw URL paths, request_id, container_id, commit SHAs, full status codes: every distinct value mints a series that sits in RAM for hours. Fix: route templates not raw paths; status classes (2xx/5xx) not every code; high-cardinality detail into traces and logs; enforce sample_limit plus labeldrop/drop rules in metric_relabel_configs; review top series by cost every sprint.

Treating the vendor as infinitely elastic. With no ingestion-side backpressure or budget, the bill tracks traffic and growth events produce lagging, shocking invoices — Coinbase being the extreme. Fix: an explicit per-team telemetry budget, cost attributed back to emitters, hard caps at emission, and a review gate on new high-cardinality metrics.

100% trace sampling, or naive head sampling. You either pay to store happy-path traces you never open, or randomly discard the errored trace you needed. Fix: tail-based sampling — keep all errors and slow traces plus a small baseline — with a trace-ID-aware load balancer upstream and collector memory budgeted via num_traces and decision_wait.

Alerting on causes instead of symptoms. Cause-based paging is the fastest path to the "fewer than 30% actionable" trap, and real incidents drown. Fix: page on user-facing symptoms tied to SLOs with multi-window, multi-burn-rate rules; demote causes to the post-page dashboard.

Logging everything with indefinite hot retention. Logs are usually the biggest line item, most are never queried after week one, and a single debug flag can silently double volume. Fix: structured JSON, sample success paths, tier retention (days hot, then cold archive), and filter in the pipeline before the priced tier — with an alert on per-service ingest volume.

Adopting OpenTelemetry as a vendor swap. Instrumenting with OTel but shipping the raw firehose straight through misses the entire point. Fix: make the Collector the single egress chokepoint and do cardinality reduction, tail sampling, log filtering and multi-backend routing there. Its CNCF graduation makes that work portable across vendors.