Kubernetes cost optimisation: a utilisation problem, not a price problem

Most teams attack the Kubernetes bill from the wrong end. They go shopping for discounts — Spot, Savings Plans, committed-use deals — before they have done anything about how much capacity they are wasting. It feels like progress because the percentage-off numbers are large. It is mostly theatre. The data is unambiguous: the Kubernetes bill is a utilisation problem, not a price problem. Across thousands of production clusters the average one runs at roughly 10% of the CPU and 23% of the memory it has been told to reserve, and over-provisioning — not list price — is the single most-named cause of overspend. A 77% Spot discount on a cluster that is 90% idle is still, overwhelmingly, paying for air.

This piece is about the four levers that actually move the bill, and — more importantly — the order to pull them in. Make requests honest, let the cluster scale in both directions (especially down, the step everyone skips), bin-pack onto fewer, fuller nodes, and only then apply Spot. The order is not a stylistic preference. Each lever shrinks the base the next one acts on, so discounts multiply efficiency and can never substitute for it. Every discount you buy is permanently capped by how much waste you fixed first. Two inconvenient facts prove the point, and we will keep returning to them: the single most popular "best practice" — CPU limits on everything — actively hurts both cost and reliability, while the most popular cost lever — Spot — pays off least until the boring work is done.

The average cluster: what you pay for vs what you actually use

| Resource | Provisioned | Actually used |
| --- | --- | --- |
| CPU | 100% | ~10% |
| Memory | 100% | ~23% |

Source: Cast AI, 2025 Kubernetes Cost Benchmark Report (2,100+ orgs, 2024 data)

That gap is not an exotic edge case — benchmarks across thousands of real clusters keep landing in the low double digits for CPU against what was requested. And it is getting worse, not better: in the same benchmark, average CPU utilisation fell from 13% the prior year to 10%. It is also the biggest single pool of savings on the table, and you reclaim it without touching reliability — which is exactly why it should come first.

The bill is a sizing problem before it is a pricing problem

Before you argue about Spot versus On-Demand, look at what the industry actually measures. Two large, independent datasets tell the same story from different angles.

The shape of Kubernetes overspend

| Figure | What it measures | Context |
| --- | --- | --- |
| 70% | Name over-provisioning as their #1 overspend cause | vs sprawl 43%, no visibility 40% |
| 49% | Saw costs rise after adopting Kubernetes | 17% rose 'significantly' |
| 65%+ | Of workloads use under half their requested CPU & memory | Datadog dataset |
| 10% / 23% | Average CPU / memory actually used | CPU down from 13% YoY |

Source: CNCF FinOps for Kubernetes microsurvey (via InfoQ); Datadog State of Containers and Serverless; Cast AI 2025

Read those together. The CNCF FinOps for Kubernetes microsurvey has practitioners self-reporting over-provisioning as the top cause of overspend at 70%, well ahead of resource sprawl (43%) and lack of visibility (40%). Datadog, looking at telemetry rather than opinions, finds more than 65% of workloads using under half of both their requested CPU and memory. Cast AI, looking at billed capacity across 2,100-plus organisations, gets the 10% and 23% utilisation figures. Three methods, one conclusion: the capacity is reserved and idle, not scarce and expensive.

There is a second, even more damning number. Before you even compare requests to usage, compare what is provisioned on the nodes to what workloads have requested — and the average gap is 40% for CPU and 57% for memory. That is capacity you bought and never even scheduled a pod against: pure bin-packing and scale-down loss, sitting underneath the utilisation loss. And lest anyone believe efficiency arrives free with the platform, the CNCF data is blunt — adopting Kubernetes pushed cloud costs up for 49% of organisations (17% significantly), left them unchanged for 28%, and reduced them for only 24%. Industry estimates that put roughly a quarter of all cloud spend on underused resources are softer — an estimate, not a measurement — but they point the same way.

A 77% discount on a cluster that is 90% idle is still, overwhelmingly, paying for air.

— The thesis in one line

Why the order is the whole method

The four levers are not a menu you pick from. They are a sequence, and the sequence is the entire argument, because each stage shrinks the base the next stage discounts.

The four levers — in order, because they compound

1. Honest requests. Set requests from observed p90/p95 so the scheduler stops bin-packing against fiction. Closes the 10% / 23% gap.
2. Autoscale both ways. HPA for services, KEDA for queues and scale-to-zero, Karpenter for nodes, and tune the scale-DOWN path everyone leaves on defaults.
3. Bin-pack. Consolidate onto fewer, fuller nodes. A wide instance choice lets the scheduler pack tight instead of leaving every node idle.
4. Then Spot. Move interruption-tolerant work to Spot with real draining. It discounts whatever base the first three left, so it goes last.

A worked example makes the compounding concrete. Take an illustrative cluster of 40 On-Demand m5.2xlarge nodes (8 vCPU and 32 GiB each) at roughly $0.38 per hour. That is about $11.2k a month in compute. Requests are set at perhaps two-and-a-half times real usage, so actual CPU sits near the benchmark 10%. Now sequence the levers. Right-size requests to p95 and let Karpenter consolidate, and the same workloads pack onto about 22 fuller nodes: roughly $6.2k, a 45% cut, with no change to what runs. Then move the stateless tier (about 70% of the fleet) onto Spot with On-Demand fallback. At the roughly 70% Spot discount these figures imply, the bill drops to about $3.1k, roughly 72% off the start, with the bulk of it banked before Spot ever entered the picture.

Same cluster, same Spot discount — sequenced right vs Spot-first

| Scenario | Monthly compute |
| --- | --- |
| Baseline: On-Demand, inflated requests | ~$11.2k/mo |
| Right-sized + consolidated (22 nodes) | ~$6.2k/mo |
| + Spot on the stateless tier | ~$3.1k/mo |
| Spot-FIRST on the un-fixed 40 nodes | ~$5.7k/mo |

Source: Illustrative worked example; AWS On-Demand pricing for m5.2xlarge

Now run it the popular way instead. Apply the same Spot mix first, to the un-fixed 40-node fleet, and you land near $5.7k. That looks like a win on a slide — about half off — and it quietly locks in 40 oversized nodes at a discount. The discount is computed on capacity you should have deleted. The end state of the disciplined order ($3.1k) is far below the end state of the Spot-first order ($5.7k), using the identical Spot discount, purely because the base it applied to was half the size. That is the compounding, in dollars.
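The arithmetic behind both orderings is short enough to check. The sketch below reproduces it under assumptions that are not stated in the example itself: 730 billable hours a month, an hourly price of $0.384 (the text only says "roughly $0.38"), and a 70% Spot discount, which is the value the quoted totals imply rather than a quoted price.

```python
HOURS_PER_MONTH = 730    # assumption: average hours in a month
NODE_HOURLY_USD = 0.384  # assumption: the text says "roughly $0.38"
SPOT_DISCOUNT = 0.70     # assumption: implied by the totals, not a quoted price
SPOT_SHARE = 0.70        # from the text: the stateless tier is ~70% of the fleet


def monthly_cost(nodes: int, spot_share: float = 0.0) -> float:
    """Monthly compute cost for a fleet with a fraction of its nodes on Spot."""
    on_demand = nodes * NODE_HOURLY_USD * HOURS_PER_MONTH
    return on_demand * (1 - spot_share * SPOT_DISCOUNT)


baseline = monthly_cost(40)
right_sized = monthly_cost(22)
right_sized_then_spot = monthly_cost(22, SPOT_SHARE)
spot_first = monthly_cost(40, SPOT_SHARE)

for label, cost in [
    ("Baseline", baseline),
    ("Right-sized + consolidated", right_sized),
    ("Right-sized, then Spot", right_sized_then_spot),
    ("Spot first, nothing fixed", spot_first),
]:
    print(f"{label:<28} ${cost:,.0f}/mo")
```

The Spot multiplier is identical in both paths; only the node count it multiplies differs, which is the whole argument for the ordering.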

Lever 1 — Make resource requests honest

Requests drive scheduling, and scheduling drives how many nodes you pay for. Set a request and the scheduler reserves that capacity on a node whether or not the pod ever touches it. Over-request out of caution — set requests to the peak you saw once, or to a round "safe" guess — and you have told the scheduler to bin-pack against fiction. It buys nodes for a peak that almost never recurs. That is precisely how a cluster ends up at 10% CPU.

The fix is to measure actual usage over a representative window — include the weekly peaks — and set requests to observed p90 or p95 plus modest headroom, not to a one-off spike.
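To make "p95 plus modest headroom" concrete, here is a minimal sketch using a nearest-rank percentile. The sample values, the 15% headroom and the function names are hypothetical, chosen only for illustration; VPA and Goldilocks use their own estimators, so treat this as the idea rather than their method.

```python
def percentile(samples: list[int], pct: int) -> int:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling of pct% of n
    return ordered[rank - 1]


def recommend_request(samples_millicores: list[int], pct: int = 95, headroom_pct: int = 15) -> int:
    """Request = observed percentile plus headroom, rounded up to a whole millicore."""
    base = percentile(samples_millicores, pct)
    return (base * (100 + headroom_pct) + 99) // 100


# Hypothetical CPU usage samples in millicores, including one 900m spike.
usage = [120, 130, 110, 140, 150, 125, 135, 145, 155, 160,
         115, 128, 132, 138, 142, 148, 152, 158, 170, 900]

print("peak:", max(usage), "m")
print("p95:", percentile(usage, 95), "m")
print("suggested request:", recommend_request(usage), "m")
```

Sizing to the single 900m spike would reserve more than four times what the p95-based figure does, for a peak that shows up once in twenty samples.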

```yaml
resources:
  requests:
    cpu: "200m"        # from observed p90/p95, not a round "safe" guess
    memory: "256Mi"
  limits:
    memory: "256Mi"    # always cap memory; size it from real p95 usage
    # no CPU limit: keep the request for fair scheduling, skip the CFS throttle
```

That snippet is deliberately Burstable, and that is correct for a stateless service. Your Quality-of-Service class falls out of how requests and limits relate, and it decides who gets killed first when a node runs hot.

| QoS class | How a pod gets it | Eviction under node pressure | Where it fits |
| --- | --- | --- | --- |
| Guaranteed | Every container sets requests equal to limits, for both CPU and memory | Evicted last | Stateful / latency-critical pods you never want bumped |
| Burstable | At least one request set, but not full requests-equal-limits | In between, ranked by usage above its request | The right default for stateless services |
| BestEffort | No requests or limits at all | Evicted first | Almost nothing you run on purpose |

Note the nuance most "set Guaranteed for safety" advice skips: Guaranteed requires a CPU limit equal to the CPU request. So Guaranteed is the one place a CPU limit earns its keep — reserve it for the handful of pods that must be the last thing standing when a node is under pressure, and size that CPU limit from real peak so the throttling penalty below stays rare. For everything else, Burstable with honest requests and no CPU limit is both cheaper and more reliable. A common, sane community pattern: memory request at steady-state p95, memory limit at the request (or up to 1.5–2x it), and an alert at about 80% of the limit.
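The classification rule is mechanical, so it can be sketched in a few lines. This is a simplified illustration, not the kubelet's code: the function name and the dict shape are hypothetical, and quantities are compared as plain strings, so "200m" and "0.2" would wrongly count as different here.

```python
def qos_class(containers: list[dict]) -> str:
    """Classify a pod from its containers' requests and limits (simplified)."""
    anything_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            anything_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # An unset request defaults to the limit when a limit exists.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not anything_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


stateless = [{"requests": {"cpu": "200m", "memory": "256Mi"},
              "limits": {"memory": "256Mi"}}]
critical = [{"requests": {"cpu": "1", "memory": "2Gi"},
             "limits": {"cpu": "1", "memory": "2Gi"}}]

print(qos_class(stateless))  # no CPU limit, so it cannot be Guaranteed
print(qos_class(critical))
print(qos_class([{}]))
```

The first pod mirrors the request and limit pattern recommended for stateless services: dropping the CPU limit is exactly what moves it from Guaranteed to Burstable.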

Do not set any of this from vibes. Run the Vertical Pod Autoscaler in recommendation-only mode and read its targets, or use Goldilocks (which creates those VPAs for you and renders a dashboard).

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  updatePolicy:
    updateMode: "Off"   # recommend only: read .status.recommendation, evict nothing
```

```shell
# Goldilocks: auto-create Off-mode VPAs for every workload in a namespace
kubectl label ns payments goldilocks.fairwinds.com/enabled=true
```

VPA's update modes fall into three behaviours: Off (recommend, change nothing), Initial (set requests only at pod creation), and Recreate/Auto (evict pods to apply new requests). Do not run VPA in Auto on the same workload and metric as an HPA; they will fight over the same signal. Recommendation mode plus human review is the safe default, re-tuned on a cadence as traffic shifts.

CPU limits are the trap

Here is the kicker the opening promised. CPU throttling is almost never caused by an exhausted node — it is caused by CPU limits. The Linux CFS scheduler enforces a CPU limit in fixed 100ms windows. A container with a 200m limit is granted roughly 20ms of CPU per 100ms window; the moment it spends that budget, the kernel pauses it for the remaining ~80ms — even if every other core on the node is idle. The symptom is a service with mysterious tail-latency spikes while node CPU graphs look calm. Teams misdiagnose it as an application bug and "fix" it by adding replicas, which adds cost and does not touch the cause.
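The quota arithmetic explains the latency spikes directly. The sketch below is a toy model, not the kernel's scheduler: it assumes a single-threaded burst that starts exactly on a period boundary, and the function name is hypothetical.

```python
def wall_clock_ms(cpu_needed_ms: float, limit_millicores: int, period_ms: int = 100) -> float:
    """Wall-clock time for a CPU burst under a CFS quota (single thread, starts on a boundary)."""
    if cpu_needed_ms <= 0:
        return 0.0
    quota_ms = limit_millicores * period_ms / 1000  # CPU time allowed per period
    if quota_ms >= period_ms:
        # One thread cannot exhaust a quota of a full core or more.
        return float(cpu_needed_ms)
    full_periods, remainder = divmod(cpu_needed_ms, quota_ms)
    if remainder == 0:
        # The final slice uses its quota exactly, so no trailing pause is counted.
        return (full_periods - 1) * period_ms + quota_ms
    return full_periods * period_ms + remainder


# A request that needs 50ms of CPU, with and without a 200m limit.
print(wall_clock_ms(50, 200))   # throttled twice before it finishes
print(wall_clock_ms(50, 1000))  # a quota this large never throttles one thread
```

In this model 50ms of work takes 210ms of wall-clock time under a 200m limit: two full 100ms periods, each with 20ms of running and 80ms of waiting, followed by the last 10ms.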

Watch the throttling directly, not CPU percent:

```promql
rate(container_cpu_cfs_throttled_periods_total[5m])
  / rate(container_cpu_cfs_periods_total[5m])
  # alert when this stays above ~0.05 (5% of periods throttled)
```

Sustained throttling above roughly 1 to 5% of periods is a strong red flag on a latency-sensitive service. The fix is to drop the CPU limit (or raise it well above request) while keeping the CPU request, which still gives the scheduler a fair share to pack against — you lose the throttle penalty, not the fairness.

Lever 2 — Autoscale in both directions

Once requests are honest, the cluster has to follow the work — up and down. Three tools, each owning a different transition:

- Horizontal Pod Autoscaler scales request-driven services from one replica to N. By default it never goes to zero.
- KEDA owns the zero-to-one transition and scales on queue depth, event rate and other signals an HPA cannot see, including scaling genuinely idle services to zero.
- Cluster Autoscaler or Karpenter makes nodes follow pods, in both directions.

Know the HPA defaults, because they shape its behaviour more than people realise: the control loop reconciles every 15 seconds, a 10% tolerance band suppresses tiny corrections, and — critically — scale-down waits a 300-second stabilisation window. Most teams tune scale-up aggressively and never touch the down path, so capacity added for a spike lingers far longer than the spike.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 3
  maxReplicas: 40
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # default; the knob most teams never touch
    scaleUp:
      stabilizationWindowSeconds: 0
```
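The 10% tolerance band is easiest to see in the replica calculation itself. This sketch follows the documented formula, desired = ceil(current x currentMetric / targetMetric), with the 65% target and the 3 to 40 replica bounds from the manifest; it ignores stabilisation windows, unready pods and missing metrics, and the function name is hypothetical.

```python
import math


def desired_replicas(current: int, current_util: float, target_util: float = 65,
                     min_replicas: int = 3, max_replicas: int = 40,
                     tolerance: float = 0.10) -> int:
    """One HPA reconcile step for a utilisation target (simplified)."""
    ratio = current_util / target_util
    if abs(ratio - 1.0) <= tolerance:
        return current  # inside the tolerance band: no change
    wanted = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(10, 68))  # ratio ~1.05, inside the band
print(desired_replicas(10, 90))  # scale up
print(desired_replicas(10, 30))  # scale down, applied only after the stabilisation window
```

The scale-down result is what the controller wants, not what it does immediately: with the default 300-second window it acts on the highest recommendation seen over the last five minutes.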

For queue and event workloads, KEDA is the lever, and its real prize is scale-to-zero. It polls each trigger on a pollingInterval (default 30s), waits a cooldownPeriod (default 300s) after the last active trigger, then scales to zero, handing the one-to-N range to a standard HPA it creates for you. The setting that unlocks idle savings is minReplicaCount: 0, which is KEDA's default but worth stating explicitly.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker
spec:
  scaleTargetRef:
    name: worker
  minReplicaCount: 0        # scale-to-zero when the queue is empty
  maxReplicaCount: 50
  pollingInterval: 30       # default
  cooldownPeriod: 300       # default; idle wait before going to zero
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: "https://sqs.eu-west-1.amazonaws.com/123456789012/jobs"
        queueLength: "20"
        awsRegion: "eu-west-1"
```
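How those settings interact can be sketched as a steady-state target. This is a simplified model under assumptions, not KEDA's implementation: it treats queueLength as the target number of messages per replica, ignores the HPA's own stabilisation behaviour, and uses a hypothetical function name.

```python
def target_replicas(queue_depth: int, idle_seconds: int = 0,
                    queue_length_target: int = 20, min_replicas: int = 0,
                    max_replicas: int = 50, cooldown_seconds: int = 300) -> int:
    """Steady-state replica target for a queue-driven worker (simplified)."""
    if queue_depth == 0 and idle_seconds >= cooldown_seconds:
        return min_replicas  # the trigger has been inactive for the whole cooldown
    wanted = -(-queue_depth // queue_length_target)  # integer ceiling
    # While active, or still inside the cooldown, keep at least one replica.
    return min(max_replicas, max(1, wanted))


print(target_replicas(0, idle_seconds=400))  # empty and cooled down
print(target_replicas(0, idle_seconds=100))  # empty, but still inside the cooldown
print(target_replicas(130))                  # 130 messages at 20 per replica
print(target_replicas(5000))                 # capped by maxReplicaCount
```

The cooldown is the cost knob here: a worker that empties its queue in seconds still holds one replica for the full five minutes on defaults.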

Lever 3 — Bin-pack onto fewer, fuller nodes

Right-sized requests are necessary but not sufficient. You also need the node layer to remove the slack — to notice when nodes are half-empty and consolidate the pods onto fewer, fuller machines. This is where the 40% CPU and 57% memory provisioned-versus-requested gap lives, and it is the part most teams never tune.

Figure: bin-packing. Consolidation in one picture: many nodes each running at ~10% become a handful of nearly-full nodes plus a Spot tier. Same workloads, far fewer machines.

Classic Cluster Autoscaler removes a node only when the CPU and memory requests on it fall below scale-down-utilization-threshold (default 0.5, i.e. 50% of allocatable) for scale-down-unneeded-time (default 10 minutes). Note the trap baked into that default: the threshold is measured on requests, not real usage, so if your requests are inflated the node never drops below 50% and is never reclaimed, which is exactly why requests come first. Karpenter replaces that model with continuous consolidation, which is both more aggressive and easier to reason about.
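The Cluster Autoscaler trap is worth seeing in numbers before moving on. The sketch below assumes utilisation is the larger of the CPU and memory request ratios against allocatable capacity; the node size, the pod figures and the function name are hypothetical, and the real autoscaler also requires the condition to hold for scale-down-unneeded-time and the pods to fit elsewhere.

```python
def is_scale_down_candidate(requested_cpu_m: int, requested_mem_mi: int,
                            alloc_cpu_m: int = 8000, alloc_mem_mi: int = 32768,
                            threshold: float = 0.5) -> bool:
    """True when request-based utilisation is below the scale-down threshold."""
    utilisation = max(requested_cpu_m / alloc_cpu_m, requested_mem_mi / alloc_mem_mi)
    return utilisation < threshold


# Same pods on the same node; requests inflated 2.5x versus honest.
inflated = is_scale_down_candidate(requested_cpu_m=4400, requested_mem_mi=12000)
honest = is_scale_down_candidate(requested_cpu_m=1760, requested_mem_mi=4800)

print("inflated requests, candidate for removal:", inflated)
print("honest requests, candidate for removal:", honest)
```

Real usage is identical in both cases. Only the requests changed, and that alone decides whether the node is ever considered for removal.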

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: stateless-spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]      # Spot first, On-Demand fallback
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["c6i", "c7i", "m6i", "m7i", "r6i", "r7i"]  # wide pool packs tighter
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized   # not just WhenEmpty
    consolidateAfter: 1m   # 5m steady / 15m bursty / 30s batch
```

The decisive setting is consolidationPolicy: WhenEmptyOrUnderutilized rather than the timid WhenEmpty. WhenEmpty only removes a node once every pod has already left; WhenEmptyOrUnderutilized actively repacks underused nodes and disrupts running pods to do it, gated by consolidateAfter so it does not thrash. Widening the instance choice matters too — the more families, sizes and architectures Karpenter may pick from, the tighter it can pack and the cheaper the node it lands on.

Tuning this is not a one-time toggle. Cast AI's own seven-day benchmark — vendor-run and self-favourable, so read it as a directional point rather than a neutral comparison — found basic consolidation cutting about 9% versus baseline, tuned consolidation about 16%, and their own autoscaler about 43%. The honest takeaway is not the headline number; it is that how you configure consolidation materially changes the result, so it is worth measuring rather than leaving on defaults. Watch it happen live while you tune:

```shell
eks-node-viewer --resources cpu,memory   # live node utilisation and hourly cost
```

Before: sprawl (~10% CPU)

- 5 nodes, each roughly 40% full on requests
- Requests inflated 2–3x over real usage
- Autoscaler only ever scaled up
- Every node a little idle, all the time

After: consolidated (fewer, fuller)

- 3 fuller nodes after right-size + consolidation
- Stateless tier shifted to Spot
- Karpenter WhenEmptyOrUnderutilized reclaims drift
- Roughly a third off the bill, governance left in place

Bin-packing is reliability-neutral: same workloads, fewer nodes, governance kept in place

Lever 4 — Then, and only then, Spot

Now — and only now — Spot pays. With requests honest and the fleet packed, the discount applies to a base that is already as small as you can make it.

Spot: the last lever, and a real one

| Figure | What it measures | Context |
| --- | --- | --- |
| 59% | Avg compute saved, mixed On-Demand + Spot | across thousands of clusters |
| 77% | Avg compute saved, Spot-only clusters | discount on whatever base remains |
| under 5% | Historical interruption rate | all Regions and instance types |
| 2 min | Interruption notice to drain | react to the signal, don't avoid it |

Source: Cast AI 2025 Kubernetes Cost Benchmark Report; AWS EC2 Spot best practices

The realised numbers are what to plan against: across thousands of clusters, mixing On-Demand and Spot saved about 59% of compute on average, and Spot-only clusters about 77%. AWS advertises "up to 90% off," but treat that as a marketing ceiling, not a planning figure. The reliability story is also better than its reputation: the historical interruption frequency across all Regions and instance types sits under 5%, and every instance gets a two-minute interruption notice. The engineering task is reacting to that signal cleanly, not avoiding Spot.

Reacting cleanly means a few specific things. Use AWS's price-capacity-optimized allocation (its recommended default) and allow many instance families, sizes and Availability Zones — a wider pool means a lower interruption rate and a tighter bin-pack. Let Karpenter own interruptions natively: since v0.19.3 it consumes the interruption signal from an SQS queue and pre-spins a replacement on the two-minute notice. Do not also run the AWS Node Termination Handler alongside it — they conflict, and NTH is only still needed for the Rebalance Recommendation signals Karpenter does not consume. Protect draining with a PodDisruptionBudget so a reclaim (or a consolidation pass) can never take the whole service at once, and keep anything stateful or genuinely uninterruptible on On-Demand.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web
spec:
  minAvailable: 80%
  selector:
    matchLabels:
      app: web
```
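A percentage budget behaves differently at different replica counts, which matters when the same service scales between 3 and 40 pods. The sketch below assumes every replica is healthy and that a percentage minAvailable is rounded up to a whole pod; the function name is hypothetical.

```python
def allowed_disruptions(replicas: int, min_available_percent: int = 80) -> int:
    """Voluntary disruptions a percentage-based PDB permits when all replicas are healthy."""
    required = (replicas * min_available_percent + 99) // 100  # integer ceiling
    return max(0, replicas - required)


for replicas in (3, 5, 10, 40):
    print(f"{replicas:>2} replicas -> {allowed_disruptions(replicas)} pod(s) may be evicted at once")
```

At 3 replicas the budget rounds up to 3 required pods and permits no voluntary evictions at all, so a drain or consolidation pass would stall on that service. It is worth checking a PDB against the HPA's minReplicas, not just the typical replica count.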

Spot in production, for real

This is not a dev-only trick. Two public engineering write-ups show it carrying production load.

Grover, the consumer-electronics rental company, migrated from Cluster Autoscaler to Karpenter on Amazon EKS specifically to push Spot hard in production while staying reliable. They run a broad NodePool (Nitro instances, large through 4xlarge, many families, Spot prioritised with On-Demand fallback), a 30-day node refresh for patching, and an SQS interruption queue for graceful draining. The result, per the AWS Containers Blog: 70–80% of production instances on Spot, about a 25% increase in Spot usage across all EKS clusters after adopting Karpenter, and a production cutover done in a single day with zero downtime by running Cluster Autoscaler and Karpenter in parallel during the transition.

Tinybird, a real-time analytics SaaS, runs a two-NodePool design on EKS: a stateful/critical pool on On-Demand under Savings Plans, and a stateless pool on Spot with automatic On-Demand fallback that, by their account, never had to trigger. With x86 and ARM across three AZs, Bottlerocket nodes, Karpenter itself on Fargate, and KEDA plus HPA scaling on Prometheus metrics, they report roughly a 20% reduction in their overall AWS bill, up to 90% saved on CI/CD workloads specifically, node provisioning under 45 seconds, and pods reaching ready about two minutes faster. A third example, Vorwerk, is cited in secondary roundups at around a 60% compute decrease across environments via Karpenter consolidation and instance-type swaps — treat that one as indicative rather than primary.

Make the spend legible before you optimise it

There is a fifth lever that is really a precondition: you cannot fix what no one owns. The CNCF respondents are emphatic on this — they rate team and individual awareness and self-discipline (68%) as the single best way to control overspend, ahead of tooling (48%). That is impossible without showing each team its own number.

Cost allocation is what turns "the cluster is expensive" into "this service is expensive, and that team owns it." OpenCost (the CNCF specification, with Kubecost as the commercial superset) attributes node cost down to namespace, controller and label. The most-cited cost-monitoring tool in the CNCF survey is AWS Cost Explorer (55%), followed by Kubecost (23%) and OpenCost (11%). Enforce namespaces and labels, attribute the cost, and put each service's spend in front of the team that owns it. Then leave that governance in place, because the bill only stays down if someone keeps watching their own number.
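The core idea of allocation fits in a few lines. This is a deliberately crude sketch, not OpenCost's model: it splits one node's hourly price by CPU requests only, and the namespaces, price and function name are hypothetical.

```python
def allocate_node_cost(node_hourly_usd: float, allocatable_cpu_m: int,
                       requests_by_namespace: dict[str, int]) -> dict[str, float]:
    """Split a node's hourly cost across namespaces by CPU request share; the rest is idle."""
    costs = {
        namespace: node_hourly_usd * cpu_m / allocatable_cpu_m
        for namespace, cpu_m in requests_by_namespace.items()
    }
    costs["idle"] = node_hourly_usd - sum(costs.values())
    return costs


costs = allocate_node_cost(
    node_hourly_usd=0.384,          # hypothetical node price
    allocatable_cpu_m=8000,
    requests_by_namespace={"payments": 3000, "search": 1000},
)

for owner, usd in costs.items():
    print(f"{owner:<10} ${usd:.3f}/h")
```

Surfacing the idle line is the point: half of this node's cost belongs to no team, and that number only shrinks through the requests, autoscaling and bin-packing work described earlier.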

The pitfalls that quietly cost you

Most of the damage comes from a small set of repeatable mistakes. Each has a specific fix.

| Pitfall | Why it backfires | The fix |
| --- | --- | --- |
| CPU limits on everything | CFS throttles in 100ms windows, so pods pause mid-request while the node sits idle; the result is tail-latency spikes teams misread as app bugs and "fix" with more replicas | Keep CPU requests, drop CPU limits on latency-sensitive services; alert on the throttling ratio, not CPU percent |
| Requests set to a one-off peak | The scheduler reserves that peak forever and bin-packs against fiction; this is the 10% CPU number | Set from observed p90/p95 plus headroom; let VPA/Goldilocks recommend and re-tune on a cadence |
| Memory limit too low, or absent | Too low and the kernel OOMKills the container (exit 137); absent and one leaky pod can starve the whole node | Always set a memory limit; for predictable workloads size it from p95 and alert at ~80% of it |
| Autoscaling that only scales up | Nodes added for a spike never leave; capacity that never contracts is a slower way to overpay | Tune the scale-down path: HPA stabilisation, CA scale-down-unneeded-time, Karpenter consolidateAfter |
| "Karpenter won't scale down" | Naked pods, kube-system pods with no PDB, local/hostPath storage, strict anti-affinity, or inflated requests that keep nodes looking too full to reclaim all pin nodes | Audit the documented blockers, add PDBs to system pods, avoid naked pods and local storage on candidates, fix requests first |
| Spot with no disruption engineering | One family in one AZ concentrates reclaim risk; with no PDB or interruption queue, drains outrun replacements and a cost win becomes an incident | price-capacity-optimized across many families and AZs, Karpenter native interruption (SQS), PDBs, stateful on On-Demand |
| Discounts before right-sizing | A 77% Spot saving on a 10%-utilised cluster still pays for ~90% air; you lock mis-sized capacity in at a discount | Hold the order: requests, autoscaling, bin-packing, then Spot |
| No per-team cost ownership | Savings that are everyone's job are no one's; CNCF respondents rank awareness (68%) as the top control | Allocate cost per namespace/label with OpenCost/Kubecost and put each team's number in front of it |

None of this is clever. We have taken roughly a third off an over-provisioned cloud bill doing exactly this kind of unglamorous tightening — honest requests, two-way autoscaling, consolidation, Spot for the right tier — and, just as important, left the cost allocation in place so it stays down. The patterns are not the hard part. The order, and the discipline to hold it, are.

What to remember

- The bill is a utilisation problem, not a price problem: the average cluster uses ~10% of CPU and ~23% of memory it reserves, and over-provisioning is the #1 named cause of overspend (70%).
- Hold the order (requests, autoscaling, bin-packing, then Spot) because each lever shrinks the base the next one discounts. The same Spot discount lands on a far lower bill when the cluster is fixed first.
- Set requests from observed p90/p95, keep CPU requests but drop CPU limits on latency-sensitive services, and always cap memory. CPU limits cause CFS throttling even on idle nodes.
- Tune the scale-DOWN path everyone skips (HPA stabilisation, KEDA scale-to-zero, Karpenter WhenEmptyOrUnderutilized), then move interruption-tolerant work to Spot with PDBs and native interruption handling.
- Make spend legible per team with OpenCost/Kubecost; CNCF practitioners rate awareness and discipline (68%) above tooling, and the savings only stick with an owner.