The average Kubernetes cluster uses 8% of the CPU it provisions. Not a typo: according to Cast.ai's 2026 State of Kubernetes Optimization Report, which measured tens of thousands of production clusters across AWS, GCP, and Azure, the median effective utilisation before any optimisation is roughly 8%. CPU overprovisioning increased year-over-year to 69%; the previous year's benchmark found 99.94% of measured clusters over-provisioned. Pod autoscaling—HPA, KEDA, VPA—gets the conference talks and the engineering attention. But pods that can't schedule don't help anyone, and idle nodes that never get reclaimed are where the bill compounds quietly. Node autoscaling is the lever most teams have half-configured: scale-up works, scale-down doesn't. This post is about understanding both tools deeply enough to make scale-down work correctly too.
How the scheduler creates demand for nodes
Before comparing tools, it's worth being precise about what they're reacting to.
When Kubernetes schedules a pod, it runs a two-phase binding decision. The filter phase eliminates every node that cannot satisfy the pod—insufficient CPU, insufficient memory, missing node selector labels, violated taints, or topology spread constraints. The score phase ranks the surviving nodes by a weighted set of criteria: bin-packing efficiency, resource balance, pod affinity, and others. If the filter phase eliminates all nodes—because none can satisfy the pod's requirements—the pod transitions to Pending with reason Unschedulable.
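The two phases can be sketched in a few lines. This is a toy model rather than the real kube-scheduler: the `Node` and `Pod` classes, the node names, and the single CPU-based score are assumptions made for illustration, and only CPU, memory, and node selectors are checked.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cpu_m: int          # unrequested CPU, in millicores
    free_mem_mi: int         # unrequested memory, in MiB
    labels: dict = field(default_factory=dict)


@dataclass
class Pod:
    name: str
    cpu_m: int
    mem_mi: int
    node_selector: dict = field(default_factory=dict)


def filter_nodes(pod, nodes):
    """Filter phase: keep only nodes that can satisfy the pod at all."""
    return [
        n for n in nodes
        if n.free_cpu_m >= pod.cpu_m
        and n.free_mem_mi >= pod.mem_mi
        and all(n.labels.get(k) == v for k, v in pod.node_selector.items())
    ]


def score_node(pod, node):
    """Score phase (toy bin-packing): prefer the node left with least free CPU."""
    return -(node.free_cpu_m - pod.cpu_m)


def schedule(pod, nodes):
    """Return the chosen node name, or None when the pod is Unschedulable."""
    feasible = filter_nodes(pod, nodes)
    if not feasible:
        return None  # the pod stays Pending; this is what autoscalers react to
    return max(feasible, key=lambda n: score_node(pod, n)).name


nodes = [
    Node("node-a", free_cpu_m=3000, free_mem_mi=8192),
    Node("node-b", free_cpu_m=600, free_mem_mi=2048),
]
small = Pod("small", cpu_m=500, mem_mi=512)
large = Pod("large", cpu_m=4000, mem_mi=1024)
print(schedule(small, nodes))  # node-b
print(schedule(large, nodes))  # None
```

The second pod is the interesting case: no amount of scoring helps once the filter phase returns an empty list, so the only remedy is a new node.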
Both the cluster autoscaler and Karpenter watch for Unschedulable pods and treat them as the trigger to add capacity. The difference is everything that follows: how fast, how flexible, and how well-matched the provisioned capacity is to what those pods actually need.
Scale-down is the mirror problem. A node can only be safely removed after its pods are evicted and rescheduled elsewhere. Both tools detect idle nodes and initiate drain-and-terminate sequences. The difference is that only Karpenter also performs active consolidation: proactively moving workloads off underutilised nodes to empty them sooner, rather than waiting for workloads to naturally depart.
The gap between HPA scaling pods up and a new node becoming Ready is where user-facing latency accumulates during traffic spikes. The gap between workloads shrinking and nodes actually terminating is where the monthly bill accumulates overnight, every weekend, and every time a batch job finishes. Getting both halves right requires understanding the internal loops of whichever tool you run—and understanding that neither tool can compensate for resource requests that don't reflect reality.
The cluster autoscaler: architecture and real limits
The cluster autoscaler has been production-stable since Kubernetes 1.4. It supports every major cloud provider and is the default choice on most managed Kubernetes services. Its mental model is deliberately simple: you pre-define node groups (AWS Auto Scaling Groups, GCP Managed Instance Groups, Azure VM Scale Sets), and the autoscaler grows or shrinks those groups by adjusting their desired capacity.
Scale-up in detail
The CA runs a scan loop on a configurable interval (10 seconds by default). On each pass it checks for Unschedulable pods. For each pending pod, it simulates scheduling against the current cluster plus a hypothetical new node from each eligible node group. Among the groups whose added node would make the pod schedulable, it picks one using the configured expander: random by default, with least-waste, most-pods, priority, and price (on providers that support it) as alternatives. Then it increments the group's desired capacity and waits.
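The scan-and-simulate step can be sketched roughly as follows. This is not CA's code: the `NodeGroup` fields, the group names and sizes, and the choice of "least wasted CPU" as the tie-breaker are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class NodeGroup:
    name: str
    cpu_m: int        # allocatable CPU of one node in this group, millicores
    mem_mi: int       # allocatable memory of one node, MiB
    size: int         # current desired capacity
    max_size: int


def pick_group(pod_cpu_m, pod_mem_mi, groups):
    """Simulate one new node per group; keep groups where the pod would fit.

    Among those, choose the group that wastes the least CPU, a stand-in
    for a least-waste style selection.
    """
    candidates = [
        g for g in groups
        if g.size < g.max_size
        and g.cpu_m >= pod_cpu_m
        and g.mem_mi >= pod_mem_mi
    ]
    if not candidates:
        return None  # no group helps; the pod stays Pending
    return min(candidates, key=lambda g: g.cpu_m - pod_cpu_m)


def scale_up(pod_cpu_m, pod_mem_mi, groups):
    group = pick_group(pod_cpu_m, pod_mem_mi, groups)
    if group is None:
        return None
    group.size += 1  # in a real cluster this is the desired-capacity bump
    return group.name


groups = [
    NodeGroup("general-4xl", cpu_m=15890, mem_mi=60000, size=3, max_size=10),
    NodeGroup("general-xl", cpu_m=3920, mem_mi=15000, size=5, max_size=5),
]
print(scale_up(500, 512, groups))  # general-4xl: the small group is at max size
```

Note that the 500m pod lands on a 16-vCPU shape simply because that is the only group with room to grow; the simulation can only choose among the shapes it was given.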
The wait is the bottleneck. An ASG resize request enters the cloud provider's instance provisioning queue. The new instance must boot its OS, execute user-data scripts (including kubelet and CNI setup), and register with the control plane. On AWS with a cold launch, this typically takes 3 to 4 minutes from resize request to node Ready. During those minutes, pending pods sit unscheduled.
Pre-warming via warm pools (AWS) or surge instances (GKE) can reduce this to under 60 seconds for pools of pre-provisioned instances, but warm pools add their own operational surface and cost—you're paying to keep standby instances running.
The node group constraint
Node groups are the fundamental architectural limit. You define instance types when creating the group. The autoscaler can only pick from those shapes. In practice this produces two failure modes in the same cluster:
- Oversized nodes: the only available groups contain large instances. A pod requesting 500m CPU and 512Mi memory triggers a 16-vCPU node launch, leaving most of that node idle.
- Type mismatch: a workload needs a memory-optimised instance; no memory-optimised node group exists; the pod stays pending indefinitely even after CA tries to scale up.
The bin-packing scorer inside CA is competent, but it can only work within the solution space you gave it. If your node group inventory is stale, too coarse, or was chosen for a different workload pattern, the scorer optimises within a constrained reality.
A common workaround is creating many node groups—one per instance type you might want. This works but introduces operational drag: each group needs its own configuration, IAM setup, and lifecycle management. Teams running 12 or more node groups for Spot diversification alone are common, and every additional group is another thing to audit, update, and rotate AMIs for.
Karpenter: architecture and how it differs
Karpenter (v1.x; the v1 API went stable in 2024, and the core project is maintained under Kubernetes SIG Autoscaling) replaces the node group abstraction entirely. Instead of managing pre-defined groups, it talks directly to the cloud provider's instance fleet API and makes per-pod provisioning decisions in real time.
The core CRDs are NodePool and NodeClaim (the v1 API, which replaced the earlier Provisioner CRD). A NodePool describes what instances are acceptable: instance families, CPU architectures, availability zones, capacity types (Spot vs on-demand), and any other node properties. Karpenter watches for Unschedulable pods, evaluates which NodePool can satisfy them, selects the specific instance type that best fits the pending workload, and calls the EC2 Fleet API directly to launch it.
- 01 Pod goes Pending: Scheduler marks pod Unschedulable; no existing node satisfies the filter phase.
- 02 Karpenter evaluates: Reads the pending pod's requests, node affinity, tolerations, and topology spread constraints. Finds the lowest-cost instance type across all NodePool constraints that satisfies them.
- 03 Fleet API call: Calls EC2 Fleet API (or equivalent GCP/Azure provider API) directly, with no ASG intermediary, requesting exactly the instance type selected. A NodeClaim object is created to track the expected node.
- 04 Node joins cluster: Instance boots, kubelet registers with the control plane. The node is typically Ready in 45-90 seconds. The scheduler binds the pending pod.
- 05 Consolidation loop: Continuously evaluates whether workloads on underutilised nodes can be rescheduled onto fewer nodes. When an opportunity is found and disruption budgets allow, Karpenter cordons, drains, and terminates the emptied node.
Source: Karpenter documentation, karpenter.sh, 2025
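Step 02 is essentially a constrained minimum-price search. A minimal sketch, assuming a tiny hand-written catalogue (the prices are illustrative, and `cheapest_fit` is a hypothetical helper, not Karpenter's API); it ignores kubelet and system overhead, tolerations, and topology:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    category: str        # "c", "m", or "r"
    cpu_m: int           # millicores
    mem_mi: int          # MiB
    price_per_hour: float


# Illustrative catalogue; prices are assumed, not fetched from any API.
CATALOGUE = [
    InstanceType("c5.large", "c", 2000, 4096, 0.085),
    InstanceType("m5.large", "m", 2000, 8192, 0.096),
    InstanceType("m5.xlarge", "m", 4000, 16384, 0.192),
    InstanceType("r5.large", "r", 2000, 16384, 0.126),
]


def cheapest_fit(pending_pods, allowed_categories, catalogue=CATALOGUE):
    """Pick the lowest-priced type that holds all pending pods' summed requests."""
    need_cpu = sum(cpu for cpu, _ in pending_pods)
    need_mem = sum(mem for _, mem in pending_pods)
    fits = [
        it for it in catalogue
        if it.category in allowed_categories
        and it.cpu_m >= need_cpu
        and it.mem_mi >= need_mem
    ]
    return min(fits, key=lambda it: it.price_per_hour, default=None)


pods = [(500, 512), (1000, 6144)]   # (cpu millicores, memory MiB) requests
choice = cheapest_fit(pods, {"c", "m", "r"})
print(choice.name)  # m5.large
```

Narrowing `allowed_categories` to `{"c"}` makes the same call return `None`, which is the single-family restriction problem in miniature.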
The direct Fleet API call is why Karpenter consistently outperforms CA on scale-up latency. Production benchmarks from ScaleOps (2026) and chkk.io (2024) report Karpenter bringing nodes to Ready in roughly 45 to 90 seconds for cold launches, versus 3 to 4 minutes for CA with standard ASG launches. For burst workloads—CI runners, ML inference jobs, or morning traffic ramps—that difference is the gap between hitting SLO and missing it.
Consolidation: the more important feature
Faster scale-up gets attention, but consolidation is where sustained cost reduction lives.
Karpenter's consolidation controller runs continuously. It simulates whether the workloads on a given node could be rescheduled onto other existing nodes without violating any constraints. When it finds a consolidation opportunity and Pod Disruption Budgets allow it, it cordons the target node (marking it unschedulable to new pods), drains it (evicting existing pods), and deletes the node object. The cloud instance terminates.
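The simulation at the heart of that loop can be approximated with a first-fit-decreasing check. This is a simplified sketch under stated assumptions: CPU requests only, hypothetical node names, and none of the memory, affinity, topology, or PDB checks a real controller performs.

```python
def can_consolidate(candidate, others):
    """Return True if every pod on `candidate` fits on the other nodes.

    candidate: list of pod CPU requests (millicores) on the node to remove.
    others: dict of node name -> free CPU (millicores) on remaining nodes.
    """
    free = dict(others)  # copy so the simulation does not mutate the input
    for pod_cpu in sorted(candidate, reverse=True):
        target = next((n for n, f in free.items() if f >= pod_cpu), None)
        if target is None:
            return False
        free[target] -= pod_cpu
    return True


print(can_consolidate([500, 250], {"node-b": 600, "node-c": 300}))   # True
print(can_consolidate([500, 500], {"node-b": 600, "node-c": 300}))   # False
```

Because the check runs on requests, inflated requests make it return False for nodes that are, by actual usage, nearly empty.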
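In high-churn environments this is significant. A cluster that runs 40 nodes under peak morning load might consolidate to 14 nodes by 2 AM if Karpenter is running an active consolidation policy. CA, in the same cluster, typically sheds far fewer: it removes a node only after that node's requested CPU and memory have already fallen below the scale-down threshold (50% by default) and every pod on it fits elsewhere, so a fleet of increasingly idle but still half-requested nodes stays up. If CA left all 40 running, the difference at $0.192/hour per m5.xlarge would be 26 nodes x $0.192 x 730 hours, or roughly $3,600/month if the consolidated size held around the clock, and proportionally less if it holds only off-peak.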
Head-to-head: what the comparison actually looks like
Cluster Autoscaler
- Node groups (ASG/MIG/VMSS) pre-defined; instance types fixed at group creation
- Scale-up via ASG resize request; cold launch 3-4 minutes to Ready
- No active consolidation; removes a node only once its requests have dropped below the scale-down threshold (50% by default) on their own
- Bin-packing bounded by available node group shapes
- Every major cloud provider; mature, well-understood failure modes
- Spot diversification requires a separate node group per instance type
Karpenter
- NodePools define constraints; instance type selected per-pending-pod at runtime
- Direct Fleet API call; node Ready in ~45-90 seconds
- Active consolidation: repacks workloads and terminates underutilised nodes continuously
- Selects cheapest fitting instance from hundreds of available shapes
- AWS first-class; GCP and Azure via community providers (karpenter-provider-gcp, karpenter-provider-azure)
- Spot diversification in a single NodePool; Fleet API handles instance fallback automatically
| Dimension | Cluster Autoscaler | Karpenter |
|---|---|---|
| Node type selection | Fixed at group creation | Per-pod, from NodePool constraints |
| Scale-up latency (cold launch) | 3-4 min (ASG) | 45-90 sec (Fleet API) |
| Scale-down mechanism | Idle detection only | Idle detection + active consolidation |
| Spot diversification | One node group per instance type | Single NodePool; Fleet API handles fallback |
| Interruption handling | Lifecycle hook + ASG | Native SQS interruption handler |
| Cloud provider support | All major providers | AWS native; GCP/Azure community |
| Project status | Kubernetes SIG Autoscaling; long established | Kubernetes SIG Autoscaling; v1 API since 2024 |
| Operational complexity | Low | Medium |
Choose CA when: your cluster is small and workloads are predictable, you need cloud-agnostic support, or you're on a provider where Karpenter community support is immature. CA is also the right default if your team lacks the bandwidth to tune disruption budgets correctly—misconfigured Karpenter consolidation causes avoidable evictions.
Choose Karpenter when: you have high-variability workloads, meaningful Spot spend, or a cluster where idle capacity accumulates between traffic peaks. The consolidation loop alone typically justifies the migration cost for clusters with more than a few hundred nodes or a bill north of $20k/month in compute.
Resource requests: the prerequisite that breaks both tools
The Cast.ai 2026 data makes the stakes concrete: average CPU utilisation at 8% implies that provisioned capacity, and the requests that drive it, sit at roughly 12x actual use across the typical cluster. This happens because setting conservative requests feels safe, and in one sense it is: an under-requested pod risks being starved of CPU when its node gets busy. But the scheduler and the autoscaler both treat requests as fact. If pods collectively request 80 vCPU on a node, the scheduler won't place more pods onto that node even if actual use is 6 vCPU, and the autoscaler adds capacity instead. The node is "full" by request accounting while sitting largely idle.
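The arithmetic behind both claims is short enough to check directly. The node figures in the second call are hypothetical, chosen only to show a node that is nearly idle by usage yet full by requests.

```python
def overprovision_factor(utilisation):
    """How many times larger provisioned capacity is than actual use."""
    return 1 / utilisation


def node_accepts(pod_request_m, requested_m, allocatable_m):
    """Scheduler-style check: only requests matter, not live usage."""
    return requested_m + pod_request_m <= allocatable_m


print(round(overprovision_factor(0.08), 1))  # 12.5
# Hypothetical 4-vCPU node: 3900m allocatable, 3800m requested, ~300m in use.
print(node_accepts(500, requested_m=3800, allocatable_m=3900))  # False
```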
Getting requests honest without causing incidents
Start with VPA in recommendation mode. It records actual usage over a configurable history window and produces ContainerRecommendations without changing anything in the cluster:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"
After several days of representative traffic, read the recommendations:
kubectl get vpa api-server-vpa -n production \
  -o jsonpath='{.status.recommendation.containerRecommendations[0]}'
The target field gives a data-driven request that VPA computed from observed usage. The lowerBound and upperBound fields give you a confidence interval. Set requests to the target value, adjust limits to 2x requests as a starting point, deploy to staging, observe for a cycle, then promote.
A practical heuristic that holds across most web-tier workloads: set CPU requests at your 90th-percentile 5-minute average and CPU limits at 2-3x that. Set memory requests at your 95th-percentile usage and memory limits equal to requests plus a 20% buffer. Memory OOMKills are painful and silent; a slight memory overprovision is usually the right trade. CPU throttling is recoverable and visible in metrics; CPU oversizing at 12x is not recoverable from a cost perspective.
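As a rough sketch of that heuristic, the helper below turns usage samples into suggested values. The sample data is invented, `suggest_resources` is a hypothetical name, and the nearest-rank percentile is one of several reasonable definitions; integer arithmetic is used throughout to avoid float rounding surprises.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 1-100."""
    ordered = sorted(samples)
    rank = max(1, ceil_div(pct * len(ordered), 100))
    return ordered[rank - 1]


def suggest_resources(cpu_m_samples, mem_mi_samples):
    cpu_request = percentile(cpu_m_samples, 90)
    mem_request = percentile(mem_mi_samples, 95)
    return {
        "cpu_request_m": cpu_request,
        "cpu_limit_m": cpu_request * 2,                   # low end of 2-3x
        "mem_request_mi": mem_request,
        "mem_limit_mi": ceil_div(mem_request * 120, 100),  # request plus 20%
    }


cpu = [120, 150, 180, 200, 210, 230, 250, 260, 300, 640]   # 5-minute averages, millicores
mem = [400, 410, 415, 420, 430, 435, 440, 450, 460, 500]   # MiB
print(suggest_resources(cpu, mem))
# {'cpu_request_m': 300, 'cpu_limit_m': 600, 'mem_request_mi': 500, 'mem_limit_mi': 600}
```

The single 640m spike does not drag the CPU request up, while the memory request deliberately tracks the high end of observed usage.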
Only after requests are within roughly 1.5-2x of actual use does Karpenter's consolidation and CA's bin-packing produce meaningful efficiency gains. Without accurate requests, both tools are doing arithmetic on fiction.
Karpenter NodePool configuration: the knobs that matter
A minimal NodePool for mixed on-demand and Spot across current-generation compute instances:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        # The v1 API references the node class by group, not apiVersion.
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["3"]
  disruption:
    # v1 renamed WhenUnderutilized to WhenEmptyOrUnderutilized.
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
    budgets:
      - nodes: "10%"
      - schedule: "0 8 * * 1-5"
        duration: 4h
        nodes: "0"
The disruption.budgets section is where most operators run into trouble in production. If you specify no budgets, Karpenter applies a single default of nodes: 10% with no schedule, which means it may consolidate at any time, including during peak business hours when consolidation-triggered evictions will affect live traffic. The schedule budget above freezes voluntary disruption from 08:00 to 12:00 on weekdays, using a standard cron expression that Karpenter evaluates in UTC. Adjust the schedule and duration to match your actual traffic pattern.
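When tuning a schedule it helps to sanity-check which moments fall inside the freeze. The sketch below hard-codes the window from the example (08:00 start, 4 hours, Monday to Friday) rather than parsing cron, and assumes the schedule is interpreted in UTC; `consolidation_frozen` is a hypothetical helper, not part of Karpenter.

```python
from datetime import datetime, timedelta, timezone

# Mirrors the budget in the example: cron "0 8 * * 1-5" with duration 4h.
WINDOW_START_HOUR = 8
WINDOW_DURATION = timedelta(hours=4)
WINDOW_WEEKDAYS = {0, 1, 2, 3, 4}   # Python weekday(): Monday=0 ... Friday=4


def consolidation_frozen(now):
    """True if `now` (timezone-aware) falls inside the nodes: "0" window."""
    now = now.astimezone(timezone.utc)
    start = now.replace(hour=WINDOW_START_HOUR, minute=0, second=0, microsecond=0)
    in_window = start <= now < start + WINDOW_DURATION
    return now.weekday() in WINDOW_WEEKDAYS and in_window


monday_peak = datetime(2025, 1, 6, 9, 30, tzinfo=timezone.utc)
monday_noon = datetime(2025, 1, 6, 12, 0, tzinfo=timezone.utc)
print(consolidation_frozen(monday_peak))  # True
print(consolidation_frozen(monday_noon))  # False
```

Passing a datetime in your local timezone shows quickly whether the frozen window actually covers your traffic peak or lands several hours off.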
consolidateAfter: 1m tells Karpenter to wait until a node has gone one minute without a pod being added or removed before it acts on a consolidation opportunity there. Too short and you get churn: nodes terminating while workloads are between bursts. Too long and idle capacity lingers. For most web-tier deployments, 1-5 minutes is the right range. For overnight batch workloads where rapid consolidation is the goal, 30 seconds or less is reasonable.
The instance-category constraint spans general-purpose (m), compute-optimised (c), and memory-optimised (r) families. Allowing all three lets Karpenter choose the right shape for each pending pod. Restricting to a single family reintroduces the node group problem: the autoscaler optimises within a constrained solution space.
The EC2NodeClass specifies the AMI family, IAM role, subnets, and security groups:
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2023
  # The v1 API requires amiSelectorTerms. Pin a specific version instead of
  # "latest" if you want AMI changes to be deliberate.
  amiSelectorTerms:
    - alias: al2023@latest
  role: KarpenterNodeRole-my-cluster
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 50Gi
        volumeType: gp3
        encrypted: true
The karpenter.sh/discovery tag is how Karpenter finds the right subnets and security groups for your cluster. Set this tag on the relevant AWS resources during cluster setup. Missing it is the most common reason Karpenter fails silently on first deployment.
Spot instances: where Karpenter genuinely earns its keep
Spot instances typically run 60-90% cheaper than equivalent on-demand capacity. AWS publishes Spot interruption frequency data by instance type; well-diversified fleets across multiple instance families see interruption rates averaging around 5% over a 3-month window, according to published AWS Spot Instance Advisor data.
CA can use Spot, but diversification costs node groups. CA assumes every instance in a group has the same CPU and memory, so even with a mixed-instances ASG each distinct shape needs its own group, and many teams simply run one group per instance type. A team running c5.xlarge, c5a.xlarge, m5.xlarge, and m5a.xlarge as Spot candidates ends up with up to four node groups. Add availability zones and that multiplies further. Managing 12+ node groups for a single Spot pool is operationally tedious and error-prone: AMI rotations, IAM updates, and launch template changes must be applied to each group individually.
Karpenter handles this in a single NodePool with a flexible requirements block. When it calls the Fleet API, it passes a prioritised list of instance types that satisfy the pending pod. If the first-choice Spot instance is unavailable or interrupted, the Fleet API selects the next option from the list. This decision happens inside the API call rather than through a retry loop, which means Karpenter's response to a Spot interruption is measured in seconds rather than the minutes required for an ASG to cycle through its retry logic.
The SQS-based interruption handler is a prerequisite for running Spot in production with Karpenter. Configure it via the Helm chart:
helm upgrade karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace kube-system \
  --set "settings.interruptionQueue=my-cluster-karpenter"
This handler subscribes to EC2 Spot interruption notifications via EventBridge and SQS. AWS publishes interruption notices 2 minutes before termination. Karpenter uses that window to cordon and drain the node, allowing pods to reschedule before the instance disappears. Without this handler, Spot interruptions result in hard pod kills.
Average Kubernetes CPU utilisation across production clusters: 8%. The other 92% is provisioned headroom your bill pays for whether workloads use it or not.
A worked example: what this costs at real scale
Take a cluster that is representative of what the Cast.ai data describes. Forty nodes, m5.xlarge (4 vCPU / 16 GB RAM), running in us-east-1. On-demand price: approximately $0.192 per hour.
Baseline (no consolidation, 8% CPU utilisation):
40 nodes at $0.192/hr x 730 hours = approximately $5,606/month
At 8% utilisation, the actual workload CPU demand corresponds to 12.8 vCPU on average across the cluster's 160 vCPU, or roughly 3.2 equivalent nodes of actual compute. Even with generous headroom at a 3x peak-to-average ratio (about 10 nodes) plus room for system pods and zone spread, a well-configured cluster should run this workload on 12-15 nodes.
After right-sizing requests and enabling Karpenter consolidation (on-demand):
14 nodes at $0.192/hr x 730 = approximately $1,962/month. Savings: $3,644/month (65%).
After adding Spot diversification (70% Spot, 30% on-demand):
10 Spot nodes at approximately $0.058/hr (average m5.xlarge Spot across AZs) + 4 on-demand at $0.192/hr = approximately $984/month. Savings from baseline: 82%.
These are not marketing projections—they are arithmetic applied to published instance prices and the Cast.ai utilisation figures. The actual reduction depends on workload variance, PDB constraints, the aggressiveness of consolidation tuning, and how much headroom you need to absorb traffic spikes without latency impact. A realistic target for a team making a focused pass at this problem is a 40-65% reduction in node compute spend. The 80%+ range is achievable for workloads with low variance and long off-peak periods.
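The three scenarios are easy to recompute with your own node counts and prices. A minimal sketch using the figures from the example; the $0.058 Spot price is the article's assumed average, and `monthly_cost` is a hypothetical helper:

```python
HOURS_PER_MONTH = 730
ON_DEMAND = 0.192   # $/hr, m5.xlarge on-demand figure used in the example
SPOT = 0.058        # $/hr, assumed average Spot price from the example


def monthly_cost(on_demand_nodes=0, spot_nodes=0):
    """Monthly compute cost in dollars for a mix of on-demand and Spot nodes."""
    hourly = on_demand_nodes * ON_DEMAND + spot_nodes * SPOT
    return hourly * HOURS_PER_MONTH


baseline = monthly_cost(on_demand_nodes=40)
consolidated = monthly_cost(on_demand_nodes=14)
with_spot = monthly_cost(on_demand_nodes=4, spot_nodes=10)

scenarios = [("baseline", baseline), ("consolidated", consolidated), ("with spot", with_spot)]
for label, cost in scenarios:
    saving = 1 - cost / baseline
    print(f"{label:<13} ${cost:>8,.0f}/month  saving {saving:.0%}")

# Actual CPU in use at 8% utilisation on 40 nodes of 4 vCPU each.
used_vcpu = 40 * 4 * 0.08
print(f"CPU in use: {used_vcpu:.1f} vCPU, about {used_vcpu / 4:.1f} nodes")
```

Swapping in your own region's prices and node counts gives a first-order estimate before any tooling changes are made.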
- 8%: average CPU utilization before optimization (2026)
- 69%: CPU over-provisioning rate, a year-over-year increase
- 45-90s: Karpenter scale-up, vs 3-4 min for CA
- 99.94%: clusters over-provisioned (2025 benchmark cohort)
Source: Cast.ai, 2025 Cost Benchmark & 2026 State of Kubernetes Optimization reports; ScaleOps, 2026
Real-world pitfalls and how to avoid them
Pitfall 1: Consolidation evicting slow-starting workloads
Karpenter respects Pod Disruption Budgets but does not know your application's startup latency. If a pod takes 3 minutes to reach Ready and Karpenter initiates a consolidation event while a previous replica is still initialising, you can briefly run below your target replica count.
Fix: set a PDB with minAvailable matching your SLO, and configure startupProbe with realistic failureThreshold and periodSeconds values that reflect actual startup time. Karpenter will not drain a node if doing so would violate the PDB.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
  namespace: production
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api-server
Pitfall 2: Topology spread constraints blocking expected consolidation
If your pods require spread across 3 availability zones and your workload scales down to 3 pods, Karpenter cannot consolidate below 3 nodes. This is correct behaviour, not a bug—but it surprises teams expecting heavier consolidation than the constraints allow.
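The reason is visible in the skew rule itself. The sketch below applies the check for a single zone constraint with DoNotSchedule semantics: a placement is allowed only if the target zone's count after placement, minus the smallest count across zones, stays within maxSkew. The zone names are hypothetical and the function is an illustration, not scheduler code.

```python
def placement_allowed(zone_counts, target_zone, max_skew=1):
    """DoNotSchedule check for one zone-spread constraint.

    zone_counts: matching pods currently running in each eligible zone.
    """
    global_min = min(zone_counts.values())
    return zone_counts[target_zone] + 1 - global_min <= max_skew


# Three replicas, one per zone. Draining the zone-a node evicts its pod:
after_eviction = {"zone-a": 0, "zone-b": 1, "zone-c": 1}
print(placement_allowed(after_eviction, "zone-b"))  # False
print(placement_allowed(after_eviction, "zone-a"))  # True
```

The evicted pod can only go back to zone-a, so a node in zone-a has to exist and the cluster cannot drop below three nodes.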
Fix: use whenUnsatisfiable: ScheduleAnyway for advisory spread preferences. Reserve DoNotSchedule for workloads where zone isolation is a genuine availability requirement. Audit your topology spread constraints before debugging consolidation behaviour.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: api-server
Pitfall 3: Missing terminationGracePeriodSeconds for long-running workloads
Kubernetes sends SIGTERM during pod eviction and force-kills after terminationGracePeriodSeconds (default 30 seconds). For pods handling database transactions, in-flight HTTP requests, or batch segments, 30 seconds may not be sufficient to finish cleanly.
Fix: set terminationGracePeriodSeconds to the actual drain time for your workload. For long-running batch jobs, use a preStop lifecycle hook to signal the job to stop accepting new work units before the SIGTERM arrives.
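The grace period only helps if the application reacts to SIGTERM. A minimal sketch of a worker that stops picking up new units once the signal arrives and hands back what it did not process; `run_worker` and its arguments are hypothetical, and a real service would also need to drain network connections.

```python
import signal
import threading

shutting_down = threading.Event()


def handle_sigterm(signum, frame):
    """Kubernetes sends SIGTERM first; flag it and let the loop wind down."""
    shutting_down.set()


# Signal handlers can only be installed from the main thread.
if threading.current_thread() is threading.main_thread():
    signal.signal(signal.SIGTERM, handle_sigterm)


def run_worker(work_units, process):
    """Process units until done or until SIGTERM arrives; return leftovers."""
    pending = list(work_units)
    while pending and not shutting_down.is_set():
        process(pending.pop(0))   # finish the current unit before re-checking
    return pending                # unprocessed units to hand back to the queue


leftover = run_worker(["unit-1", "unit-2"], print)
print(leftover)  # []
```

terminationGracePeriodSeconds then needs to cover the longest single unit of work, since that is the most the loop will run after the flag is set.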
Pitfall 4: Over-restricting NodePool to a single instance type
Teams migrating from CA sometimes carry over the "one node group = one instance type" pattern into Karpenter NodePool requirements. This defeats the primary architectural advantage. A NodePool restricted to m5.xlarge only will never select a c5.large when that is the right fit for a compute-bound pending pod.
Fix: allow multiple instance categories and generations. Start broad (c, m, r; generation Gt 3) and narrow only if specific workloads have hard hardware requirements. Let Karpenter's bin-packing choose the cheapest satisfying type.
Pitfall 5: CA's skip-nodes-with-system-pods flag
CA's default configuration (--skip-nodes-with-system-pods=true) prevents scale-down of any node running a kube-system pod that is not a DaemonSet or mirror pod and has no PodDisruptionBudget. DaemonSets such as aws-node and kube-proxy are ignored by this check, but Deployments such as CoreDNS and metrics-server are not: every node they land on is pinned, and because they drift across the cluster over time, a surprising share of nodes ends up ineligible for scale-down. The cluster grows but rarely shrinks.
Fix: give those kube-system Deployments a PodDisruptionBudget, or set --skip-nodes-with-system-pods=false. For pods that CA still refuses to move, annotate the ones that are safe to evict:
kubectl patch deployment coredns -n kube-system \
  -p '{"spec":{"template":{"metadata":{"annotations":{"cluster-autoscaler.kubernetes.io/safe-to-evict":"true"}}}}}'
Apply this annotation only to kube-system workloads that don't hold stateful data or external locks. The CA will then evaluate actual workload pods when deciding whether a node is safe to remove.
Pitfall 6: No warm pool strategy for latency-sensitive scale-up
Even Karpenter's 45-90 second scale-up can be too slow for applications with strict cold-start SLOs. If a traffic burst hits and no nodes are available, 60 seconds of pending pods translates to 60 seconds of degraded response times or dropped requests.
Fix: use overprovisioning at a controlled level. Deploy a "placeholder" Deployment of pause pods under a dedicated low-priority PriorityClass (a negative value is typical; do not reuse system-cluster-critical, which is among the highest priorities). When real workloads arrive, they preempt the placeholder pods, which go Pending and trigger Karpenter to provision the new capacity that actual traffic then uses. This keeps one or two spare nodes' worth of headroom available at a bounded, deliberate cost rather than leaving idle capacity to accumulate unplanned.
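The preemption mechanic that makes placeholders work can be sketched as follows. This is a simplified single-node model on CPU only; the pod names, the priority values, and `place_with_preemption` are assumptions for illustration, not scheduler behaviour in full.

```python
def place_with_preemption(capacity_m, running, incoming):
    """Try to fit `incoming` on a node, evicting lower-priority pods if needed.

    running: list of (name, priority, cpu_m) tuples already on the node.
    incoming: (name, priority, cpu_m) for the new pod.
    Returns (fits, evicted_names); nothing is evicted if the pod cannot fit.
    """
    _name, priority, cpu_m = incoming
    free = capacity_m - sum(cpu for _, _, cpu in running)
    evicted = []
    # Consider victims from lowest priority upward, never equal or higher.
    for victim_name, victim_priority, victim_cpu in sorted(running, key=lambda p: p[1]):
        if free >= cpu_m or victim_priority >= priority:
            break
        free += victim_cpu
        evicted.append(victim_name)
    if free < cpu_m:
        return False, []
    return True, evicted


running = [("placeholder-1", -10, 1500), ("api-1", 0, 2000)]
print(place_with_preemption(4000, running, ("api-2", 0, 1500)))
# (True, ['placeholder-1'])
```

The evicted placeholder is what then sits Pending and prompts the autoscaler to add a node, so real traffic is served immediately while replacement headroom is provisioned in the background.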