Autoscaling for traffic spikes: beyond a single HPA

The platforms that fall over on launch day usually had autoscaling. They had one Horizontal Pod Autoscaler, tuned against a steady-state baseline and left on default scale-up behavior, meeting a step-change in load it was never tested against. The failure is not surprising once you trace the event chain: HPA waits for the CPU metric to climb past threshold, requests new pods, finds no allocatable node capacity, waits for a node to provision, waits for the container image to pull, and waits for the readiness probe. By the time the first new pod enters service, five to eight minutes may have elapsed. That window is your outage.

Spike resilience is not a single setting. It is four interlocking layers — pod scaling, pod right-sizing, node provisioning, and event-driven scaling — each covering the blind spot of the one above it, plus the baseline defenses of caching and load shedding, plus the discipline of actually testing the full stack before the spike arrives. This post works through each layer in technical detail, flags the specific pitfalls that bite teams in production, and ends with a decision framework for which layers apply to which workload types.

Kubernetes cluster utilization, industry average (Source: Cast.ai State of Kubernetes Optimization, 2026)

Average CPU utilization across sampled clusters: ~8%
Average memory utilization across sampled clusters: 20%
CPU overprovisioning rate: 69%, a year-on-year increase from 40%

The figures above are consistent across Cast.ai's annual benchmarks, which sample thousands of clusters across AWS, GCP, and Azure. The gap between what is provisioned and what is used is the single most important framing for spike handling: if your baseline pods are over-provisioned, your HPA is scaling a padded baseline. You are paying for capacity you never use on quiet days, and the inflated request values make utilization-based metrics sluggish — meaning HPA reacts later than it should when load actually arrives.

Layer 1 — HPA tuned for the shape of your load

HPA is necessary but not sufficient. The problems with a default HPA configuration under spike conditions are specific and fixable.

The signal problem. CPU utilization is a lagging indicator. By the time your pods are saturated, user-visible symptoms — high latency, rising error rates — have been present for tens of seconds. For spike workloads, a better signal is requests-per-second from a Prometheus adapter, or P95 latency from a custom metrics source. These respond to load onset faster and allow HPA to act before pods are overwhelmed rather than after.

The scale-up speed problem. The default scale-up policy adds 4 pods or doubles the replica count every 15 seconds, whichever is larger, with no scale-up stabilization window. For gradual ramp traffic this is acceptable. For a step-change (a product launch, a marketing email drop, a viral moment) you need the doubling behavior to persist through the surge, and you want the scale-up behavior written out explicitly, stabilization window included, so that nobody dampens it by accident and HPA reacts to the first recommendation above threshold rather than a windowed one:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 80
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "500"      # target: 500 RPS per pod
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0      # act immediately on first reading above threshold
      policies:
        - type: Percent
          value: 100                     # double pod count every window
          periodSeconds: 30
        - type: Pods
          value: 10                      # or add 10 pods — whichever is larger
          periodSeconds: 30
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300    # 5-minute window before scaling down
      policies:
        - type: Percent
          value: 20
          periodSeconds: 60             # shed at most 20% of pods per minute on the way down

Three things to note here. First, stabilizationWindowSeconds: 0 on scale-up restates the default explicitly: HPA acts on the first recommendation that exceeds threshold instead of waiting out a window of earlier, lower recommendations. Second, selectPolicy: Max takes the larger of the two scale-up policies, so a large fleet doubles while a small fleet can add up to 10 pods at once. Both policies use a 30-second period, so a large fleet doubles half as often as it would under the 15-second default; drop periodSeconds to 15 if your load test shows the ramp falling behind. Third, minReplicas: 3 is deliberate: a user-facing service at 2 replicas loses 50% of its capacity during a single pod restart; at 3 replicas the loss is 33% and the remaining pods can typically absorb the difference.
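
The manifest above assumes that http_requests_per_second is already served through the custom metrics API. One common way to get there is a prometheus-adapter rule that converts a request counter into a per-pod rate. The sketch below assumes the application exports a counter named http_requests_total with namespace and pod labels; that counter name and the 1-minute rate window are illustrative assumptions, not something the manifest above dictates.

```yaml
# prometheus-adapter rules file (sketch).
# Assumption: the app exports http_requests_total with namespace and pod labels.
rules:
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}   # map the label to the Namespace resource
        pod: {resource: "pod"}               # map the label to the Pod resource
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"                  # exposed as http_requests_per_second
    # Per-pod request rate over a 1-minute window
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[1m])) by (<<.GroupBy>>)'
```

A shorter rate window reacts faster but is noisier; the window should be at least a few scrape intervals long or the rate will have too few samples to compute.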

The maxReplicas ceiling problem. Teams set maxReplicas conservatively and rarely revisit it. On launch day, HPA hits the ceiling while load keeps climbing. Treat maxReplicas as a circuit-breaker, not a steady-state target — size it at the volume you would genuinely be willing to serve at full compute cost, and verify that your namespace ResourceQuota and available node capacity can actually support that number. Silent quota failures are covered in the pitfalls section below.

Layer 2 — VPA in recommendation mode: right-sizing what HPA scales

The Vertical Pod Autoscaler (VPA) is the least-deployed component in the autoscaling stack and one of the highest-leverage in preparation. Running it in Recommendation mode against real workloads for two to four weeks before a major traffic event tells you what your requests and limits should actually be based on observed usage — before you need to know.

The reason this matters for spike handling directly: HPA scales horizontally based on the ratio of current resource utilization to the target threshold. If your CPU request is set to 1 CPU but the pod never actually uses more than 200 millicores under real load, Kubernetes reports 20% utilization even when the service is genuinely busy. The over-provisioned request masks the actual saturation signal, and HPA sits idle while latency climbs. VPA surfaces the real baseline.
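
The effect can be worked through with the replica formula from the Kubernetes HPA documentation. The 70% utilization target, the 10 running replicas, and the 250m right-sized request are illustrative assumptions, not values from a real cluster.

```latex
% HPA replica formula (Kubernetes documentation)
\[
\text{desiredReplicas} = \left\lceil \text{currentReplicas} \times
  \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil
\]

% Oversized request: 200m used / 1000m requested = 20% utilization
\[
\left\lceil 10 \times \frac{20}{70} \right\rceil = \lceil 2.86 \rceil = 3
\]

% Right-sized request: 200m used / 250m requested = 80% utilization
\[
\left\lceil 10 \times \frac{80}{70} \right\rceil = \lceil 11.43 \rceil = 12
\]
```

At the same 200m of real usage, the padded request leads HPA to recommend fewer pods (bounded below by minReplicas), while the right-sized request leads it to add two.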

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"       # Recommendation only — do not auto-evict in production
  resourcePolicy:
    containerPolicies:
      - containerName: api
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 2
          memory: 2Gi

With updateMode: "Off", VPA computes and stores recommendations without touching pods. Review the recommendations before any major event (kubectl describe vpa api-server-vpa) and apply them manually in your deployment manifests. The combination of accurate requests and HPA on a custom metric — RPS or P95 latency rather than CPU percentage — is substantially more responsive than CPU-percentage HPA against an oversized request.

One hard constraint: do not run VPA in Auto or Recreate mode on the same pods that HPA manages. The two controllers can conflict — VPA tries to resize pods while HPA is simultaneously creating and deleting them, and the resulting evictions interrupt service at the worst possible moment. The safe production pairing is VPA on recommendations, HPA on a custom metric, and the two reviewed in combination before an anticipated event.

Layer 3 — node provisioning: Cluster Autoscaler versus Karpenter

Figure: CPU provisioned vs CPU actually used, Kubernetes clusters industry-wide. CPU provisioned (requests): 100%. CPU actually used: ~8%. Source: Cast.ai State of Kubernetes Optimization, 2026

Pods that HPA creates but cannot schedule are not serving traffic. The most common reason new pods sit in Pending is that no node has sufficient allocatable capacity — meaning the pod's requests exceed what is free across every node. This is the node-provisioning layer, and its two primary implementations have very different latency profiles for spike scenarios.

Cluster Autoscaler watches for Pending pods, identifies which configured node group could satisfy them, and adds a node to that group via the cloud provider's autoscaling API (ASG on AWS, MIG on GCP). The ASG must then launch an instance, complete the instance boot sequence, run the bootstrap script, register with the Kubernetes API, and pass the node readiness check. End-to-end, this typically takes 3 to 5 minutes depending on AMI size, instance type availability, and bootstrap script length. For gradual load growth this latency is acceptable. For a step-change spike it is a multi-minute gap between HPA requesting pods and those pods entering service.

Karpenter takes a fundamentally different approach: it calls the cloud provider's fleet API directly (EC2 Fleet on AWS), bypassing the autoscaling group layer entirely. According to CNCF and AWS documentation following the Karpenter v1.0 release in late 2024, node provisioning typically completes in under 60 seconds, often closer to 30 seconds in warm-region configurations. Karpenter also consolidates idle capacity continuously — the WhenEmptyOrUnderutilized disruption policy bin-packs workloads onto fewer nodes and terminates underutilized ones on an ongoing basis, reducing the idle waste captured in Cast.ai's benchmarks.

| Cluster Autoscaler (traditional) | Karpenter (modern) |
|---|---|
| Watches Pending pods, triggers ASG node addition | Calls EC2 Fleet API directly, with no ASG intermediary |
| 3–5 minute end-to-end provisioning including ASG and boot | Sub-60-second provisioning in most AWS configurations |
| Node groups require manual instance-type selection per group | NodePool accepts a family of instance types; Karpenter picks the best fit |
| Limited bin-packing: one node group at a time | Bin-packs across instance types and AZs in a single scheduling pass |
| Scale-down requires drain, cordon, and delete cycle | Disruption budgets control consolidation continuously |

Node provisioner comparison, AWS EKS. Source: CNCF blog, Nov 2024; AWS EKS documentation

A minimal Karpenter NodePool configured for a spike-tolerant API tier:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: api-tier
spec:
  template:
    metadata:
      labels:
        workload: api-tier
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: api-nodes
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m6i.xlarge", "m6a.xlarge", "m7i.xlarge", "m7a.xlarge"]
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-1a", "us-east-1b", "us-east-1c"]
  limits:
    cpu: "400"                        # hard ceiling — prevents runaway scale-up
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s

For known events (product launches, scheduled batch imports, marketing sends), pre-provision warm capacity. The simplest approach is to temporarily raise the HPA's minReplicas two to four hours before the expected load; a manual bump to the Deployment's replica count alone is scaled back down by HPA once its scale-down stabilization window passes. Karpenter provisions the nodes, the nodes warm, and HPA takes over from there. This eliminates the cold-start provisioning cost during the critical first minutes when error budget is depleting fastest.

Layer 4 — event-driven scaling with KEDA

HPA on CPU or even on RPS is a reactive instrument: it scales after load arrives and metrics climb. For workloads driven by queues, streams, or scheduled events, queue depth or consumer lag is a better signal — it lets you scale workers before the backlog becomes user-visible latency on the downstream consumer.

KEDA (Kubernetes Event-Driven Autoscaling) is a graduated CNCF project that bridges this gap. It adds a custom metrics server and a library of scalers for common event sources: SQS, Kafka, RabbitMQ, Redis Streams, Azure Service Bus, Prometheus, and scheduled cron windows among others. KEDA's ScaledObject wraps your Deployment or StatefulSet and drives a standard HPA under the hood — so the pod scheduling and node provisioning path remains unchanged.

A KEDA scaler for an SQS-backed order processor:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 2
  maxReplicaCount: 60
  pollingInterval: 10          # check queue depth every 10 seconds
  cooldownPeriod: 120          # only governs scaling to zero; with minReplicaCount: 2, scale-down follows the HPA's scaleDown behavior
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789/orders
        queueLength: "50"      # target: 50 messages per active worker
        awsRegion: us-east-1
        scaleOnInFlight: "true"

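With queueLength: "50", KEDA targets one worker pod per 50 queued messages; because scaleOnInFlight is "true", the count includes in-flight messages as well as visible ones. The arithmetic is direct: if an intake event causes queue depth to climb from 100 messages to 2,000, the target becomes 40 worker pods (2,000 divided by 50). Starting from minReplicaCount: 2, that is 38 additional pods, requested independently of CPU utilization, which may barely have moved if the workers are I/O-bound network callers. How quickly the fleet gets there depends on the scale-up policy of the HPA that KEDA generates: under the default policy (the larger of 4 pods or 100% every 15 seconds) the climb takes roughly four periods (2, 6, 12, 24, 40) rather than a single step.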
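
Whether the fleet reaches 40 workers in one step depends on the scale-up behavior of the HPA that KEDA creates for the ScaledObject. A sketch of the same ScaledObject with that behavior overridden through KEDA's advanced.horizontalPodAutoscalerConfig block follows; the 40-pods-per-15-seconds policy is an assumption chosen to match the worked example, not a recommended value.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 2
  maxReplicaCount: 60
  pollingInterval: 10
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:                          # passed through to the generated HPA
        scaleUp:
          stabilizationWindowSeconds: 0
          policies:
            - type: Pods
              value: 40                  # assumption: allow up to 40 new pods per period
              periodSeconds: 15
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789/orders
        queueLength: "50"
        awsRegion: us-east-1
        scaleOnInFlight: "true"
```

A policy this permissive moves the bottleneck to node provisioning and downstream connection limits, so it is worth pairing with a tested maxReplicaCount rather than treating it as free headroom.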

The cron scaler is underused for predictable traffic patterns. A marketing email blast that sends every Tuesday at 09:00, a scheduled nightly import, a known product launch window: these are cases where the load is known in advance and you can pre-warm replicas before any queue depth or metric threshold fires:

triggers:
  - type: cron
    metadata:
      timezone: America/New_York
      start: "50 8 * * 2"         # 8:50am Tuesday — 10 minutes before the blast
      end: "30 10 * * 2"
      desiredReplicas: "20"
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/123456789/emails
      queueLength: "100"
      awsRegion: us-east-1

The cron trigger holds 20 warm replicas for the window. The HPA that KEDA generates evaluates both triggers and uses whichever asks for more replicas, so once queue depth calls for more than 20 workers the queue trigger drives the count upward. The cron trigger ensures nodes and pods are provisioned and warm before the first message arrives.

Defend the baseline: caching and load shedding

Scaling compute to absorb traffic you could have served from cache is the most expensive form of capacity planning. A request answered by a CDN edge node never reaches your cluster. A request answered by Redis never hits your database. Every request that flows through to your application pods during a spike is a request you chose not to handle more cheaply.

CDN for static and quasi-static content. Product pages, asset files, and anything that can tolerate a TTL of 30 seconds or more belongs behind a CDN. A flash sale landing page cached globally at 50 CDN edges is not a Kubernetes scaling problem on launch day — it is a static file problem. Most CDN providers serve millions of requests per second from edge infrastructure that costs a fraction of compute scaling.

In-process caching for hot read paths. The database connection pool is the most common silent failure mode during a spike — not CPU saturation, but the PostgreSQL or MySQL connection ceiling being hit when 40 newly-provisioned pods each try to open their default connection pool. Cache the top read queries with a TTL that matches your data freshness requirements. A 5-second TTL on a product detail endpoint eliminates the vast majority of database fan-out during a traffic spike at the cost of acceptable staleness.
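
The connection ceiling is easy to check on paper before the load test. In the sketch below, the pool size of 10 per pod, the 20 reserved connections, and the raised limit of 500 are illustrative assumptions; 100 is PostgreSQL's default max_connections, and 80 is the maxReplicas value from the HPA example earlier in this post.

```latex
% Connections demanded by a scaled-out fleet: replicas R times pool size p
\[
N_{\text{conn}} = R \times p = 40 \times 10 = 400 \;>\; 100 = \texttt{max\_connections}
\]

% Largest safe per-pod pool for a given ceiling and maximum replica count
\[
p \le \left\lfloor \frac{C_{\max} - C_{\text{reserved}}}{R_{\max}} \right\rfloor
  = \left\lfloor \frac{500 - 20}{80} \right\rfloor = 6
\]
```

If the resulting pool size is too small for the application to work with, that is the signal to put a connection pooler in front of the database or to cache more aggressively, rather than to raise maxReplicas.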

Rate limiting at the ingress layer. NGINX Ingress and Envoy both support per-client rate limiting before traffic reaches pods:

# NGINX Ingress annotations
nginx.ingress.kubernetes.io/limit-rps: "20"
nginx.ingress.kubernetes.io/limit-connections: "5"
nginx.ingress.kubernetes.io/limit-burst-multiplier: "3"

During a spike driven by a bot, a misconfigured integration, or an unexpected viral moment, rate limiting protects legitimate users without requiring any additional compute. The limit is enforced at the ingress layer, not in application code, so it costs nothing in pod CPU.

Circuit breakers for downstream dependencies. Use a circuit-breaking library on calls to downstream services that are not in your control — payment providers, third-party APIs, internal services with separate SLOs. If a downstream service is degraded, fail fast and return a cached or degraded response rather than holding goroutines or threads open waiting for a timeout, exhausting your connection pool, and cascading the failure upstream. Libraries such as resilience4j (JVM) and go-resiliency (Go) expose circuit-breaker and bulkhead primitives with minimal overhead.

Load testing: from hypothesis to evidence

An autoscaling configuration you have never tested under load is a hypothesis. The specific failure mode for spike scenarios is a step-change: traffic goes from near-zero to many times peak within seconds, with no warm-up period for the autoscaler. That is the test most teams skip.

Load test lifecycle for spike resilience

1. Baseline measurement. Run steady-state load at 50% of expected peak. Record HPA replica count, P95 latency, CPU per pod, and database connection utilization. This is your control measurement.

2. Ramp to expected peak. Increase load to 100% of expected peak over 5 minutes. Observe HPA reaction lag, node provisioning delay, and pod pending duration. Measure time from metric threshold breach to first new pod reaching Ready.

3. Step-change test. Drop load to baseline, then instantly apply 2x expected peak load with no ramp. This is the launch-day scenario. Measure error rate during the gap, pod pending time, and time-to-recovery.

4. Failure injection. While at peak load, delete 30% of pods with kubectl delete and drain one node with kubectl drain. Verify that the deleted pods are replaced, that PodDisruptionBudgets hold during the drain (a PDB governs evictions, not direct deletes), that Karpenter or Cluster Autoscaler provisions replacement capacity, and that recovery completes within your error budget.

5. Endurance run. Hold 80% of peak for 30 minutes. Surface connection pool exhaustion, memory leaks under sustained concurrency, and any node-level resource pressure that does not show up in short tests.

Source: ClimsTech Engineering practice

Tools worth knowing: Grafana k6 for scripted JavaScript or TypeScript scenarios with Prometheus remote-write output and a clean ramp/spike DSL; Locust for Python-native scenarios where your team has Python fluency and needs dynamic behavior logic; Artillery for declarative YAML-first scenarios with minimal scripting overhead. All three can generate realistic traffic shapes and export metrics to your existing observability stack.

What to instrument during a load test beyond P50/P95/P99 latency:

Pod pending time — the delta between pod creation and pod reaching Ready. If this exceeds 90 seconds, node provisioning or image pull is your bottleneck.
Database connection pool saturation — pg_stat_activity rows approaching max_connections. A full pool is invisible at the HTTP layer until requests start queuing behind a connection wait.
HPA metric scrape lag — how quickly does your custom metric reflect injected load? A 60-second Prometheus scrape interval means a 60-second blind spot. Match the scrape interval to your acceptable reaction time.
Downstream dependency error rates — a third-party API that returns 200 OK at steady state may start rate-limiting you at 10x normal volume. Discover this in the load test, not in production, with a real customer in front of you.

"It should scale" and "it scaled — here is the graph" are very different conversations to have with a CTO the night before a major launch.

— ClimsTech Engineering

We have built and load-tested Kubernetes platforms to 100,000 concurrent connections — not because that is a typical workload, but because the only way to know where a system bends is to push it there in a controlled environment before the spike does it for you.

Seven pitfalls and their fixes

Figure: Failure mode. Throughput rising while stability degrades, the signature of a provisioning lag that autoscaling cannot close fast enough. Source: ClimsTech Engineering

1. PodDisruptionBudget blocking scale-down. A PDB with minAvailable: 100% prevents the Cluster Autoscaler or Karpenter from draining nodes — idle capacity accumulates, costs climb, and autoscaling signals become unreliable. Fix: set minAvailable to the minimum replica count that can serve traffic safely, typically 50–70% of your steady-state count, and never at 100% unless you have a hard contractual reason.

2. ResourceQuota ceiling hit silently. Namespace-level ResourceQuota objects cap total CPU and memory requests. When HPA asks for pods that would exceed the quota, the API server rejects the pod creation: HPA's desired replica count rises and the ReplicaSet records FailedCreate events, but the pods never materialise and no alert fires by default. Fix: set maxReplicas to a value consistent with your namespace quota, and add an alert when the ratio of kube_resourcequota{type="used"} to kube_resourcequota{type="hard"} (the kube-state-metrics series) exceeds 80%.

3. Image pull latency adding 60–90 seconds to scale-up. A 1 GB container image on a cold node adds 60 to 90 seconds of pull time before the container can start. This is dead time — the pod sits in ContainerCreating while traffic queues. Fix: use lean base images (distroless or Alpine derivatives where the application supports it), enable image pre-pulling via a DaemonSet on warm nodes, or build Karpenter custom AMIs with images pre-cached at the OS layer.

4. topologySpreadConstraints creating unschedulable pods. Spread constraints that require pods across all three availability zones fail when one zone lacks available nodes — and during a spike, Karpenter or the Cluster Autoscaler may provision nodes unevenly across zones. Fix: set whenUnsatisfiable: ScheduleAnyway as the fallback for non-critical spread requirements, reserving DoNotSchedule only for deployments where AZ isolation is a hard reliability or compliance requirement.

5. KEDA polling interval too slow for burst workloads. The default KEDA polling interval is 30 seconds. For a workload where a queue goes from 0 to 10,000 messages in 10 seconds — an intake event, a batch trigger — this means a 30-second window where workers are not scaling. Fix: set pollingInterval: 10 (or lower for critical paths), and pair queue-depth scaling with a cron trigger for predictable events so warm capacity exists before the queue trigger needs to fire.

6. Liveness probe killing pods during startup. If the liveness probe fires before the application has fully initialized — loaded caches, established database connections, completed startup migrations — Kubernetes kills and restarts the pod in a loop, exactly when you need it to reach Ready under high load. Fix: set initialDelaySeconds on the liveness probe to a conservative value (30–60 seconds for most JVM or interpreted-runtime services), and use a separate startupProbe with a longer failureThreshold to gate liveness checks entirely until bootstrap completes.

7. HPA thrashing on noisy metrics. A custom metric that oscillates around the target threshold — a P95 latency metric with high variance, or an RPS counter with a short averaging window — drives constant scale-up and scale-down events, increasing pod churn and reducing average availability. Fix: ensure your baseline metric value sits at least 15–20% below the HPA target threshold under normal load (beyond the default 10% tolerance band), use a 1-minute rolling average rather than a point-in-time value at the Prometheus recording rule layer, and widen the scale-down stabilization window to at least 5 minutes.
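
Several of the fixes above are a few lines of manifest each. The sketch below covers pitfalls 1, 2, 4, and 6. It assumes the api-server pods carry an app: api-server label, that the service answers health checks on /healthz at port 8080, and that kube-state-metrics and the Prometheus Operator are installed; the image name, the probe timings, and the 60% PDB value are likewise illustrative rather than recommendations.

```yaml
# Pitfall 1: a PDB that still leaves room for node drains
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server
spec:
  minAvailable: 60%
  selector:
    matchLabels:
      app: api-server              # assumed pod label
---
# Pitfall 2: alert before the namespace quota blocks pod creation
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: autoscaling-guardrails
spec:
  groups:
    - name: autoscaling
      rules:
        - alert: ResourceQuotaNearLimit
          expr: |
            kube_resourcequota{type="used"}
              / ignoring(type)
            kube_resourcequota{type="hard"} > 0.8
          for: 5m
          labels:
            severity: warning
---
# Pitfalls 4 and 6: pod template fragment (belongs under the Deployment's spec.template.spec)
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway   # prefer spreading, but never block scheduling
    labelSelector:
      matchLabels:
        app: api-server
containers:
  - name: api
    image: registry.example.com/api-server:1.0.0   # hypothetical image
    startupProbe:                  # liveness checks are held off until this succeeds
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 5
      failureThreshold: 24         # allows up to 120 seconds for bootstrap
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
```

With a startupProbe in place, the liveness probe does not need a large initialDelaySeconds, because Kubernetes does not start liveness checks until the startup probe has passed.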

Choosing your layers: a decision framework

Not every workload needs every layer. The table below maps workload characteristics to the minimum layers that provide reliable spike resilience:

| Workload type | HPA | VPA recommendations | Karpenter | KEDA | Cache / shed |
|---|---|---|---|---|---|
| Stateless HTTP API, gradual ramp | Required | Strongly recommended | Recommended | Optional | Required for hot reads |
| Stateless HTTP API, step-change traffic | Required (spike config) | Required | Required | Cron trigger useful | Required |
| Queue worker, unpredictable depth | Optional | Recommended | Required | Required (queue scaler) | Depends on downstream |
| Queue worker, scheduled pattern | Optional | Recommended | Recommended | Required (cron + queue) | Depends on downstream |
| Batch / data pipeline | Not applicable | Recommended | Required | Required | Not applicable |
| Stateful (StatefulSet, primary DB) | Not applicable | Recommended | Use with PDB caution | Not applicable | Required at app layer |

The general principle: use HPA for user-facing HTTP tiers with custom metrics, KEDA for queue-backed workers, and Karpenter for the node layer regardless of workload type. VPA recommendations cost nothing to collect once deployed and consistently surface request over-provisioning — the root cause behind the industry-average 8% CPU utilization figure that makes every other scaling layer less effective than it should be.

What to remember

A single HPA with default settings cannot handle step-change traffic — configure aggressive scale-up policies explicitly, use custom metrics (RPS or P95 latency rather than CPU), and set minReplicas for fault tolerance rather than traffic absorption.
Run VPA in Recommendation mode before any major event — over-provisioned resource requests mask actual saturation and make CPU-based HPA signals unreliable regardless of how well the scale-up policy is tuned.
Node provisioning is the most commonly unaddressed bottleneck — Karpenter cuts provisioning from 3–5 minutes to under 60 seconds by calling the cloud fleet API directly; pre-provisioning warm nodes before known events eliminates cold-start lag entirely.
KEDA scales on the signal that actually predicts load — queue depth and consumer lag react before user-visible symptoms appear; pair a cron trigger with an event-depth trigger for workloads with predictable onset times.
Cache hot read paths and rate-limit at the ingress before scaling more compute — a CDN hit costs nothing, a Redis hit costs microseconds, and a new pod takes minutes to provision and warm.
Test the step-change scenario specifically — zero to 2x peak with no ramp — because gradual ramp tests miss provisioning lag, image pull latency, and connection pool exhaustion that only appear during an instant spike.
Instrument pod pending time, database connection pool saturation, and HPA metric scrape lag during load tests — the actual failure mode is usually not where the architecture review predicted it would be.
Set alerts on ResourceQuota utilization above 80% and pod pending time above 90 seconds — silent failures in the autoscaling control plane are more dangerous than visible errors because they look like capacity that exists but does not.