The platforms that fall over on launch day usually had autoscaling. Typically they had a single Horizontal Pod Autoscaler, tuned against a steady-state baseline and left on the default scale-up policy (the larger of four pods or a doubling every fifteen seconds), facing a step-change in load it was never tested against. The failure is not surprising once you trace the event chain: HPA waits for the CPU metric to climb past threshold, requests new pods, finds no allocatable node capacity, waits for a node to provision, waits for the container image to pull, and waits for the readiness probe. By the time the first new pod enters service, five to eight minutes may have elapsed. That window is your outage.
Spike resilience is not a single setting. It is four interlocking layers — pod scaling, pod right-sizing, node provisioning, and event-driven scaling — each covering the blind spot of the one above it, plus the baseline defenses of caching and load shedding, plus the discipline of actually testing the full stack before the spike arrives. This post works through each layer in technical detail, flags the specific pitfalls that bite teams in production, and ends with a decision framework for which layers apply to which workload types.
- 8%: average CPU utilization across sampled clusters
- 20%: average memory utilization across sampled clusters
- 69%: CPU overprovisioning rate, a year-on-year increase from 40%

Source: Cast.ai State of Kubernetes Optimization, 2026
The figures above are consistent across Cast.ai's annual benchmarks, which sample thousands of clusters across AWS, GCP, and Azure. The gap between what is provisioned and what is used is the single most important framing for spike handling: if your baseline pods are over-provisioned, your HPA is scaling a padded baseline. You are paying for capacity you never use on quiet days, and the inflated request values make utilization-based metrics sluggish — meaning HPA reacts later than it should when load actually arrives.
Layer 1 — HPA tuned for the shape of your load
HPA is necessary but not sufficient. The problems with a default HPA configuration under spike conditions are specific and fixable.
The signal problem. CPU utilization is a lagging indicator. By the time your pods are saturated, user-visible symptoms — high latency, rising error rates — have been present for tens of seconds. For spike workloads, a better signal is requests-per-second from a Prometheus adapter, or P95 latency from a custom metrics source. These respond to load onset faster and allow HPA to act before pods are overwhelmed rather than after.
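A per-pod requests-per-second metric has to be published through the custom metrics API before HPA can consume it. The following is a minimal sketch of a prometheus-adapter rule, assuming the application exports a counter named http_requests_total that carries namespace and pod labels (the counter name and labels are assumptions; the source does not name the underlying series). The rule renames the counter to http_requests_per_second, the metric name the HPA manifest in this section targets.

```yaml
# prometheus-adapter rule configuration (sketch, not a complete adapter config)
rules:
  # Discover the counter on any series that carries namespace and pod labels.
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    # Expose http_requests_total as http_requests_per_second.
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"
    # Per-pod request rate over a 1-minute window.
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[1m])) by (<<.GroupBy>>)'
```

A shorter rate window reacts faster but is noisier; the 1-minute window here is a starting point to tune against your own scrape interval.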
The scale-up speed problem. The default scale-up policy adds 4 pods every 15 seconds, or doubles every 15 seconds, whichever is larger. For gradual ramp traffic this is acceptable. For a step-change (a product launch, a marketing email drop, a viral moment) you need the doubling behavior to persist through the surge, a larger fixed step so a small fleet does not crawl upward four pods at a time, and a scale-up stabilization window pinned at zero (the default, but worth setting explicitly) so HPA reacts on the first observation above threshold rather than waiting out a window of earlier, lower recommendations:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 80
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "500"   # target: 500 RPS per pod
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0   # act immediately on first reading above threshold
      policies:
        - type: Percent
          value: 100            # allow up to a doubling of pod count per 30-second period
          periodSeconds: 30
        - type: Pods
          value: 10             # or add 10 pods, whichever is larger
          periodSeconds: 30
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300   # 5-minute window before scaling down
      policies:
        - type: Percent
          value: 20
          periodSeconds: 60     # shed at most 20% of pods per minute on the way down

Three things are worth noting here. First, stabilizationWindowSeconds: 0 on scale-up means HPA acts on the first observation that exceeds threshold, with no stabilization delay. Zero is already the default for scale-up; setting it explicitly guards against a later edit adding a delay unnoticed. Second, selectPolicy: Max takes the larger of the two scale-up policies, so a large fleet doubles while a small fleet can add up to 10 pods in one step, against 4 under the default policy. The 30-second period is a slower cadence than the default 15 seconds, so shorten periodSeconds if doubling every 30 seconds is too slow for your surge. Third, minReplicas: 3 is deliberate: a user-facing service at 2 replicas loses 50% of its capacity during a single pod restart; at 3 replicas the loss is 33% and the remaining pods can typically absorb the difference.
The maxReplicas ceiling problem. Teams set maxReplicas conservatively and rarely revisit it. On launch day, HPA hits the ceiling while load keeps climbing. Treat maxReplicas as a circuit-breaker, not a steady-state target — size it at the volume you would genuinely be willing to serve at full compute cost, and verify that your namespace ResourceQuota and available node capacity can actually support that number. Silent quota failures are covered in the pitfalls section below.
Layer 2 — VPA in recommendation mode: right-sizing what HPA scales
The Vertical Pod Autoscaler (VPA) is the least-deployed component in the autoscaling stack and one of the highest-leverage when preparing for a spike. Running it in recommendation-only mode (updateMode: "Off") against real workloads for two to four weeks before a major traffic event tells you what your requests and limits should be, based on observed usage, before you need to know.
The reason this matters for spike handling directly: HPA scales horizontally based on the ratio of current resource utilization to the target threshold. If your CPU request is set to 1 CPU but the pod never actually uses more than 200 millicores under real load, Kubernetes reports 20% utilization even when the service is genuinely busy. The over-provisioned request masks the actual saturation signal, and HPA sits idle while latency climbs. VPA surfaces the real baseline.
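The masking effect can be made concrete with the replica formula from the Kubernetes HPA documentation. The numbers below are illustrative assumptions rather than measurements from the source: 10 current replicas, a 70% CPU utilization target, 200m of observed usage per pod, and a hypothetical right-sized request of 250m.

```latex
% HPA replica calculation
\[
\text{desiredReplicas} =
  \left\lceil \text{currentReplicas} \times
  \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil
\]

% Oversized request: 200m used / 1000m requested = 20% utilization
\[
\left\lceil 10 \times \frac{20}{70} \right\rceil = \lceil 2.86 \rceil = 3
\]

% Right-sized request: 200m used / 250m requested = 80% utilization
\[
\left\lceil 10 \times \frac{80}{70} \right\rceil = \lceil 11.43 \rceil = 12
\]
```

With the oversized request, the same busy pods produce a result below the current replica count, so HPA sees no reason to scale up. With the right-sized request, it asks for two more pods.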
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"   # Recommendation only: do not auto-evict in production
  resourcePolicy:
    containerPolicies:
      - containerName: api
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 2
          memory: 2Gi

With updateMode: "Off", VPA computes and stores recommendations without touching pods. Review the recommendations before any major event (kubectl describe vpa api-server-vpa) and apply them manually in your deployment manifests. The combination of accurate requests and HPA on a custom metric (RPS or P95 latency rather than CPU percentage) is substantially more responsive than CPU-percentage HPA against an oversized request.
One hard constraint: do not run VPA in Auto or Recreate mode on the same pods that HPA manages. The two controllers can conflict — VPA tries to resize pods while HPA is simultaneously creating and deleting them, and the resulting evictions interrupt service at the worst possible moment. The safe production pairing is VPA on recommendations, HPA on a custom metric, and the two reviewed in combination before an anticipated event.
Layer 3 — node provisioning: Cluster Autoscaler versus Karpenter
Pods that HPA creates but cannot schedule are not serving traffic. The most common reason new pods sit in Pending is that no node has sufficient allocatable capacity — meaning the pod's requests exceed what is free across every node. This is the node-provisioning layer, and its two primary implementations have very different latency profiles for spike scenarios.
Cluster Autoscaler watches for Pending pods, identifies which configured node group could satisfy them, and adds a node to that group via the cloud provider's autoscaling API (ASG on AWS, MIG on GCP). The ASG must then launch an instance, complete the instance boot sequence, run the bootstrap script, register with the Kubernetes API, and pass the node readiness check. End-to-end, this typically takes 3 to 5 minutes depending on AMI size, instance type availability, and bootstrap script length. For gradual load growth this latency is acceptable. For a step-change spike it is a multi-minute gap between HPA requesting pods and those pods entering service.
Karpenter takes a fundamentally different approach: it calls the cloud provider's fleet API directly (EC2 Fleet on AWS), bypassing the autoscaling group layer entirely. According to CNCF and AWS documentation following the Karpenter v1.0 release in August 2024, node provisioning typically completes in under 60 seconds, often closer to 30 seconds in warm-region configurations. Karpenter also consolidates idle capacity continuously: the WhenEmptyOrUnderutilized consolidation policy bin-packs workloads onto fewer nodes and terminates underutilized ones on an ongoing basis, reducing the idle waste captured in Cast.ai's benchmarks.
Cluster Autoscaler
- Watches Pending pods, triggers ASG node addition
- 3–5 minute end-to-end provisioning including ASG and boot
- Node groups require manual instance-type selection per group
- Limited bin-packing — one node group at a time
- Scale-down requires a cordon, drain, and delete cycle
Karpenter
- Calls EC2 Fleet API directly — no ASG intermediary
- Sub-60-second provisioning in most AWS configurations
- NodePool accepts a family of instance types; Karpenter picks the best fit
- Bin-packs across instance types and AZs in a single scheduling pass
- Disruption budgets control consolidation continuously
A minimal Karpenter NodePool configured for a spike-tolerant API tier:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: api-tier
spec:
  template:
    metadata:
      labels:
        workload: api-tier
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: api-nodes
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m6i.xlarge", "m6a.xlarge", "m7i.xlarge", "m7a.xlarge"]
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-1a", "us-east-1b", "us-east-1c"]
  limits:
    cpu: "400"   # hard ceiling: prevents runaway scale-up
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s

For known events (product launches, scheduled batch imports, marketing sends) pre-provision warm capacity. The simplest approach is to temporarily raise the HPA's minReplicas two to four hours before the expected load; inflating the Deployment replica count directly only works where no HPA manages the Deployment, because HPA will otherwise reset it. Karpenter provisions the nodes, the nodes warm, and HPA takes over from there. This eliminates the cold-start provisioning cost during the critical first minutes when error budget is depleting fastest.
Layer 4 — event-driven scaling with KEDA
HPA on CPU or even on RPS is a reactive instrument: it scales after load arrives and metrics climb. For workloads driven by queues, streams, or scheduled events, queue depth or consumer lag is a better signal — it lets you scale workers before the backlog becomes user-visible latency on the downstream consumer.
KEDA (Kubernetes Event-Driven Autoscaling) is a graduated CNCF project that bridges this gap. It adds a custom metrics server and a library of scalers for common event sources: SQS, Kafka, RabbitMQ, Redis Streams, Azure Service Bus, Prometheus, and scheduled cron windows among others. KEDA's ScaledObject wraps your Deployment or StatefulSet and drives a standard HPA under the hood — so the pod scheduling and node provisioning path remains unchanged.
A KEDA scaler for an SQS-backed order processor:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 2
  maxReplicaCount: 60
  pollingInterval: 10   # check queue depth every 10 seconds
  cooldownPeriod: 120   # wait 2 minutes after the last active trigger before scaling to zero; with minReplicaCount: 2, scale-down is paced by the HPA instead
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789/orders
        queueLength: "50"   # target: 50 messages per active worker
        awsRegion: us-east-1
        scaleOnInFlight: "true"

With queueLength: "50", KEDA targets one worker pod per 50 messages, counting in-flight messages as well as visible ones because scaleOnInFlight is "true". The arithmetic is direct: if an intake event causes queue depth to climb from 100 messages to 2,000, the desired replica count becomes 40 (2,000 divided by 50). Starting from minReplicaCount: 2, that is 38 additional pods, added as fast as the scale-up policy of the underlying HPA allows and independently of CPU utilization, which may barely have moved if the workers are I/O-bound network callers.
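Because KEDA drives a standard HPA, the jump from 2 to 40 replicas is still paced by that HPA's scale-up policy. Under the Kubernetes defaults (the larger of 4 pods or 100% per 15 seconds) the path is roughly 2, 6, 12, 24, 40, which takes about four periods. The sketch below uses KEDA's advanced.horizontalPodAutoscalerConfig.behavior field to allow a larger first step. The step size of 40 pods is an illustrative assumption, not a value from the source, and should be sized against your node and quota headroom.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 2
  maxReplicaCount: 60
  pollingInterval: 10
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 0
          policies:
            - type: Pods
              value: 40           # assumed step size: allow up to 40 new pods per period
              periodSeconds: 15
            - type: Percent
              value: 100          # or a doubling, whichever is larger
              periodSeconds: 15
          selectPolicy: Max
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789/orders
        queueLength: "50"
        awsRegion: us-east-1
        scaleOnInFlight: "true"
```

A large step only helps if the node layer can place the pods quickly, so this setting belongs together with the node provisioning choices in Layer 3.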
The cron scaler is underused for predictable traffic patterns. A marketing email blast that sends every Tuesday at 09:00, a scheduled nightly import, a known product launch window: these are cases where the load is known in advance and you can pre-warm replicas before any queue depth or metric threshold fires:
triggers:
  - type: cron
    metadata:
      timezone: America/New_York
      start: "50 8 * * 2"   # 8:50am Tuesday, 10 minutes before the blast
      end: "30 10 * * 2"
      desiredReplicas: "20"
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/123456789/emails
      queueLength: "100"
      awsRegion: us-east-1

The cron trigger holds 20 warm replicas for the window. Both triggers feed the same HPA, which follows whichever one asks for more replicas, so once the queue fills past what 20 workers can absorb, the queue-depth trigger takes over and scales further as needed. The cron trigger ensures nodes and pods are provisioned and warm before the first message arrives.
Defend the baseline: caching and load shedding
Scaling compute to absorb traffic you could have served from cache is the most expensive form of capacity planning. A request answered by a CDN edge node never reaches your cluster. A request answered by Redis never hits your database. Every request that flows through to your application pods during a spike is a request you chose not to handle more cheaply.
CDN for static and quasi-static content. Product pages, asset files, and anything that can tolerate a TTL of 30 seconds or more belongs behind a CDN. A flash sale landing page cached globally at 50 CDN edges is not a Kubernetes scaling problem on launch day — it is a static file problem. Most CDN providers serve millions of requests per second from edge infrastructure that costs a fraction of compute scaling.
In-process caching for hot read paths. The database connection pool is the most common silent failure mode during a spike — not CPU saturation, but the PostgreSQL or MySQL connection ceiling being hit when 40 newly-provisioned pods each try to open their default connection pool. Cache the top read queries with a TTL that matches your data freshness requirements. A 5-second TTL on a product detail endpoint eliminates the vast majority of database fan-out during a traffic spike at the cost of acceptable staleness.
Rate limiting at the ingress layer. NGINX Ingress and Envoy both support per-client rate limiting before traffic reaches pods:
# NGINX Ingress annotations
nginx.ingress.kubernetes.io/limit-rps: "20"
nginx.ingress.kubernetes.io/limit-connections: "5"
nginx.ingress.kubernetes.io/limit-burst-multiplier: "3"

During a spike driven by a bot, a misconfigured integration, or an unexpected viral moment, rate limiting protects legitimate users without requiring any additional compute. The limit is enforced at the ingress layer, not in application code, so it costs nothing in pod CPU.
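For context, here is a sketch of where those annotations sit on a complete Ingress object. The host name, service name, port, and ingress class are hypothetical placeholders. With ingress-nginx the limits apply per client IP address, and the burst allowance is the rate multiplied by the burst multiplier (20 x 3 = 60 requests here).

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-server
  annotations:
    nginx.ingress.kubernetes.io/limit-rps: "20"               # requests per second per client IP
    nginx.ingress.kubernetes.io/limit-connections: "5"        # concurrent connections per client IP
    nginx.ingress.kubernetes.io/limit-burst-multiplier: "3"   # burst = limit-rps x 3
spec:
  ingressClassName: nginx          # assumed class name
  rules:
    - host: api.example.com        # hypothetical host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-server   # hypothetical Service
                port:
                  number: 80
```

Each ingress controller replica typically keeps its own counters, so the effective cluster-wide limit may scale with the number of controller replicas. Verify this against your controller version in the load test.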
Circuit breakers for downstream dependencies. Use a circuit-breaking library on calls to downstream services that are not in your control — payment providers, third-party APIs, internal services with separate SLOs. If a downstream service is degraded, fail fast and return a cached or degraded response rather than holding goroutines or threads open waiting for a timeout, exhausting your connection pool, and cascading the failure upstream. Libraries such as resilience4j (JVM) and go-resiliency (Go) expose circuit-breaker and bulkhead primitives with minimal overhead.
Load testing: from hypothesis to evidence
An autoscaling configuration you have never tested under load is a hypothesis. The specific failure mode for spike scenarios is a step-change: traffic goes from near-zero to many times peak within seconds, with no warm-up period for the autoscaler. That is the test most teams skip.
1. Baseline measurement.
Run steady-state load at 50% of expected peak. Record HPA replica count, P95 latency, CPU per pod, and database connection utilization. This is your control measurement.
2. Ramp to expected peak.
Increase load to 100% expected peak over 5 minutes. Observe HPA reaction lag, node provisioning delay, and pod pending duration. Measure time from metric threshold breach to first new pod reaching Ready.
3. Step-change test.
Drop load to baseline, then instantly apply 2x expected peak load with no ramp. This is the launch-day scenario. Measure error rate during the gap, pod pending time, and time-to-recovery.
4. Failure injection.
While at peak load, kill 30% of pods using kubectl delete. Verify PodDisruptionBudgets hold, Karpenter or Cluster Autoscaler replaces nodes, and recovery completes within your error budget.
5. Endurance run.
Hold 80% of peak for 30 minutes. Surface connection pool exhaustion, memory leaks under sustained concurrency, and any node-level resource pressure that does not show up in short tests.
Source: ClimsTech Engineering practice
Tools worth knowing: Grafana k6 for scripted JavaScript or TypeScript scenarios with built-in Prometheus metrics export and a clean ramp/spike DSL; Locust for Python-native scenarios where your team has Python fluency and needs dynamic behavior logic; Artillery for declarative YAML-first scenarios with minimal scripting overhead. All three can generate realistic traffic shapes and export metrics to your existing observability stack.
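The load-shaping steps of the test plan can be sketched as Artillery phases. This assumes an expected peak of 100 new virtual users per second, a staging target URL, and a single read endpoint, all hypothetical. Artillery's arrivalRate counts new virtual users per second rather than requests per second, so calibrate it against your own scenario. Failure injection is not expressed here; it would be run separately with kubectl while a peak phase is active.

```yaml
config:
  target: "https://staging.example.com"   # hypothetical target
  phases:
    - name: baseline            # 50% of assumed peak, control measurement
      duration: 300
      arrivalRate: 50
    - name: ramp-to-peak        # 50% to 100% of peak over 5 minutes
      duration: 300
      arrivalRate: 50
      rampTo: 100
    - name: back-to-baseline    # let the autoscaler settle before the step
      duration: 120
      arrivalRate: 50
    - name: step-change         # 2x peak with no ramp: the launch-day scenario
      duration: 300
      arrivalRate: 200
    - name: endurance           # 80% of peak held for 30 minutes
      duration: 1800
      arrivalRate: 80
scenarios:
  - name: browse-product
    flow:
      - get:
          url: "/api/products/123"   # hypothetical endpoint
```

Naming each phase makes it easier to line up latency and error graphs with the step that produced them.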
What to instrument during a load test beyond P50/P95/P99 latency:
- Pod pending time: the delta between pod creation and pod reaching Ready. If this exceeds 90 seconds, node provisioning or image pull is your bottleneck.
- Database connection pool saturation: pg_stat_activity rows approaching max_connections. A full pool is invisible at the HTTP layer until requests start queuing behind a connection wait.
- HPA metric scrape lag: how quickly does your custom metric reflect injected load? A 60-second Prometheus scrape interval means a 60-second blind spot. Match the scrape interval to your acceptable reaction time.
- Downstream dependency error rates: a third-party API that returns 200 OK at steady state may start rate-limiting you at 10x normal volume. Discover this in the load test, not in production, with a real customer in front of you.
"It should scale" and "it scaled — here is the graph" are very different conversations to have with a CTO the night before a major launch.
We have built and load-tested Kubernetes platforms to 100,000 concurrent connections — not because that is a typical workload, but because the only way to know where a system bends is to push it there in a controlled environment before the spike does it for you.
Seven pitfalls and their fixes
1. PodDisruptionBudget blocking scale-down.
A PDB with minAvailable: 100% prevents the Cluster Autoscaler or Karpenter from draining nodes — idle capacity accumulates, costs climb, and autoscaling signals become unreliable. Fix: set minAvailable to the minimum replica count that can serve traffic safely, typically 50–70% of your steady-state count, and never at 100% unless you have a hard contractual reason.
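A minimal sketch of a budget that still lets nodes drain, assuming the pods carry an app: api-server label (the label and the 60% figure are illustrative choices within the range described above):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server
spec:
  minAvailable: 60%        # leaves room for voluntary evictions during node drain
  selector:
    matchLabels:
      app: api-server      # assumed pod label
```

A percentage keeps the budget proportional as HPA changes the replica count, whereas a fixed integer can become effectively 100% at low replica counts.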
2. ResourceQuota ceiling hit silently.
Namespace-level ResourceQuota objects cap total CPU and memory requests. When HPA asks for pods that would exceed the quota, pod creation is rejected: HPA reports the higher desired replica count, but the pods never materialise, and nothing alerts by default. Fix: set maxReplicas to a value consistent with your namespace quota, and add an alert when quota used divided by quota hard exceeds 80% (kube-state-metrics exposes both values through its kube_resourcequota series).
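One way to express the quota alert as a Prometheus rule, assuming kube-state-metrics is installed. It publishes quota data as a single kube_resourcequota series with a type label of used or hard. The alert name, severity label, and 5-minute hold are illustrative assumptions.

```yaml
groups:
  - name: namespace-quota
    rules:
      - alert: NamespaceQuotaNearlyExhausted   # hypothetical alert name
        # Ratio of used to hard quota per namespace and resource; skip unset (zero) limits.
        expr: |
          kube_resourcequota{type="used"}
            / ignoring(instance, job, type)
          (kube_resourcequota{type="hard"} > 0)
            > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Quota for {{ $labels.resource }} in {{ $labels.namespace }} is above 80%"
```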
3. Image pull latency adding 60–90 seconds to scale-up.
A 1 GB container image on a cold node adds 60 to 90 seconds of pull time before the container can start. This is dead time — the pod sits in ContainerCreating while traffic queues. Fix: use lean base images (distroless or Alpine derivatives where the application supports it), enable image pre-pulling via a DaemonSet on warm nodes, or build Karpenter custom AMIs with images pre-cached at the OS layer.
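A sketch of the DaemonSet pre-pull approach. The image reference is hypothetical, and the init container assumes the image contains /bin/true; a distroless image would need a different no-op entrypoint. The init container exists only to force the kubelet to pull the application image onto each node, and the pause container then keeps the pod alive at negligible cost.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: api-image-prepull
spec:
  selector:
    matchLabels:
      app: api-image-prepull
  template:
    metadata:
      labels:
        app: api-image-prepull
    spec:
      initContainers:
        - name: pull-api-image
          image: registry.example.com/api-server:1.2.3   # hypothetical image to cache
          command: ["/bin/true"]                         # assumes the binary exists in the image
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: 1m
              memory: 8Mi
```

This warms nodes that already exist. Nodes provisioned during the spike still pull cold unless the image is baked into the AMI.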
4. topologySpreadConstraints creating unschedulable pods.
Spread constraints that require pods across all three availability zones fail when one zone lacks available nodes — and during a spike, Karpenter or the Cluster Autoscaler may provision nodes unevenly across zones. Fix: set whenUnsatisfiable: ScheduleAnyway as the fallback for non-critical spread requirements, reserving DoNotSchedule only for deployments where AZ isolation is a hard reliability or compliance requirement.
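A soft zone spread along these lines might look like the fragment below, which belongs in the Deployment's pod template. The app: api-server label is an assumption.

```yaml
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway   # prefer even spread, but never block scheduling
          labelSelector:
            matchLabels:
              app: api-server                 # assumed pod label
```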
5. KEDA polling interval too slow for burst workloads.
The default KEDA polling interval is 30 seconds. For a workload where a queue goes from 0 to 10,000 messages in 10 seconds — an intake event, a batch trigger — this means a 30-second window where workers are not scaling. Fix: set pollingInterval: 10 (or lower for critical paths), and pair queue-depth scaling with a cron trigger for predictable events so warm capacity exists before the queue trigger needs to fire.
6. Liveness probe killing pods during startup.
If the liveness probe fires before the application has fully initialized — loaded caches, established database connections, completed startup migrations — Kubernetes kills and restarts the pod in a loop, exactly when you need it to reach Ready under high load. Fix: set initialDelaySeconds on the liveness probe to a conservative value (30–60 seconds for most JVM or interpreted-runtime services), and use a separate startupProbe with a longer failureThreshold to gate liveness checks entirely until bootstrap completes.
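A minimal sketch of the probe layout, assuming HTTP endpoints at /healthz and /ready on port 8080 and a container named api (paths, port, image, and timings are illustrative assumptions). Kubernetes does not run liveness or readiness checks until the startup probe has succeeded, so the startup budget below is periodSeconds multiplied by failureThreshold, or 120 seconds.

```yaml
containers:
  - name: api
    image: registry.example.com/api-server:1.2.3   # hypothetical image
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 5
      failureThreshold: 24     # up to 5s x 24 = 120s to finish bootstrapping
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 3      # restart after about 30s of consecutive failures
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
      failureThreshold: 2      # stop routing traffic after about 10s of failures
```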
7. HPA thrashing on noisy metrics.
A custom metric that oscillates around the target threshold (a P95 latency metric with high variance, or an RPS counter with a short averaging window) drives constant scale-up and scale-down events, increasing pod churn and reducing average availability. Fix: ensure your baseline metric value sits at least 15–20% below the HPA target threshold under normal load (beyond the default 10% tolerance band), use a 1-minute rolling average rather than a point-in-time value at the Prometheus recording rule layer, and widen the scale-down stabilization window to at least 5 minutes.
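One way to apply the rolling-average advice is a Prometheus recording rule that precomputes a per-pod request rate over a 1-minute window. The counter name http_requests_total, the recorded series name, and the 15-second evaluation interval are assumptions for illustration.

```yaml
groups:
  - name: hpa-signals
    interval: 15s                # evaluate often enough to keep the signal fresh
    rules:
      - record: namespace_pod:http_requests:rate1m   # hypothetical recorded series
        # Per-pod request rate averaged over the last minute.
        expr: sum by (namespace, pod) (rate(http_requests_total[1m]))
```

The window needs at least two scrapes inside it to produce a value, so a 1-minute window assumes a scrape interval of 30 seconds or less.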
Choosing your layers: a decision framework
Not every workload needs every layer. The table below maps workload characteristics to the minimum layers that provide reliable spike resilience:
| Workload type | HPA | VPA recos | Karpenter | KEDA | Cache / shed |
|---|---|---|---|---|---|
| Stateless HTTP API, gradual ramp | Required | Strongly recommended | Recommended | Optional | Required for hot reads |
| Stateless HTTP API, step-change traffic | Required (spike config) | Required | Required | Cron trigger useful | Required |
| Queue worker, unpredictable depth | Optional | Recommended | Required | Required (queue scaler) | Depends on downstream |
| Queue worker, scheduled pattern | Optional | Recommended | Recommended | Required (cron + queue) | Depends on downstream |
| Batch / data pipeline | Not applicable | Recommended | Required | Required | Not applicable |
| Stateful (StatefulSet, primary DB) | Not applicable | Recommended | Use with PDB caution | Not applicable | Required at app layer |
The general principle: use HPA for user-facing HTTP tiers with custom metrics, KEDA for queue-backed workers, and Karpenter for the node layer regardless of workload type. VPA recommendations cost nothing to collect once deployed and consistently surface request over-provisioning — the root cause behind the industry-average 8% CPU utilization figure that makes every other scaling layer less effective than it should be.