Spot instances are the single highest-leverage cost lever in a typical cloud bill, and the one engineering teams are most reluctant to pull in production. The reluctance is reasonable, since it is rooted in real incidents, but it conflates two distinct things: running Spot carelessly and running Spot correctly. This article is about the second. The goal is not to convince you that interruptions are rare (they vary significantly by pool and region), but to show you how to design a system where an interruption is a routine non-event rather than a pager alert.
| Metric | Value | Basis |
|---|---|---|
| Discount vs on-demand | Up to 90% | AWS EC2 Spot pricing |
| Avg interruption rate | Less than 5% | Most popular pools, trailing 30 days |
| Avg savings, mixed cluster | 59% | Cast.ai 2025, 2,100+ orgs |
| Interruption notice | 2 min | AWS EC2 Spot warning window |
Source: AWS EC2 Spot Instance Advisor; Cast.ai 2025 Kubernetes Cost Benchmark Report
What Spot actually is (and what the risk model really looks like)
Spot capacity is the cloud provider's spare compute inventory — the headroom between what they've provisioned in a region and what on-demand customers are currently consuming. AWS sells this headroom at a steep discount on the condition that they can reclaim it when demand rises. The mechanism is straightforward: AWS publishes a Spot Instance Interruption Warning via the instance metadata service and EventBridge roughly two minutes before reclaiming the instance. That two-minute window is everything. How you use it — or fail to — determines whether Spot is an operational hazard or a routine cost optimization.
Three things are worth internalising before touching a NodePool config:
Interruption rates are pool-specific, not universal. AWS publishes trailing-30-day interruption frequency data via the EC2 Spot Instance Advisor, broken into bands: less than 5%, 5–10%, 10–15%, 15–20%, and above 20%. The majority of commonly-used general-purpose pools (m5, m6i, c5, c6i families in most US and EU regions) sit in the less-than-5% band. But niche instance families in constrained regions can run far higher. This is not a number to memorise from a blog post — it is a signal to pull per-pool from the Advisor before committing each family to a NodePool. AWS's EC2 Capacity Manager also began publishing Spot interruption metrics in early 2026, making this data easier to surface in dashboards.
The discount is real and relatively stable for general-purpose families. Spot prices fluctuate with capacity demand, but for m5/m6i/c5/c6i in major regions they rarely spike to on-demand levels for extended periods. AWS also provides Spot Price History via the EC2 API and console, so you can model expected cost before committing. The variability risk is not price — it is availability of a specific pool. Diversification is the answer to availability risk, not avoiding Spot.
GCP and Azure have analogous offerings with different mechanics. GCP Spot VMs (formerly Preemptible VMs) provide a 30-second notice rather than two minutes. Azure Spot Virtual Machines use a minimum 30-second eviction notice under the "Deallocate" policy. The architecture principles in this article apply across all three, but the notice window matters: GCP's 30-second window fundamentally changes your graceful shutdown requirements and rules out anything that can't terminate in under 20 seconds.
The real economics: what you're actually buying
The sticker price is intuitive — up to 90% off on-demand — but the realistic savings in production depend on your on-demand baseline ratio, your instance diversification strategy, and whether your workload genuinely fits the fault-tolerance profile. The numbers are large enough that even a conservative implementation pays off materially.
Worked example: a stateless HTTP API running 20 nodes, each m5.xlarge (4 vCPU, 16 GiB), in us-east-1.
As of mid-2025, the on-demand price for m5.xlarge in us-east-1 is approximately $0.192 per hour. Running 20 nodes continuously costs roughly $0.192 × 20 × 730 = $2,803 per month.
A typical Spot price for m5.xlarge in us-east-1 runs around $0.060–$0.075 per hour (roughly 65–70% off on-demand). At $0.068 per hour average, the same fleet costs approximately $0.068 × 20 × 730 = $993 per month — a saving of about $1,810 per month, or 65%, before any architecture changes.
In a conservative production configuration, you keep 4 nodes on on-demand (via a Compute Savings Plan) and run the remaining 16 on Spot. That puts $0.192 × 4 × 730 = $561 per month on on-demand, and $0.068 × 16 × 730 = $794 per month on Spot. Total: $1,355 per month — saving $1,448 per month (52%) while maintaining a guaranteed capacity floor.
These figures use public price data as of mid-2025; actual prices will differ by region, current demand, and Spot price at launch time. The point is that even a deliberately cautious hybrid model produces material savings without gambling 100% of capacity on a single pricing model.
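The arithmetic above is easy to re-run against your own fleet. The following is a minimal sketch, assuming the same 730-hour month and the mid-2025 rates quoted in the worked example; the function names are illustrative and not part of any AWS tooling.

```python
HOURS_PER_MONTH = 730  # same monthly-hours convention as the worked example


def monthly_cost(hourly_rate: float, nodes: int) -> float:
    """Cost of running `nodes` instances continuously for one month."""
    return hourly_rate * nodes * HOURS_PER_MONTH


def hybrid_fleet(on_demand_rate: float, spot_rate: float,
                 total_nodes: int, on_demand_nodes: int) -> dict:
    """Compare an all-on-demand fleet with an on-demand floor plus Spot."""
    baseline = monthly_cost(on_demand_rate, total_nodes)
    floor = monthly_cost(on_demand_rate, on_demand_nodes)
    spot = monthly_cost(spot_rate, total_nodes - on_demand_nodes)
    total = floor + spot
    return {
        "baseline": baseline,
        "on_demand": floor,
        "spot": spot,
        "total": total,
        "saving": baseline - total,
        "saving_pct": 100 * (baseline - total) / baseline,
    }


# Inputs are the article's mid-2025 us-east-1 m5.xlarge figures (assumed rates).
result = hybrid_fleet(on_demand_rate=0.192, spot_rate=0.068,
                      total_nodes=20, on_demand_nodes=4)
print(f"total ${result['total']:.0f}/month, saving {result['saving_pct']:.0f}%")
```

With these inputs the sketch reproduces the $1,355 total and 52% saving of the conservative configuration; setting `on_demand_nodes=0` gives the all-Spot case.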
The Cast.ai 2025 Kubernetes Cost Benchmark — drawn from over 2,100 organisations running on AWS, GCP, and Azure across the full calendar year 2024 — found that applications use roughly 10% of provisioned CPU and 23% of provisioned memory on average. This means the bulk of provisioned compute is idle capacity, and that idle capacity is being paid at on-demand rates. Spot is the correct instrument for burst and overflow when your baseline requests are already over-provisioned: you're not taking on meaningful new risk by shifting idle headroom to interruptible capacity.
Workload classification: what belongs on Spot
Not every workload belongs on Spot, and the decision is not binary. The right model is to categorise workloads by their tolerance for abrupt node loss, not by whether they "matter" or handle production traffic.
| Workload type | Spot suitability | Rationale |
|---|---|---|
| Stateless HTTP API (multiple replicas) | High | Any single-node loss is absorbed by remaining replicas; load balancer drains connections during the two-minute window |
| CI/CD build runners | High | Jobs retry on failure; short-lived by nature; interruption is operationally equivalent to a worker crash |
| ML training with checkpointing | High | Checkpointed jobs resume from last saved epoch; one-node loss in a distributed job redistributes work |
| Async queue workers | High | Message visibility timeouts ensure reprocessing on worker loss; no data is lost, only latency |
| Batch data processing (idempotent) | High | Idempotent task design means re-running a task is safe; use structured retries |
| Spark / Dask executors (with driver on-demand) | High | Executors are fungible; driver holds state and belongs on on-demand |
| Stateful singletons (single-replica databases) | None | Loss of the node means write unavailability; requires HA setup before Spot is even on the table |
| Distributed databases (Cassandra, OpenSearch) | Conditional | Safe only if quorum is maintained after a node loss; treat data nodes carefully |
| Long-running stateful sessions (no external store) | Low | User sessions lost on interruption without a sticky-session recovery mechanism or external session store |
| Low-replica critical path services | Low | If you have one or two replicas and a human notices when one restarts, protect them with on-demand or the do-not-disrupt annotation |
The key distinction is not "important vs unimportant" — it is "can this workload absorb an abrupt node loss without user-visible impact?" A replicated stateless API serving millions of requests per minute can typically absorb losing one of thirty nodes. A single-replica PostgreSQL primary cannot.
Kubernetes architecture: Karpenter and correct Spot topology
On Kubernetes, Karpenter has largely superseded the older Cluster Autoscaler approach for Spot management. The reasons are architectural, not cosmetic. Cluster Autoscaler works at the node group level, so Spot diversity requires pre-creating multiple managed node groups, one per instance type family per AZ. That means the operational overhead of defining, tagging, and maintaining 15–20 node groups to achieve serious diversification. Karpenter works directly with the EC2 Fleet API and expresses multi-family diversification as a list of allowed instance types in the requirements of a single NodePool resource.
A minimal production-ready Karpenter NodePool for Spot:
    apiVersion: karpenter.sh/v1
    kind: NodePool
    metadata:
      name: spot-general
    spec:
      template:
        spec:
          requirements:
            - key: karpenter.sh/capacity-type
              operator: In
              values: ["spot"]
            - key: node.kubernetes.io/instance-type
              operator: In
              values:
                - m5.xlarge
                - m5a.xlarge
                - m5n.xlarge
                - m6i.xlarge
                - m6a.xlarge
                - m7i.xlarge
                - m7a.xlarge
                - c5.2xlarge
                - c6i.2xlarge
            - key: topology.kubernetes.io/zone
              operator: In
              values: ["us-east-1a", "us-east-1b", "us-east-1c"]
          nodeClassRef:
            group: karpenter.k8s.aws  # the v1 API uses group; v1beta1 used apiVersion
            kind: EC2NodeClass
            name: general
      disruption:
        consolidationPolicy: WhenEmptyOrUnderutilized
        budgets:
          # Always active: at most 20% of nodes voluntarily disrupted at once
          - nodes: "20%"
          # Weekdays, 8 hours from 09:00 UTC: no voluntary disruption at all
          - nodes: "0"
            schedule: "0 9 * * 1-5"
            duration: 8h

The instance type list is deliberate: nine families across three generations, spread across three AZs. The goal is that no single pool reclamation can take down the whole fleet. When one pool is reclaimed or temporarily unavailable, Karpenter picks from the remaining families. This is the single most important architectural decision in a Spot deployment. More families mean lower blast radius per reclamation event.

The two disruption budget entries work together. Karpenter applies the most restrictive budget that is active at any moment, so the unscheduled 20% entry caps voluntary consolidation (not interruption-driven replacement, which is always reactive) at all times, while the scheduled nodes: "0" entry blocks it entirely for eight hours from 09:00 UTC on weekdays. A nodes: "0" entry with no schedule would be active permanently and would block consolidation altogether. This prevents Karpenter from moving pods around during peak traffic; shift the cron expression to match your own peak hours.
The SQS integration for Karpenter is not optional. Without configuring Karpenter's --interruption-queue parameter pointing to an SQS queue receiving EC2 Spot Instance Interruption Warning events from EventBridge, Karpenter only discovers a Spot reclamation when the node disappears from the Kubernetes API — at which point the two-minute window is already gone. With the SQS integration, Karpenter starts cordoning and draining the affected node within seconds of the warning, preserving the full two-minute window for the pod termination sequence.
    # EventBridge rule routing Spot interruption warnings to Karpenter's SQS queue
    aws events put-rule \
      --name "KarpenterInterruptionRule" \
      --event-pattern '{"source":["aws.ec2"],"detail-type":["EC2 Spot Instance Interruption Warning"]}' \
      --state ENABLED
    # The rule also needs the SQS queue attached as a target (aws events put-targets)

    # Set in Karpenter's Helm values or deployment args
    --interruption-queue=karpenter-interruption-queue

1. EC2 emits interruption warning. AWS publishes a Spot Instance Interruption Warning via instance metadata and EventBridge, approximately 2 minutes before the instance is reclaimed.
2. Karpenter receives event via SQS. EventBridge routes the event to Karpenter's SQS queue. Karpenter cordons the node within seconds, preventing new pods from being scheduled there.
3. Node drain begins. Karpenter issues pod evictions. Kubernetes sends SIGTERM to each container. The terminationGracePeriodSeconds countdown starts per pod.
4. Application shuts down gracefully. Well-configured applications finish in-flight requests, close database connections, and complete or re-enqueue in-progress jobs. preStop hooks execute first if defined.
5. Replacement node provisioned in parallel. Karpenter simultaneously launches a replacement node from an available Spot pool, preferring a different family or AZ. It can fall back to on-demand when all Spot pools are exhausted, but only if a NodePool permits the on-demand capacity type.
6. Pods rescheduled and traffic restored. Once the new node passes its Ready check, pending pods reschedule and pass their readinessProbes. Traffic resumes. The interruption leaves no trace in the error rate graph.
Source: AWS EC2 documentation; Karpenter project documentation, 2024
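To make the first two steps concrete, here is a rough sketch of what a consumer of the interruption queue has to extract from each message. Karpenter does this internally, so this is illustration only, not something to deploy alongside it. The field names follow the EventBridge event for Spot interruption warnings; the sample message, its instance ID, and the `drain_deadline` estimate (event time plus two minutes) are assumptions made for the example.

```python
import json
from datetime import datetime, timedelta

NOTICE_WINDOW = timedelta(minutes=2)  # approximate AWS notice period
SPOT_WARNING = "EC2 Spot Instance Interruption Warning"


def parse_interruption(message_body: str):
    """Return drain details for a Spot warning, or None for other events."""
    event = json.loads(message_body)
    if event.get("source") != "aws.ec2" or event.get("detail-type") != SPOT_WARNING:
        return None
    # Replace the trailing Z so older Python versions can parse the timestamp.
    issued_at = datetime.fromisoformat(event["time"].replace("Z", "+00:00"))
    detail = event.get("detail", {})
    return {
        "instance_id": detail.get("instance-id"),
        "action": detail.get("instance-action"),
        "issued_at": issued_at,
        # Estimate only: event time plus the two-minute notice.
        "drain_deadline": issued_at + NOTICE_WINDOW,
    }


# Hypothetical message body; the instance ID and timestamp are made up.
sample = json.dumps({
    "source": "aws.ec2",
    "detail-type": SPOT_WARNING,
    "time": "2025-06-01T12:00:00Z",
    "detail": {"instance-id": "i-0123456789abcdef0", "instance-action": "terminate"},
})
info = parse_interruption(sample)
print(info["instance_id"], "must be drained by", info["drain_deadline"].isoformat())
```

Everything after this point in the sequence, from cordon to rescheduling, has to fit between `issued_at` and `drain_deadline`.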
Handling the interruption notice correctly
The two-minute window is only useful if your application actually uses it. Many teams configure Karpenter's SQS integration and then discover their application ignores SIGTERM, takes 90 seconds to exit uncleanly, and forces a SIGKILL that corrupts in-flight state. Here are the specific configurations that determine whether the window is used or wasted.
terminationGracePeriodSeconds sets the maximum time Kubernetes waits after sending SIGTERM before sending SIGKILL. The default is 30 seconds. For most stateless HTTP services, 30 seconds is sufficient to drain in-flight connections. For queue workers that may be mid-job, 60–120 seconds may be appropriate. Set it explicitly rather than relying on the default — the default is not a performance target, it is a failsafe.
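On the application side, using the grace period means handling SIGTERM rather than ignoring it. Below is a minimal sketch for a queue-style worker, assuming jobs are independent and can be re-enqueued; the `GracefulWorker` class and its method names are hypothetical, not taken from any framework.

```python
import signal
import threading


class GracefulWorker:
    """Processes jobs until SIGTERM arrives, then stops between jobs."""

    def __init__(self):
        self.stopping = threading.Event()
        self.completed = []

    def install(self):
        # Call from the main thread; Python only registers handlers there.
        signal.signal(signal.SIGTERM, self.on_sigterm)

    def on_sigterm(self, signum, frame):
        # Only set a flag here; the main loop decides when to stop.
        self.stopping.set()

    def run(self, jobs, handle):
        """Handle jobs in order; return the ones left for re-enqueueing."""
        pending = list(jobs)
        while pending and not self.stopping.is_set():
            job = pending.pop(0)
            handle(job)  # the in-flight job always runs to completion
            self.completed.append(job)
        return pending
```

A worker built this way would call `install()` at startup and, once `run()` returns, re-enqueue the leftover jobs and close its connections. The longest single job plus that cleanup is the number terminationGracePeriodSeconds has to cover.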
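preStop hooks execute before the SIGTERM signal is sent to the application. The termination grace period countdown starts when the hook starts, not when it finishes, so hook time counts against terminationGracePeriodSeconds; keep the hook short and size the grace period to cover both the hook and the application's own shutdown. A common and effective pattern is a short sleep: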
    spec:
      template:
        spec:
          terminationGracePeriodSeconds: 60
          containers:
            - name: api
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh", "-c", "sleep 5"]

The 5-second preStop sleep gives the load balancer's health check cycle time to remove the pod from its target group before the application begins refusing connections. Without this, the load balancer may continue routing traffic to a pod that has already started shutting down, producing a short burst of 503 errors. This is one of those fixes that eliminates an error class entirely and costs nothing.
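It can help to check these numbers against each other before an interruption does it for you. The sketch below assumes that Kubernetes counts preStop time against terminationGracePeriodSeconds and that some of the notice window is spent before the eviction is issued; the function name and the example values for detection delay and application shutdown time are hypothetical.

```python
def shutdown_budget(notice_s: int, detect_s: int, grace_s: int,
                    prestop_s: int, app_shutdown_s: int) -> dict:
    """Check that a pod's shutdown fits the grace period and the notice."""
    # preStop runs inside the grace period, so both must fit within it.
    needed_s = prestop_s + app_shutdown_s
    # Time left once the warning has been received and the eviction issued.
    available_s = notice_s - detect_s
    return {
        "fits_grace_period": needed_s <= grace_s,
        "fits_notice_window": needed_s <= available_s,
        "slack_s": min(grace_s, available_s) - needed_s,
    }


# Assumed values: 5 s to detect, 5 s preStop, 40 s for the app to exit.
aws = shutdown_budget(notice_s=120, detect_s=5, grace_s=60,
                      prestop_s=5, app_shutdown_s=40)
short_notice = shutdown_budget(notice_s=30, detect_s=5, grace_s=60,
                               prestop_s=5, app_shutdown_s=40)
print(aws)
print(short_notice)
```

With these assumed inputs the two-minute case leaves 15 seconds of slack, while the same pod under a 30-second notice is 20 seconds over budget. That is the practical meaning of the shorter windows on other providers.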
Pod Disruption Budgets limit how many pods from a deployment can be unavailable at once because of voluntary disruptions, meaning evictions issued through the Kubernetes Eviction API. They are not Spot-specific, and they cannot stop an instance from being reclaimed, but they are critical in any Spot deployment because Karpenter drains nodes through that API and may need to drain several in quick succession during a cascade of interruptions.
    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: api-pdb
    spec:
      minAvailable: "50%"
      selector:
        matchLabels:
          app: api

A PDB with minAvailable: "50%" ensures that even if two nodes are drained at the same time, evictions are refused once they would drop you below half your desired replica count. The guarantee covers the drain only: pods still running when the two-minute window expires are lost with the instance. PDBs also prevent voluntary consolidation from going faster than your application can handle, which matters when you have a slow cold start.
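To see what a percentage budget allows at a given replica count, the calculation can be sketched as below. It assumes the documented Kubernetes behaviour of rounding a percentage minAvailable up to the next whole pod; the helper names are illustrative.

```python
def min_available_pods(replicas: int, min_available: str) -> int:
    """Resolve a minAvailable value such as "50%" or "3" to a pod count."""
    if min_available.endswith("%"):
        percent = int(min_available[:-1])
        # Integer ceiling: percentages round up to the next whole pod.
        return -(-replicas * percent // 100)
    return int(min_available)


def allowed_evictions(replicas: int, healthy: int, min_available: str) -> int:
    """How many more pods can be evicted right now without breaching the PDB."""
    return max(0, healthy - min_available_pods(replicas, min_available))


print(allowed_evictions(replicas=7, healthy=7, min_available="50%"))
print(allowed_evictions(replicas=10, healthy=5, min_available="50%"))
```

The second call is the case to watch on Spot: once half the replicas are already down, the budget allows no further evictions, and a drain waits until replacements become ready.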
Protecting specific pods from voluntary disruption — for pods that represent long-running work you cannot interrupt (a distributed training job six hours in, a long-running database migration), use Karpenter's annotation:
    metadata:
      annotations:
        karpenter.sh/do-not-disrupt: "true"

This prevents Karpenter from voluntarily draining the node for consolidation. It does not prevent AWS from reclaiming a Spot node; for that, you need on-demand capacity or a checkpointing strategy. The distinction matters more than it appears in the docs.
The Spot interruption is not the event that causes the page. The page comes from applications that crash on SIGTERM, health checks that don't surface the drain, and load balancers that aren't watching.
Sizing the on-demand baseline
How much on-demand capacity to keep is partly a financial calculation and partly a business continuity question. There is no universal ratio, but there is a principled framework.
The floor should equal the minimum capacity needed to serve baseline traffic at acceptable latency with zero Spot nodes. If you need 20 replicas to serve peak traffic, and your sustained minimum (say, P5 traffic over 90 days) needs 6 replicas, your on-demand floor is 6. The other 14 run on Spot and scale down gracefully if Spot capacity is temporarily unavailable.
Match Savings Plans to the floor, not to the ceiling. Compute Savings Plans give you 40–60% off on-demand rates for a 1 or 3-year commitment on any instance family and region. The commitment is to a consistent usage amount in dollars per hour — you commit to the compute spend that corresponds to your baseline. Using a Savings Plan for the floor means the floor is already discounted; Spot on top of that reduces the variable-demand cost by 60–90%. The combination gives you a very low blended rate without gambling 100% of capacity on Spot availability.
If your traffic is highly variable — more than 5x between floor and peak — Spot is where the real savings live. A Savings Plan commitment at peak capacity is expensive; a Savings Plan commitment at your baseline with Spot covering the surge is the correct structure.
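The floor-sizing rule above reduces to a percentile over observed demand. Here is a minimal sketch, assuming you can export a series of "replicas needed" samples from your autoscaler or metrics store; the sample data is invented to mirror the 6-of-20 example, and the nearest-rank percentile is one reasonable choice among several.

```python
def percentile(samples, p: int):
    """Nearest-rank percentile of `samples` for an integer p in 1..100."""
    ordered = sorted(samples)
    rank = -(-p * len(ordered) // 100)  # ceil(p/100 * n) without floats
    return ordered[max(rank, 1) - 1]


def split_capacity(replica_demand, floor_percentile: int = 5) -> dict:
    """Split peak capacity into an on-demand floor and a Spot remainder."""
    floor = percentile(replica_demand, floor_percentile)
    peak = max(replica_demand)
    return {
        "on_demand_floor": floor,
        "spot_burst": peak - floor,
        "peak_to_floor": peak / floor,
    }


# Hypothetical 90 daily samples of replicas needed; not real traffic data.
demand = [6] * 10 + [10] * 40 + [15] * 30 + [20] * 10
print(split_capacity(demand))
```

Raising `floor_percentile` to 25 is the cautious starting point suggested later in the article. The `peak_to_floor` ratio indicates how much of the bill sits in the variable portion that Spot covers.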
Naive all-Spot deployment
- All nodes from one instance type in one AZ
- No SQS interruption queue configured
- No Pod Disruption Budgets defined
- terminationGracePeriodSeconds left at default 30 seconds
- Node Termination Handler running alongside Karpenter, creating a drain race
- 100% of capacity at risk from a single pool reclamation event
Production-grade hybrid
- On-demand floor (20–30% of capacity) covered by Compute Savings Plans
- Spot burst across 6–8 instance families, 3+ AZs, per EC2 Spot Advisor
- Karpenter SQS queue for proactive two-minute interruption handling
- PDBs enforce minimum available replicas per deployment
- preStop hooks and tuned terminationGracePeriodSeconds per service
- Node Termination Handler removed; Karpenter owns all interruption events
Pitfalls in production, with fixes
These are the failure modes that turn a Spot architecture from a cost win into an operational liability. Each one has a specific, low-cost fix.
Pitfall 1: Running Node Termination Handler alongside Karpenter.
NTH is the predecessor to Karpenter's built-in interruption handling. Running both creates a race condition: when a Spot interruption warning arrives, both NTH and Karpenter simultaneously attempt to drain the affected node. The result is pods stuck in Terminating indefinitely, eviction failures, and occasionally a node that never fully drains before AWS reclaims it. This is a silent failure mode — it doesn't cause immediate errors, it causes prolonged degradation that looks like a slow rolling deployment.
Fix: Remove NTH entirely when using Karpenter. Karpenter's --interruption-queue replaces all of NTH's Spot-specific functionality. There is no scenario where running both is correct.
Pitfall 2: Single instance type or single AZ.
A NodePool that requests only m5.xlarge in us-east-1a can lose its entire Spot capacity simultaneously if that pool is reclaimed. AWS occasionally reclaims capacity across a full pool in a region when on-demand demand spikes rapidly. If your entire fleet is that pool, you run without it until Karpenter falls back to on-demand, and if you haven't configured that fallback, you stay without capacity until the pool recovers.
Fix: Use at minimum 6–8 instance types across 2+ generations and 3 AZs. Consult the EC2 Spot Instance Advisor and exclude families that consistently sit above 10% interruption rate. Treat the diversification list as a living document — review it quarterly.
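The quarterly review can be partly mechanical. The sketch below assumes you have already copied each candidate type's interruption band out of the Spot Instance Advisor into a mapping; the band labels, the sample values, and the function name are all hypothetical and need replacing with real Advisor data.

```python
# Upper bound (percent) of each interruption-frequency band.
BAND_UPPER = {"<5%": 5, "5-10%": 10, "10-15%": 15, "15-20%": 20, ">20%": 100}


def select_instance_types(advisor_bands: dict, max_rate: int = 10,
                          min_types: int = 6) -> list:
    """Keep instance types whose band tops out at or below max_rate."""
    keep = sorted(t for t, band in advisor_bands.items()
                  if BAND_UPPER[band] <= max_rate)
    if len(keep) < min_types:
        raise ValueError(f"only {len(keep)} types pass; need {min_types}")
    return keep


# Hypothetical bands for illustration only; pull real values from the Advisor.
bands = {
    "m5.xlarge": "<5%", "m5a.xlarge": "<5%", "m6i.xlarge": "5-10%",
    "m6a.xlarge": "<5%", "m7i.xlarge": "10-15%", "c5.2xlarge": "<5%",
    "c6i.2xlarge": "5-10%",
}
print(select_instance_types(bands))
```

The `min_types` guard is deliberate: if filtering leaves fewer types than the diversification target, failing loudly is safer than shipping a narrow NodePool.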
Pitfall 3: Voluntary consolidation running during peak traffic.
Karpenter's default disruption budget allows up to 10% of managed nodes to be voluntarily disrupted simultaneously, with no schedule awareness. On a 100-node cluster during a traffic spike, this means up to 10 nodes being drained, with every pod on them evicted and rescheduled, while your SLOs are already tight. Rescheduled pods start with cold caches, freshly opened connection pools, and unwarmed JIT compilers. The latency tail widens.
Fix: Use the schedule and duration fields on Karpenter's disruption budgets to confine voluntary consolidation to low-traffic windows, typically overnight or early morning. Because Karpenter applies the most restrictive budget active at any moment, a nodes: "0" entry scheduled over business hours stops consolidation for that window, and an unscheduled percentage entry governs the rest of the time.
Pitfall 4: Health checks that pass before the application is actually ready.
When a Spot replacement node provisions and pods reschedule, the readinessProbe determines when the new pod enters rotation. A probe that passes after 1 second for an application that requires 15 seconds to warm its connection pool will route production traffic to a pod that isn't ready, producing errors until the pool initialises.
Fix: Set initialDelaySeconds to match your actual cold-start time. Use a dedicated /ready endpoint that checks not just that the HTTP server is up, but that critical dependencies (cache connection, database pool, any required config fetches) are fully initialised.
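The dedicated /ready endpoint described in the fix could be sketched with the standard library as follows. The dependency names in `READY_CHECKS` are hypothetical placeholders; a real service would flip each flag once the corresponding connection or fetch has actually completed.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical dependency flags; set each to True once that dependency
# has finished initialising (cache connected, database pool warmed, ...).
READY_CHECKS = {"cache": False, "db_pool": False, "config": False}


def readiness():
    """Return (status_code, body) reflecting every dependency check."""
    pending = sorted(name for name, ok in READY_CHECKS.items() if not ok)
    status = 200 if not pending else 503
    return status, json.dumps({"ready": not pending, "pending": pending})


class ReadyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/ready":
            self.send_error(404)
            return
        status, body = readiness()
        payload = body.encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080):
    """Blocking call; start it from the service's startup path."""
    HTTPServer(("", port), ReadyHandler).serve_forever()
```

Returning 503 until every check passes keeps a freshly rescheduled pod out of rotation for as long as it is still warming up, independent of how the probe timings are tuned.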
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 5
      failureThreshold: 3

Pitfall 5: Treating karpenter.sh/do-not-disrupt as Spot protection.
The annotation prevents Karpenter from voluntarily draining a node for consolidation. It does not prevent AWS from reclaiming a Spot node. Teams sometimes annotate long-running jobs expecting full protection from interruption, then discover during an incident that an AWS reclamation bypassed the annotation entirely — because it was never designed to stop that.
Fix: For workloads with multi-hour execution time that cannot tolerate interruption, run them on on-demand nodes or implement proper checkpointing. For ML training, PyTorch, JAX, and TensorFlow all support checkpoint-and-resume; use them. The annotation is a Karpenter-level guard, not a cloud-level one.
Pitfall 6: Oversized on-demand baseline consuming Savings Plans headroom.
Teams sometimes size the floor at 50% of peak capacity as a conservative safety measure. The resulting Savings Plan commitment then approaches the cost of running everything on-demand, effectively eliminating the Spot savings. The floor has grown to where it no longer represents "minimum needed" but "I don't fully trust Spot."
Fix: Size the floor at your actual sustained minimum (P5–P10 traffic over 90 days), and let autoscaling and Spot handle the rest. If you're uncomfortable with that, start at P25 and shrink it after two weeks of observing interruption impact. The distrust of Spot is almost always based on incidents from misconfigured deployments, not from inherent unreliability.
Monitoring a Spot deployment in production
A Spot architecture changes the signals that matter. Node-level CPU and memory are necessary but not sufficient. Add these four signals to your dashboards.
Spot interruption rate per pool. The EC2 Spot Instance Advisor provides trailing-30-day data by family, size, and AZ. AWS CloudWatch with EC2 Capacity Manager now surfaces Spot interruption metrics directly. Alert if a pool you are actively using crosses above 10% interruption rate — that is a signal to remove that family from your NodePool's instance type list before it causes a wave of simultaneous replacements.
Node replacement latency. Track the time from a Spot interruption event (visible in EventBridge) to a replacement node reaching Ready. If this consistently exceeds 90 seconds, you are burning through the two-minute warning window and leaving almost no time for graceful pod termination. The target on a well-configured cluster is under 60 seconds for a replacement node to be provisioned and join.
Eviction error rate. Failed evictions (pods stuck in Terminating, PDB violations blocking drains) surface in Karpenter's metrics and as Kubernetes events. A spike in eviction errors almost always means your PDB settings are too restrictive for the replica count or your terminationGracePeriodSeconds is shorter than your application's actual shutdown time.
Downstream correlation. The most important signal is whether Spot interruptions produce visible impact on your application's error rate, latency, or queue depth. On a correctly configured cluster, they should not. A Prometheus query for tracking interruption-driven node churn (assuming Karpenter's metrics endpoint is scraped):
    sum(increase(karpenter_nodes_terminated_total{nodepool="spot-general", reason="spot-interruption"}[1h]))

Metric and label names have changed between Karpenter releases, so confirm them against the metrics reference for your version. Overlay this metric with your application's P99 latency or HTTP error rate on the same time axis. If there is zero visible correlation between Spot interruptions and application metrics, your architecture is correct. If there is correlation, the fault is almost always in the termination sequence, not in the interruption rate itself.
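If you would rather put a number on "visible correlation" than eyeball two graphs, a plain Pearson coefficient over exported hourly series is a reasonable first pass. This is a sketch under the assumption that both series are already aligned to the same hourly buckets; the function is generic and the example series are invented.

```python
import math


def pearson(xs, ys) -> float:
    """Pearson correlation of two equal-length series; 0.0 if either is flat."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


# Invented hourly series: interruption counts and P99 latency in ms.
interruptions = [0, 1, 0, 3, 0, 2]
p99_latency_ms = [180, 182, 179, 181, 183, 180]
print(round(pearson(interruptions, p99_latency_ms), 2))
```

A coefficient near zero over several weeks is consistent with interruptions being absorbed cleanly. A persistently positive value is the cue to revisit preStop hooks, grace periods, and readiness probes before blaming the pool.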