Most teams bolt on a cache when something is slow. They pick a round-number TTL — 300 seconds feels good — declare victory on the latency dashboard, and move on. Six months later they have a quietly corrupted user profile, a thundering herd that wobbles the origin on every deploy, and a cache layer that is silently masking a database query that runs 12 seconds cold. The problem is not that caching is difficult; the primitives are simple. The problem is that caching defers cost and hides it. This guide exists to make those costs visible before they show up in an incident report.
| Metric | Typical figure | Context |
|---|---|---|
| Cache hit ratio | 95–99% | Mostly-static sites (Cloudflare) |
| Tail latency saved | 50–100 ms | Cloudflare tiered cache |
| Throughput gain | 10–50× | Redis-cached API endpoints |

Source: Cloudflare blog, 2024; Akamai blog, 2024; community Redis benchmarks
The four layers and where your money is
Caching is not a single technology — it is a stack of interception points, each with different reach, latency, and control surface. Working from the user inward:
Browser cache is the best cache hit you will ever get: zero latency, zero bandwidth, zero origin load. The downside is that once the response leaves your servers you have no further control. Get the Cache-Control headers right on the first delivery or you are stuck waiting for the TTL to drain on every user's machine individually — and you cannot accelerate that.
CDN / edge is the highest-leverage server-side cache for global traffic. A cache hit at the edge never touches your origin at all — no compute, no database query, no internal network hop. Cloudflare's own data shows that enabling tiered (two-tier) caching produces 50–100 ms reductions in tail latency for cache hits. Providers running origin-shield configurations consistently report hit ratios above 96% for well-suited content (Akamai and Fastly, 2024 benchmarks).
Application cache (Redis, Memcached, or an in-process LRU) is the right layer for computed results, aggregated queries, session state, and data that requires application logic to assemble. This is where thundering herds live, where invalidation gets complicated, and where most of the interesting engineering happens.
Database-level caching — query result caches, InnoDB buffer pools, PostgreSQL shared_buffers — is tuning, not architecture. You configure it; you rarely build against it directly. It helps, but it does not substitute for the layers above.
| Layer | Typical hit latency | Your control | Hardest failure mode |
|---|---|---|---|
| Browser | 0 ms (local disk) | Headers at delivery time only | Stale after deploy; no recall |
| CDN / edge | 2–20 ms | You + CDN configuration | Vary header explosion; stale after purge failure |
| Application (Redis) | 0.2–2 ms | Full | Stampede; memory eviction; serialization bugs |
| Database buffer pool | 0.05–0.5 ms | DBA / cloud config | Cold start after failover |
Push work outward as far as correctness allows. Every request served at the edge is a request that never reaches your origin, your application servers, or your database.
These figures are approximate community benchmarks and vary significantly by workload, content mix, and configuration. Your actual hit ratio depends on request diversity, content freshness requirements, and whether you have configured cache key normalization to collapse equivalent requests.
What belongs in a cache — and what will hurt you
The question is not "can I cache this?" — you can cache almost anything. The question is "what is the cost of serving this stale, and what is the cost of serving it to the wrong person?"
Good cache candidates share three properties: they are expensive to produce (slow query, external API call, CPU-intensive computation), they are read far more often than they are written, and the business consequence of serving a value that is a few seconds or minutes stale is acceptable. Rendered HTML pages, product catalog data, aggregated analytics, feature flag payloads, and reference tables all qualify.
Bad candidates are where teams get burned. Financial balances, inventory levels, authentication token validity, and per-user sensitive data are poor fits for a shared cache. Serving an out-of-date account balance is a correctness bug. Caching per-user data under a shared cache key is a security vulnerability — one user can see another user's data.
| Data type | Cache it? | Notes |
|---|---|---|
| Rendered product page (public) | Yes | TTL matches acceptable staleness; tag by entity |
| User-specific dashboard | Cautiously | Scope cache key to user ID; keep TTL short |
| Aggregated analytics (last 24h) | Yes | Staleness is inherent in time-bucketed aggregates |
| Account balance / inventory count | No | Must be exactly current; caching adds risk, no benefit |
| External API response (FX rates, weather) | Yes | Cache to absorb external rate limits; expose last-refresh timestamp |
| Authentication token validity | No | Token revocation must take effect immediately |
| Error responses (4xx, 5xx) | 1–5 s max | A cached 503 outlasts the actual outage; treat this as its own pitfall |
TTL arithmetic: stop picking round numbers
The single most common caching mistake is a cargo-culted TTL. Someone picked 300 seconds years ago. It has been copied across every service ever since. Nobody knows why.
Choosing a TTL should start with the staleness tolerance of the consumer, not a gut feeling. Ask: if this value is wrong for X seconds, what is the worst-case business impact? For a product catalog, stale for 60 seconds is harmless. For a fraud signal, stale for 60 seconds could mean an approved fraudulent transaction slips through. Map business requirements to seconds before writing code.
Once you have a staleness budget, the second consideration is synchronized expiry. If you populate a large catalog under a single TTL at load time, every entry expires at the same moment. The fix is jitter: add a random offset to each TTL at write time.
    import random

    BASE_TTL = 300  # seconds
    JITTER = 60     # spread expiry over a 60-second window

    def jittered_ttl(base: int = BASE_TTL, spread: int = JITTER) -> int:
        # Offset is drawn uniformly from [-spread/2, +spread/2] seconds.
        return base + random.randint(-spread // 2, spread // 2)

    # `cache` is assumed to be a Redis-style client whose set() accepts ex=<seconds>.
    cache.set(key, value, ex=jittered_ttl())

This spreads expiry roughly uniformly across a 60-second window. For a catalog of 10,000 keys written in the same second, this cuts the peak expiry rate from 10,000 keys/s to about 10,000 / 60 ≈ 167 keys/s, a roughly 60-fold reduction in synchronized miss load.
Stale-while-revalidate is the cleanest strategy for user-facing reads. The stale value is served immediately while a background process asynchronously refreshes the entry. No user waits on a cold rebuild. This is native in Cache-Control for CDNs:
    Cache-Control: public, max-age=60, stale-while-revalidate=30

This tells the CDN: serve from cache for up to 60 seconds; if the entry is between 60 and 90 seconds old, serve the stale copy and trigger a background revalidation; after 90 seconds, block until revalidated. For Redis, you implement this manually by checking the remaining TTL on read. If the TTL is below a threshold, fire a non-blocking background refresh task while returning the current value to the caller.
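The application-side variant described above might be sketched as follows. This is a minimal illustration under stated assumptions, not a production implementation: `TTLCache` is a hypothetical in-memory stand-in for a Redis client (so the example runs without a server), `get_swr` and `refresh_below` are names invented here, and a cached value of `None` is treated as a miss.

```python
import threading
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Hypothetical in-memory stand-in for Redis SET ... EX plus a TTL lookup."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[object, float]] = {}
        self._lock = threading.Lock()

    def set(self, key: str, value: object, ex: float) -> None:
        with self._lock:
            self._data[key] = (value, time.monotonic() + ex)

    def get_with_ttl(self, key: str) -> Tuple[Optional[object], float]:
        """Return (value, seconds_remaining), or (None, 0.0) if missing or expired."""
        with self._lock:
            entry = self._data.get(key)
        if entry is None:
            return None, 0.0
        value, expires_at = entry
        remaining = expires_at - time.monotonic()
        if remaining <= 0:
            return None, 0.0
        return value, remaining


_refreshing = set()               # keys with a background refresh in flight
_refreshing_lock = threading.Lock()


def _refresh_in_background(cache: TTLCache, key: str,
                           loader: Callable[[], object], ttl: float) -> None:
    with _refreshing_lock:
        if key in _refreshing:    # another reader already started a refresh
            return
        _refreshing.add(key)

    def work() -> None:
        try:
            cache.set(key, loader(), ex=ttl)
        finally:
            with _refreshing_lock:
                _refreshing.discard(key)

    threading.Thread(target=work, daemon=True).start()


def get_swr(cache: TTLCache, key: str, loader: Callable[[], object],
            ttl: float = 60.0, refresh_below: float = 30.0) -> object:
    value, remaining = cache.get_with_ttl(key)
    if value is None:
        # Hard miss: there is nothing stale to serve, so this caller rebuilds.
        value = loader()
        cache.set(key, value, ex=ttl)
        return value
    if remaining < refresh_below:
        # Entry is close to expiry: serve it now, refresh without blocking.
        _refresh_in_background(cache, key, loader, ttl)
    return value
```

With a real Redis client the same shape would apply, using the remaining TTL reported by the server in place of `get_with_ttl`; the in-flight set here only deduplicates refreshes within one process.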
Worked example. Suppose you cache a product pricing page. You receive 1,000 requests per second across 10,000 unique product pages. The origin takes 200 ms to serve each page. Without caching: 1,000 × 200 ms = 200 origin-seconds of work per second. With a 60-second TTL and 10,000 unique pages distributed across the TTL window: expected miss rate is 10,000 / 60 ≈ 167 misses/s. At 200 ms per miss: 33 origin-seconds per second. That is an 83% reduction in origin load, from a single caching layer, without changing a line of origin code. The remaining misses can be further protected with stampede defences.
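The arithmetic in the worked example can be reproduced with a few lines. This is a rough model under the same assumptions as the text (every page is requested at least once per TTL window, and misses are spread evenly across it); `origin_load` is a name chosen here for illustration.

```python
def origin_load(requests_per_s: float, unique_pages: int,
                origin_seconds: float, ttl_seconds: float):
    """Return (uncached load, cached load, fractional reduction) in origin-seconds/s."""
    uncached = requests_per_s * origin_seconds
    # Each page is rebuilt at most once per TTL window, and there can never
    # be more misses than requests.
    misses_per_s = min(requests_per_s, unique_pages / ttl_seconds)
    cached = misses_per_s * origin_seconds
    return uncached, cached, 1 - cached / uncached


uncached, cached, reduction = origin_load(1000, 10_000, 0.2, 60)
print(f"{uncached:.0f} -> {cached:.1f} origin-seconds/s ({reduction:.0%} reduction)")
# 200 -> 33.3 origin-seconds/s (83% reduction)
```

Plugging in your own request rate, key count, and TTL gives a quick upper bound on miss load before any stampede protection is applied.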
Cache invalidation: the four real strategies
Cache invalidation is not just hard. It has failure modes you cannot observe until users start noticing.
The old joke that cache invalidation is one of the two hard things in computer science survives because the failure modes are real. There are four practical strategies, each with different complexity and correctness tradeoffs:
1. Pure TTL. Write once; let entries age out. Simple, no extra infrastructure, and the worst case is bounded staleness. The right choice when staleness is acceptable and the write rate is low.
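2. Write-through. On every write, update both the database and the cache in the same operation. Reads stay fast and the cache tracks the database closely. The cost is write latency (two writes per update) and the risk of divergence if the cache write fails after the database write succeeds. Redis pipelining and MULTI/EXEC only batch or group commands inside Redis; they cannot make the database write and the cache write atomic together. Mitigate the gap with an outbox pattern that retries the cache update, or fall back to deleting the key on failure and keep a TTL as a backstop.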
3. Cache-aside with event-driven invalidation. The application writes only to the database and publishes a domain event. A separate consumer listens for events and deletes the corresponding cache keys. The cache repopulates lazily on the next read. This decouples the cache from the write path and is the standard pattern in event-driven architectures.
4. Surrogate key / tag-based invalidation. Fastly and Cloudflare both support this natively (Fastly calls them surrogate keys; Cloudflare calls them cache tags). Each cached response is tagged with logical identifiers — a product page can carry tags product:123, brand:acme, and category:footwear. A single purge call targeting brand:acme instantly evicts every response associated with that brand. This is the right approach for complex content graphs where one entity change invalidates many downstream pages.
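To make the tag mechanism concrete, here is a minimal in-memory sketch of the bookkeeping behind tag-based purging. It is an illustration only: `TaggedCache` and its methods are hypothetical names, and real CDNs implement this at the edge with their own APIs rather than anything resembling this class.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set


class TaggedCache:
    """Toy cache that indexes entries by tag so one purge evicts many keys."""

    def __init__(self) -> None:
        self._values: Dict[str, bytes] = {}
        self._keys_by_tag: Dict[str, Set[str]] = defaultdict(set)
        self._tags_by_key: Dict[str, Set[str]] = {}

    def set(self, key: str, value: bytes, tags: Iterable[str]) -> None:
        self._drop(key)                      # clear any previous tag links
        self._values[key] = value
        self._tags_by_key[key] = set(tags)
        for tag in self._tags_by_key[key]:
            self._keys_by_tag[tag].add(key)

    def get(self, key: str) -> Optional[bytes]:
        return self._values.get(key)

    def purge_tag(self, tag: str) -> int:
        """Evict every entry carrying `tag`; return how many were evicted."""
        keys = list(self._keys_by_tag.get(tag, ()))
        for key in keys:
            self._drop(key)
        return len(keys)

    def _drop(self, key: str) -> None:
        self._values.pop(key, None)
        for tag in self._tags_by_key.pop(key, set()):
            bucket = self._keys_by_tag.get(tag)
            if bucket is not None:
                bucket.discard(key)
                if not bucket:
                    del self._keys_by_tag[tag]


cache = TaggedCache()
cache.set("/p/123", b"<html>shoe</html>", ["product:123", "brand:acme", "category:footwear"])
cache.set("/p/456", b"<html>boot</html>", ["product:456", "brand:acme"])
cache.set("/p/789", b"<html>hat</html>", ["product:789", "brand:other"])
evicted = cache.purge_tag("brand:acme")      # evicts both acme pages, leaves /p/789
```

The design point is the reverse index from tag to keys: the writer pays a small cost at set time so that a brand-wide change needs one purge call instead of a key scan.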
Event-driven invalidation (strategy 3), step by step:

1. Write to database. Application commits the authoritative state change: user update, price change, config toggle. The cache is not touched at write time.
2. Emit domain event. The write path publishes an event to a durable queue (Kafka, SNS, Redis Streams) carrying entity type and primary key. The write path is complete at this point; the cache is decoupled.
3. Invalidation consumer processes event. A lightweight consumer translates the event into one or more cache keys and issues DEL or UNLINK commands. UNLINK is preferred: it unlinks the key immediately but reclaims memory asynchronously, keeping Redis latency low.
4. Next read repopulates the cache. The first reader after invalidation hits the origin, executes the full query or computation, and writes the new value with a fresh TTL. Combine this with singleflight to avoid stampede on the repopulation itself.
5. Monitor the miss rate. Alert if the post-deploy miss rate spikes unexpectedly or stays elevated beyond a few minutes. This indicates broken invalidation or an upstream regression, not a warming period.

Source: Standard write-invalidate pattern
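The consumer step in this flow might look roughly like the sketch below. Everything named here is an assumption for illustration: the `DomainEvent` shape, the key patterns in `KEYS_FOR_ENTITY`, and the `delete_keys` callback, which stands in for whatever issues DEL or UNLINK against your cache.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class DomainEvent:
    entity_type: str   # e.g. "product"
    entity_id: str     # primary key of the changed entity


# Hypothetical mapping from entity type to the cache keys derived from it.
KEYS_FOR_ENTITY: Dict[str, Callable[[str], List[str]]] = {
    "product": lambda pid: [f"product:{pid}", f"product:{pid}:price"],
    "user": lambda uid: [f"user:{uid}:profile"],
}


def keys_for(event: DomainEvent) -> List[str]:
    builder = KEYS_FOR_ENTITY.get(event.entity_type)
    return builder(event.entity_id) if builder else []


def handle_events(events: Iterable[DomainEvent],
                  delete_keys: Callable[[List[str]], None]) -> int:
    """Delete the cache keys for each event; return how many keys were targeted."""
    targeted = 0
    for event in events:
        keys = keys_for(event)
        if keys:
            delete_keys(keys)   # with a Redis client this would issue UNLINK
            targeted += len(keys)
    return targeted
```

Keeping the event-to-keys mapping in one table means a new cached view of an entity needs one new entry here rather than a change to every write path.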
One pattern to avoid: flushing the entire cache on every deploy. It is safe, but it destroys cache effectiveness and creates a predictable thundering herd on startup. Prefer key-scoped invalidation tied to actual data changes.
The thundering herd problem, in depth
The thundering herd — also called cache stampede or dog-pile effect — is the failure mode where a cache entry expires under load and every concurrent request misses simultaneously, routing to the origin at the same time. The origin, which was sized for cached load, absorbs the full request volume at once. For a popular key — a landing page, a hot API endpoint, a shared configuration object — this can be hundreds or thousands of simultaneous origin requests in a fraction of a second. Origin latency spikes. Errors accumulate. The failure feeds back: slow origin responses mean cache entries stay empty longer, which means more requests keep hitting the origin.
No stampede protection
- Popular key expires under sustained load
- All N concurrent requests miss simultaneously
- Origin receives N parallel queries for identical data
- Origin latency spike causes a timeout cascade
- Key stays empty longer, amplifying the stampede
Singleflight and stale-while-revalidate
- Key nearing expiry: one background goroutine refreshes early
- Stale value served to all readers during background refresh
- Singleflight ensures exactly one rebuild runs regardless of concurrency
- Origin receives exactly one request per cache miss event
- New value atomically replaces stale entry on completion
Defence 1: Singleflight / request coalescing
The singleflight pattern ensures that for any given cache key, only one goroutine (or thread) executes the origin fetch while all other concurrent callers block and share the result. Go provides this in golang.org/x/sync/singleflight, an official supplementary module rather than part of the standard library; the pattern is equally implementable in any language with a concurrent map of in-flight requests.
    import "golang.org/x/sync/singleflight"

    var group singleflight.Group

    // getProductPage is meant to run after a cache miss. fetchFromOrigin, cache,
    // and jitteredTTL are assumed to be defined elsewhere in the application.
    func getProductPage(id string) ([]byte, error) {
        key := "product:" + id
        val, err, _ := group.Do(key, func() (interface{}, error) {
            // Only one goroutine executes this block per unique key at a time.
            // All others block in Do and receive the same result.
            page, err := fetchFromOrigin(id)
            if err != nil {
                return nil, err
            }
            cache.Set(key, page, jitteredTTL())
            return page, nil
        })
        if err != nil {
            return nil, err
        }
        return val.([]byte), nil
    }

Within a single process, this reduces N simultaneous origin requests for a key to exactly 1, regardless of concurrency; a fleet of M instances can still send up to M. The tradeoff: every caller blocks until the single in-flight request completes. If the origin is slow, all callers wait. Combine this with stale-while-revalidate to eliminate the blocking period for most requests.
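For readers outside Go, the same idea can be sketched in Python with a lock-protected map of in-flight calls. This is an illustrative sketch, not a port of the Go package: `SingleFlight` and `_Call` are names invented here, and coalescing only applies within one process.

```python
import threading
from typing import Callable, Dict, Optional


class _Call:
    """Bookkeeping for one in-flight fetch."""

    def __init__(self) -> None:
        self.done = threading.Event()
        self.result: object = None
        self.error: Optional[BaseException] = None


class SingleFlight:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._calls: Dict[str, _Call] = {}

    def do(self, key: str, fn: Callable[[], object]) -> object:
        with self._lock:
            call = self._calls.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._calls[key] = call
        if leader:
            try:
                call.result = fn()          # only the leader hits the origin
            except BaseException as exc:
                call.error = exc            # followers see the same failure
            finally:
                with self._lock:
                    del self._calls[key]    # later callers start a fresh fetch
                call.done.set()
        else:
            call.done.wait()                # followers block for the leader's result
        if call.error is not None:
            raise call.error
        return call.result
```

A caller would wrap its miss path as `flight.do("product:" + product_id, lambda: fetch_from_origin(product_id))`, where `fetch_from_origin` is whatever function rebuilds the value.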
Defence 2: Probabilistic early expiration (XFetch)
The XFetch algorithm avoids the blocking problem entirely by triggering a cache refresh before the TTL expires, with a probability that grows as expiry approaches. No distributed lock is required. The key insight is that the refresh fires earlier for keys that are expensive to recompute (large delta), giving more safety margin to the operations that need it most.
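    import math
    import random
    import time

    BASE_TTL = 300  # seconds

    def get_with_xfetch(cache, key: str, beta: float = 1.0):
        # get_with_metadata / set_with_metadata are hypothetical helpers that store
        # (value, absolute expiry, delta); delta is the measured recompute time in seconds.
        value, expiry, delta = cache.get_with_metadata(key)
        # 1 - random() lies in (0, 1], so log() never receives 0.
        # Probability of early recompute increases as expiry approaches.
        early_recompute = -delta * beta * math.log(1.0 - random.random())
        if value is None or time.time() + early_recompute >= expiry:
            start = time.time()
            value = recompute(key)  # recompute() is the application's loader
            delta = time.time() - start
            cache.set_with_metadata(key, value, expiry=time.time() + BASE_TTL, delta=delta)
        return value

beta controls aggressiveness: values above 1 favor earlier recomputation, values below 1 favor later. beta = 1 is the standard recommendation. delta is measured from real previous recomputes, so the algorithm self-calibrates to expensive operations automatically.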
Defence 3: Jittered TTLs
As covered in the TTL section, spreading expiry times across a time window is the simplest first-pass defence. It does not help when a single very popular key expires, but it prevents mass simultaneous expiry across large key populations that were populated in a burst — typically at startup or after a cache flush.
Layer the defences: jitter prevents population-level stampedes; singleflight or XFetch handles individual hot keys.
CDN caching in practice
Edge caching is the highest-ROI single change for most public web applications, but the HTTP semantics are easy to misconfigure in ways that are invisible until something breaks.
The Cache-Control header is the contract. CDNs and browsers follow the response headers you send. The three most useful configurations:
    # Public content: fresh for 60 s, then serve stale up to 30 more seconds during revalidation
    Cache-Control: public, max-age=60, stale-while-revalidate=30

    # Private or dynamic content: never cache
    Cache-Control: private, no-store

    # Cache at the CDN (s-maxage) while browsers revalidate on every use (max-age=0)
    Cache-Control: public, max-age=0, s-maxage=3600, must-revalidate

The Vary header is the most common CDN footgun. Vary: Accept-Encoding is safe: it instructs CDNs to maintain separate entries for compressed and uncompressed responses. Vary: Cookie is a disaster: it creates a distinct cache entry for every unique cookie combination, which in practice means a separate entry per user per URL. Your cache hit ratio drops to near zero. If you need user-specific content at the edge, use edge worker logic to strip or scope cookies before cache lookup, not Vary.
Origin shield / tiered caching. Most CDN providers offer a shield configuration where a dedicated tier of nodes sits between the other edge nodes and your origin. Instead of every global PoP independently missing and hitting your infrastructure, only the shield nodes make origin requests. Cloudflare calls this Tiered Cache; Fastly calls it Origin Shield; AWS CloudFront has Origin Shield as an add-on feature. Enabling this is typically a one-click configuration change that meaningfully reduces origin request volume at high traffic levels.
Cache key normalization. By default, https://example.com/product?id=123&ref=email and https://example.com/product?id=123&ref=social are distinct cache keys, even if ref is irrelevant to the rendered response. Configure your CDN to strip known tracking and analytics parameters:
    # Illustrative cache-key rule (not literal Cloudflare syntax):
    # strip query params that do not affect response content
    cache_key:
      query_string:
        exclude:
          - utm_source
          - utm_medium
          - utm_campaign
          - ref
          - fbclid
          - gclid

Each unnecessary cache key shard reduces your effective hit ratio and wastes CDN storage. On high-traffic properties, fixing cache key normalization often moves hit ratios by double-digit percentages.
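The same normalization can be applied in application code when building keys for an application-level cache. The sketch below uses only the standard library; the parameter list mirrors the one above, and sorting the remaining parameters is an extra assumption that is only safe when your origin does not care about query parameter order.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameters assumed not to affect the rendered response.
TRACKING_PARAMS = frozenset(
    {"utm_source", "utm_medium", "utm_campaign", "ref", "fbclid", "gclid"}
)


def normalized_cache_key(url: str) -> str:
    """Drop tracking parameters and sort the rest so equivalent URLs collapse."""
    parts = urlsplit(url)
    kept = sorted(
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in TRACKING_PARAMS
    )
    # The fragment is never sent to the server, so it is dropped from the key.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


email_key = normalized_cache_key("https://example.com/product?id=123&ref=email")
social_key = normalized_cache_key("https://example.com/product?id=123&ref=social")
# Both collapse to "https://example.com/product?id=123".
```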
Production pitfalls and fixes
These are the failure modes that appear repeatedly in real systems — not in toy examples.
Caching error responses. If your origin returns a 503 and your CDN or application code does not explicitly handle the status, many caches will store the error and serve it to every subsequent request for the full TTL. A cached 503 is a self-inflicted extended outage that survives the actual recovery. The underlying service is back; every user still sees the error because the cache has it locked in.
Fix: in your CDN configuration, set error response TTL to zero. In application code, check the response status before writing to Redis. On error, either skip the cache write entirely or write a very short entry to prevent hammering the origin during an outage.
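In application code, the status check might be sketched like this. The TTL values, the `fetch` and `cache_set` callbacks, and the policy of giving all 4xx and 5xx responses the same short TTL are illustrative assumptions; adjust them to your own error semantics.

```python
from typing import Callable, Optional, Tuple

SUCCESS_TTL = 300  # seconds, for healthy responses
ERROR_TTL = 2      # seconds, just long enough to shield a struggling origin


def ttl_for_status(status: int) -> Optional[int]:
    """Return the TTL to cache a response with, or None to skip caching."""
    if 200 <= status < 300:
        return SUCCESS_TTL
    if 400 <= status < 600:
        return ERROR_TTL
    return None  # redirects and anything unexpected are not cached


def fetch_and_cache(
    key: str,
    fetch: Callable[[], Tuple[int, bytes]],
    cache_set: Callable[[str, Tuple[int, bytes], int], None],
) -> Tuple[int, bytes]:
    status, body = fetch()
    ttl = ttl_for_status(status)
    if ttl is not None:
        # Store the status with the body so readers can tell errors apart.
        cache_set(key, (status, body), ttl)
    return status, body
```

Setting `ERROR_TTL` to `None` in `ttl_for_status` would give the stricter "never cache errors" variant, at the cost of sending every request to the origin during an outage.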
Missing Vary header on locale or user-agent content. If you serve different HTML for Accept-Language: en vs Accept-Language: fr, but do not declare Vary: Accept-Language, the CDN caches the first-served language and returns it to all subsequent users regardless of their preference. This is a quiet correctness bug that only manifests in production with real geographic diversity. The same pattern applies to device-type differentiation — if you render a different layout for mobile, you need either Vary: User-Agent (with careful normalization, since the User-Agent string space is enormous) or a separate URL structure.
Cold-start thundering herd after a deploy or restart. When you deploy a new application version or restart a Redis-backed service fleet, the application-level cache is empty. All instances simultaneously miss on the same keys and stampede the database. This is especially severe if your database connection pool is sized for cached load — and it usually is.
Fix: warm the cache explicitly in your deploy pipeline before routing production traffic. A pre-warm script that fetches the top-N most-read keys is usually sufficient. Combine this with a gradual traffic shift (canary routing or weighted load balancing) so the cache warms under partial load before taking full traffic.
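A pre-warm step could be sketched as below. The `load` and `store` callbacks and the way the top-N key list is obtained are assumptions; the point of the sketch is the bounded worker pool, so that warming does not itself stampede the origin.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Optional


def warm_cache(
    keys: Iterable[str],
    load: Callable[[str], object],
    store: Callable[[str, object], None],
    max_workers: int = 4,
) -> List[str]:
    """Load each key from the origin and store it; return the keys that failed."""

    def warm_one(key: str) -> Optional[str]:
        try:
            store(key, load(key))
            return None
        except Exception:
            return key  # report the failure, keep warming the rest

    failed: List[str] = []
    # max_workers caps concurrent origin requests during the warm-up.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for result in pool.map(warm_one, list(keys)):
            if result is not None:
                failed.append(result)
    return failed
```

A deploy pipeline might call this with the most-read keys from the previous day and refuse to shift traffic if the returned failure list is longer than some threshold.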
Memory pressure and silent eviction. Redis evicts keys when it approaches its maxmemory limit according to its configured policy. The default in open-source Redis is noeviction: once the limit is reached, Redis rejects new writes with an out-of-memory error, which breaks application code that does not handle failed writes. Managed offerings often ship a different default, so check rather than assume. Other policies (allkeys-lru, volatile-lru) silently drop keys with no error. If you are not tracking the eviction rate, you may not know that Redis is quietly discarding cache entries your application assumes are present.
    # Inspect current eviction policy and total evictions since last restart
    redis-cli CONFIG GET maxmemory-policy
    redis-cli INFO stats | grep evicted_keys

Fix: set maxmemory-policy explicitly and deliberately for your use case. Track evicted_keys as a first-class dashboard metric. Size Redis to hold your working set with at least 20% headroom, accounting for key metadata overhead (approximately 50–100 bytes per key beyond the value itself).
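Because evicted_keys is a cumulative counter, a dashboard needs the rate between two samples rather than the raw value. A minimal sketch, assuming you already collect the text output of `INFO stats` at a fixed interval (the function names here are illustrative):

```python
from typing import Dict


def parse_info(text: str) -> Dict[str, str]:
    """Parse INFO output ("name:value" lines, "# Section" headers) into a dict."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        name, _, value = line.partition(":")
        fields[name] = value
    return fields


def evictions_per_second(prev_info: str, curr_info: str, interval_s: float) -> float:
    prev = int(parse_info(prev_info)["evicted_keys"])
    curr = int(parse_info(curr_info)["evicted_keys"])
    if curr < prev:
        # The counter resets on restart; count only what the new process reports.
        return curr / interval_s
    return (curr - prev) / interval_s


before = "# Stats\r\nkeyspace_hits:100\r\nevicted_keys:40\r\n"
after = "# Stats\r\nkeyspace_hits:900\r\nevicted_keys:100\r\n"
rate = evictions_per_second(before, after, 60)  # 60 evictions over 60 s
```

A sustained non-zero rate on a cache you believed was sized for its working set is the signal worth alerting on.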
Inconsistent cache key construction. user:123:profile and user:123:Profile are different keys. product:id=123 and product:123 are different keys. When key construction is spread across multiple services or ad hoc string formatting at call sites, drift is inevitable and usually silent.
    // Single source of truth for all cache key shapes
    export const CacheKey = {
      product: (id: string) => `product:v2:${id}`,
      userProfile: (userId: string) => `user:v1:${userId}:profile`,
      catalogPage: (category: string, page: number) => `catalog:v1:${category}:${page}`,
    } as const;

Caching per-user data under a shared key. This is the security-incident variant of incorrect key scoping. If two code paths produce the same cache key for logically different users (for example, a route that caches by URL path but is actually user-specific), then user A will receive user B's cached response. This is a data leak.
Fix: before writing to an application cache, ask: if the wrong user received this response, what would be the impact? If the answer is anything other than "nothing bad," the cache key must include the authenticated user identifier. Treat shared-cache keys as public data by definition — because under failure, they are.
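One way to enforce that rule is to route every response-cache key through a helper that fails closed. The sketch below is illustrative: the function name, the key prefixes, and the `user_specific` flag are assumptions, not an established API.

```python
from typing import Optional


def response_cache_key(path: str, user_id: Optional[str], user_specific: bool) -> str:
    """Build a cache key, refusing to cache user-specific data without a user ID."""
    if not user_specific:
        return f"public:v1:{path}"
    if not user_id:
        # Fail closed: an error here is far cheaper than a cross-user leak.
        raise ValueError(f"user-specific route {path!r} cached without a user ID")
    return f"private:v1:{user_id}:{path}"


shared = response_cache_key("/pricing", None, user_specific=False)
alice = response_cache_key("/dashboard", "u-alice", user_specific=True)
bob = response_cache_key("/dashboard", "u-bob", user_specific=True)
# alice and bob get distinct keys for the same path; /pricing is shared.
```

Marking routes as user-specific at declaration time, rather than at each call site, keeps the decision reviewable in one place.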
Treating the cache as a primary data store. A cache should make a healthy system faster, not be the thing a system cannot survive without. A Redis failover, a CDN purge after a bad deploy, or a cold restart should cause a temporary latency increase — not an outage. If your system cannot function at all with an empty cache, you have built a fragile dependency rather than an acceleration layer. Size your origin connection pools for un-cached load on your top hot paths, test cold-start behavior in staging, and know which queries are too slow to serve cold. If a query takes 8 seconds without cache, the fix is an index or denormalization — not more aggressive caching to hide the problem.