Why caching is a performance pattern (and why it can backfire)
Caching is the deliberate reuse of previously computed or previously fetched results to reduce repeated work. In polyglot systems, caching is often the fastest “optimization” because it can remove entire code paths: fewer database round-trips, fewer network calls, fewer expensive computations, fewer allocations for repeated transformations. The trade-off is that caches consume memory, can return stale data, and can create hidden latency spikes when they miss or when they evict aggressively.
This chapter focuses on patterns you can apply in Python, Ruby, Java, and C to build caches that are fast, safe, and memory-conscious. The goal is not “cache everything,” but to cache the right things with explicit bounds, predictable eviction, and data structures that minimize overhead.
Core questions to answer before implementing a cache
- What is the key? Make it stable, hashable, and cheap to compute. If key creation is expensive, you may need to cache key derivation too.
- What is the value? Prefer compact representations; avoid storing large object graphs when a compact struct/tuple/byte array would do.
- What is the freshness model? Time-based (TTL), version-based (etag/revision), or event-based invalidation.
- What is the bound? Maximum entries, maximum bytes, or both. Unbounded caches are memory leaks with better marketing.
- What is the concurrency model? Single-threaded, multi-threaded, multi-process, or distributed. This affects locking and stampede control.
- What is the miss behavior? Do you compute synchronously, asynchronously, or return a fallback?
Caching patterns you can reuse across languages
1) Read-through cache (cache-aside with loader)
Read-through caching means callers ask the cache for a key; on miss, the cache loads the value (via a loader function), stores it, and returns it. This centralizes caching logic and makes it harder to forget to populate the cache.
Step-by-step:
- Define a key type and normalization rules (e.g., lowercase, trimmed, canonicalized IDs).
- Implement get(key) to check the cache map.
- On miss, call loader(key).
- Store the result with metadata (timestamp, size, version).
- Return the value.
Common pitfalls: caching failures (e.g., caching exceptions), caching partial results, and stampedes when many threads miss simultaneously.
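To make the steps concrete, here is a minimal single-threaded sketch in Python; the loader callable and the (value, stored_at) entry shape are illustrative choices, and bounds, TTL, and stampede control are layered on in the patterns below.
import time

class ReadThroughCache:
    # Minimal read-through sketch: no TTL, no size bound, not thread-safe.
    def __init__(self, loader):
        self._loader = loader      # loader(key) -> value, called on miss
        self._data = {}            # key -> (value, stored_at)

    def get(self, key):
        hit = self._data.get(key)
        if hit is not None:
            return hit[0]
        value = self._loader(key)  # centralizing the load means callers cannot forget to populate
        self._data[key] = (value, time.time())
        return value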
2) Write-through and write-behind
Write-through updates the cache and the backing store together. It improves read consistency but increases write latency. Write-behind (write-back) updates the cache immediately and flushes to the store asynchronously; it improves write latency but risks data loss on crash and complicates consistency.
Use write-through for correctness-critical data and write-behind for derived or reconstructable data (or when you have durable queues/journals).
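As a rough sketch of the difference (Python; backing_store and its put method are an assumed interface, and real write-behind needs batching, retries, and a durable journal):
import queue
import threading

class WriteThroughStore:
    def __init__(self, backing_store):
        self._store = backing_store
        self._cache = {}

    def put(self, key, value):
        self._store.put(key, value)      # durable write first...
        self._cache[key] = value         # ...then cache: consistent reads, full write latency

class WriteBehindStore:
    def __init__(self, backing_store):
        self._store = backing_store
        self._cache = {}
        self._pending = queue.Queue()
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def put(self, key, value):
        self._cache[key] = value         # fast path: cache only
        self._pending.put((key, value))  # flushed asynchronously; lost on crash unless journaled

    def _flush_loop(self):
        while True:
            key, value = self._pending.get()
            self._store.put(key, value)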
3) TTL + size bound (two-dimensional control)
TTL controls staleness; size bounds control memory. Relying on TTL alone can still blow memory if keys are unbounded. Relying on size alone can keep stale data forever. Combining both gives predictable behavior.
4) Negative caching
Negative caching stores “not found” results (or empty results) for a short TTL to prevent repeated expensive misses (e.g., repeated lookups for absent users). Keep TTL short and be careful with authorization-sensitive queries (a “not found” for one principal may be “found” for another).
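A minimal sketch, assuming the loader returns None for "not found"; the TTL values are arbitrary placeholders:
import time

_NOT_FOUND = object()  # sentinel: distinguishes "cached absence" from "not cached at all"

class NegativeCache:
    def __init__(self, ttl_found=300.0, ttl_missing=5.0):
        self._ttl_found = ttl_found
        self._ttl_missing = ttl_missing  # keep the negative TTL short
        self._data = {}                  # key -> (value_or_sentinel, expires_at)

    def get(self, key, loader):
        now = time.monotonic()
        hit = self._data.get(key)
        if hit is not None and hit[1] > now:
            return None if hit[0] is _NOT_FOUND else hit[0]
        value = loader(key)              # returns None when the key does not exist
        if value is None:
            self._data[key] = (_NOT_FOUND, now + self._ttl_missing)
        else:
            self._data[key] = (value, now + self._ttl_found)
        return value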
5) Cache stampede control (single-flight)
When a hot key expires, many concurrent requests can trigger the same expensive load. Single-flight ensures only one loader runs per key while others wait for the result.
Patterns: per-key locks, futures/promises per key, or “in-flight” markers in the cache.
6) Segmented caches (hot vs warm)
Use a small, fast “hot” cache (e.g., tiny LRU) and a larger “warm” cache (e.g., larger LRU or TTL map). This reduces churn and improves hit rate for truly hot keys while keeping memory bounded.
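One way this can look in Python, with illustrative tier sizes; evicted hot entries are simply dropped here, though they could be demoted back to the warm tier instead:
from collections import OrderedDict

class TwoTierCache:
    def __init__(self, hot_entries=128, warm_entries=8192):
        self._hot = OrderedDict()    # tiny LRU for truly hot keys
        self._warm = OrderedDict()   # larger LRU for everything else
        self._hot_max = hot_entries
        self._warm_max = warm_entries

    def get(self, key, loader):
        if key in self._hot:
            self._hot.move_to_end(key)
            return self._hot[key]
        if key in self._warm:
            value = self._warm.pop(key)                    # repeat access: promote warm -> hot
            self._put(self._hot, key, value, self._hot_max)
            return value
        value = loader(key)
        self._put(self._warm, key, value, self._warm_max)  # new keys start in warm
        return value

    @staticmethod
    def _put(tier, key, value, max_entries):
        tier[key] = value
        tier.move_to_end(key)
        while len(tier) > max_entries:
            tier.popitem(last=False)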
Eviction policies and when to use them
LRU (Least Recently Used)
Good default for workloads with temporal locality. Implementation usually requires a hash map + doubly linked list (or language-provided ordered map). Memory overhead can be significant because each entry needs pointers/links.
LFU (Least Frequently Used)
Better when some keys are consistently hot over long periods. More complex and can be expensive to update frequency counters. Approximate LFU (e.g., TinyLFU) is common in high-performance caches.
FIFO / Random
Simple and low overhead. Random eviction can be surprisingly effective and avoids maintaining recency metadata, which can reduce CPU and memory overhead.
Clock / Second-chance
Approximate LRU with lower overhead. Useful when you need LRU-like behavior without heavy pointer chasing.
Memory-conscious data structures for caches
Store compact values, not object graphs
Cache entries often outlive request-scoped objects. If you store a rich object graph, you keep many objects alive and increase memory overhead. Prefer compact representations:
- Store IDs and small immutable fields instead of full domain objects.
- Store packed bytes (e.g., struct-packed binary, JSON bytes, or protocol buffers) when appropriate.
- Store tuples/structs rather than nested dictionaries/hashes.
Interning and canonicalization
If keys repeat (e.g., same strings), canonicalize them to reduce duplicates. In Java, string interning can help but must be used carefully to avoid filling the intern pool with unbounded data. In Python, consider using small integers or enums instead of repeated strings for internal keys.
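For example, in CPython you can canonicalize and intern repeated key strings; as with Java's intern pool, only do this for bounded, trusted key sets:
import sys

def canonical_key(raw: str) -> str:
    # Normalize, then intern so repeated keys share a single string object (CPython).
    return sys.intern(raw.strip().lower())

k1 = canonical_key("  User-42 ")
k2 = canonical_key("user-42")
assert k1 == k2 and k1 is k2  # one copy in memory; equality checks can short-circuit on identity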
Avoid per-entry overhead where possible
Many cache implementations pay overhead per entry: node objects, wrappers, timestamps, locks. If you have millions of entries, overhead dominates. Consider:
- Storing metadata in parallel arrays (struct-of-arrays) in C.
- Using primitive maps in Java (fastutil, HPPC) when keys/values are primitives.
- Using compact tuples and __slots__ in Python for entry objects if you must allocate them.
- In Ruby, preferring arrays/structs over nested hashes for entry payloads.
Byte-size accounting
Entry count limits are easy but can be misleading when values vary in size. For memory-conscious caches, track approximate byte size per entry and enforce a maximum total bytes budget. Exact size is hard in managed languages; approximate size is usually sufficient if you’re consistent.
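One possible approximation in Python; the UTF-8 assumption and the container fudge factor are arbitrary, and sys.getsizeof reports only shallow size:
import sys

def approx_weight(value):
    # Rough byte estimate for budget accounting; consistency matters more than exactness.
    if isinstance(value, (bytes, bytearray)):
        return len(value)
    if isinstance(value, str):
        return len(value.encode("utf-8"))
    if isinstance(value, (list, tuple)):
        return 64 + sum(approx_weight(v) for v in value)  # 64 = crude container overhead guess
    return sys.getsizeof(value)                           # shallow size only
A cache using this would track the running total and evict while it exceeds the byte budget, as the Java weigher example does later in this chapter.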
Python: practical caching with bounds and stampede control
Pattern: bounded LRU with TTL
Python’s functools.lru_cache is convenient but does not support TTL and can keep large values alive until eviction. For TTL + size, implement a small cache object that stores (value, expires_at) per entry (plus a size estimate if you need a byte budget) and evicts by LRU.
import time, threading, collections

class TTLRUCache:
    def __init__(self, max_entries=1024, ttl_seconds=60):
        self.max_entries = max_entries
        self.ttl = ttl_seconds
        self._lock = threading.Lock()
        self._data = collections.OrderedDict()  # key -> (value, expires_at)

    def get(self, key, loader):
        now = time.monotonic()  # monotonic clock: immune to wall-clock adjustments
        with self._lock:
            item = self._data.get(key)
            if item is not None:
                value, exp = item
                if exp > now:
                    self._data.move_to_end(key)
                    return value
                # expired: drop lazily
                self._data.pop(key, None)
        # Load outside the lock to avoid blocking other keys
        value = loader(key)
        exp = now + self.ttl
        with self._lock:
            self._data[key] = (value, exp)
            self._data.move_to_end(key)
            while len(self._data) > self.max_entries:
                self._data.popitem(last=False)
        return value
Step-by-step notes:
- Use OrderedDict to maintain recency with low overhead.
- Check expiry on read; remove expired entries lazily.
- Load outside the lock to reduce contention, but this can cause stampedes.
Adding single-flight (per-key in-flight futures)
To prevent stampedes, track in-flight loads per key. One thread loads; others wait for the result.
import threading

class SingleFlight:
    def __init__(self):
        self._lock = threading.Lock()
        self._calls = {}  # key -> [Event, result, error]; mutated in place so waiters see the outcome

    def do(self, key, fn):
        with self._lock:
            call = self._calls.get(key)
            if call is None:
                call = [threading.Event(), None, None]
                self._calls[key] = call
                leader = True
            else:
                leader = False
        if leader:
            try:
                call[1] = fn()
            except Exception as e:
                call[2] = e
            finally:
                with self._lock:
                    # cleanup before signaling; waiters hold their own reference to `call`
                    del self._calls[key]
                call[0].set()
        else:
            call[0].wait()  # wait for the leader's result instead of re-running the loader
        if call[2] is not None:
            raise call[2]
        return call[1]
In practice, you combine this with the cache: on a miss, run the loader through single-flight so only one thread loads; every waiter receives the same result, which the caller then stores in the cache.
Memory-conscious entry payloads
If you cache many entries, avoid storing large dicts per entry. Use tuples or small dataclasses with slots=True (Python 3.10+ supports __slots__ in dataclasses) to reduce per-object overhead.
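For illustration, a slotted dataclass entry (field names are arbitrary):
from dataclasses import dataclass

@dataclass(slots=True)  # Python 3.10+: instances get __slots__ and no per-instance __dict__
class Entry:
    value: bytes
    expires_at: float
    weight: int

entry = Entry(value=b"...", expires_at=0.0, weight=3)
# Compared with a dict payload like {"value": ..., "expires_at": ..., "weight": ...},
# the slotted instance avoids a per-entry dict, which adds up across millions of entries.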
Ruby: LRU/TTL and compact payloads
Pattern: LRU/TTL with an insertion-ordered Hash (no explicit linked list)
Ruby’s Hash preserves insertion order, which can be used for simple LRU-like behavior by deleting and reinserting keys. For moderate sizes, this is often sufficient and avoids implementing a linked list.
class TTLCache
  def initialize(max_entries:, ttl_seconds:)
    @max = max_entries
    @ttl = ttl_seconds
    @h = {} # key => [value, expires_at]
  end

  def get(key)
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    v = @h[key]
    if v
      value, exp = v
      if exp > now
        # refresh recency: delete + reinsert
        @h.delete(key)
        @h[key] = [value, exp]
        return value
      else
        @h.delete(key)
      end
    end
    value = yield
    exp = now + @ttl
    @h[key] = [value, exp]
    evict!
    value
  end

  def evict!
    while @h.length > @max
      oldest_key, _ = @h.first
      @h.delete(oldest_key)
    end
  end
end
Step-by-step notes:
- Use monotonic time for TTL to avoid issues with wall-clock changes.
- Delete+reinsert updates recency without extra node objects.
- Eviction by removing @h.first is O(1) for an insertion-ordered Hash.
Memory-conscious Ruby values
- Prefer arrays over hashes for fixed-shape payloads (e.g., [id, status, ts]).
- Freeze immutable cached strings to reduce accidental duplication and to make sharing safer.
- Be careful with symbol creation from untrusted/unbounded input; symbols can behave like interned identifiers and may not be reclaimed promptly in older runtimes.
Java: bounded caches, approximate sizing, and reference-aware patterns
Use a production-grade cache when possible
In Java, a well-tuned cache library (commonly Caffeine) provides high-performance eviction (W-TinyLFU), async loading, refresh-after-write, and size-based eviction with weigher functions. The key memory-conscious feature is that you can bound by entry count or weighted size and let the cache handle concurrency efficiently.
Even if you implement your own, the patterns are the same: explicit bounds, efficient eviction, and stampede control.
Pattern: size-bounded cache with a weigher
When values vary widely, use a weight function to approximate memory cost (e.g., bytes of a byte array, length of a string, number of elements). The cache evicts until total weight is under the limit.
// Pseudocode illustrating the idea; use a cache library for production.
interface Weigher<V> { int weightOf(V v); }

class WeightedCache<K,V> {
    final long maxWeight;
    long currentWeight = 0;
    final java.util.LinkedHashMap<K,Entry<V>> map;
    final Weigher<V> weigher;

    static class Entry<V> { V v; long expNanos; int w; }

    WeightedCache(long maxWeight, Weigher<V> weigher) {
        this.maxWeight = maxWeight;
        this.weigher = weigher;
        this.map = new java.util.LinkedHashMap<>(16, 0.75f, true);
    }

    synchronized V get(K k, java.util.function.Function<K,V> loader, long ttlNanos) {
        long now = System.nanoTime();
        Entry<V> e = map.get(k);
        if (e != null && e.expNanos > now) return e.v;
        if (e != null) removeEntry(k, e);
        V v = loader.apply(k);
        Entry<V> ne = new Entry<>();
        ne.v = v;
        ne.expNanos = now + ttlNanos;
        ne.w = weigher.weightOf(v);
        map.put(k, ne);
        currentWeight += ne.w;
        evict();
        return v;
    }

    void removeEntry(K k, Entry<V> e) {
        map.remove(k);
        currentWeight -= e.w;
    }

    void evict() {
        var it = map.entrySet().iterator();
        while (currentWeight > maxWeight && it.hasNext()) {
            var victim = it.next();
            currentWeight -= victim.getValue().w;
            it.remove();
        }
    }
}
Memory-conscious choices:
- Use LinkedHashMap in access-order mode to implement LRU without separate node objects.
- Store primitive metadata (expiry, weight) alongside the value to avoid wrapper allocations.
- Use a weigher to approximate memory usage and enforce a byte budget.
Reference-aware caching (weak/soft references)
Java offers WeakReference and SoftReference. Weak references are cleared aggressively; soft references are cleared under memory pressure. These can be useful for memoizing derived data where recomputation is acceptable. However, reference-based caches can be unpredictable: eviction depends on GC heuristics, not your policy. Prefer explicit bounds for latency predictability; use weak/soft references only for “nice-to-have” caches.
C: explicit cache structures with tight memory control
Pattern: fixed-capacity hash table + clock eviction
In C, you can build caches with very low overhead by using fixed-size arrays and storing entries in contiguous memory. This improves locality and makes memory usage deterministic. A common approach is:
- Fixed-capacity hash table mapping keys to indices.
- Entry array storing key, value pointer/inline value, expiry, and a reference bit.
- Clock hand for eviction (second-chance): scan entries; if ref bit is set, clear it and continue; if clear, evict.
Step-by-step:
- Choose a maximum number of entries and maximum value size (or store pointers to separately allocated values with a byte budget).
- Define an entry struct with compact fields (use 32-bit where possible).
- Use monotonic time for TTL.
- On access, set the entry’s reference bit.
- On insert when full, run the clock eviction loop until you find a victim.
#include <stdint.h>
#include <string.h>
#include <time.h>

typedef struct {
    uint64_t key_hash;
    uint32_t key_len;
    uint32_t val_len;
    uint64_t exp_ms;
    uint8_t ref;
    uint8_t used;
    // For simplicity: inline small values (fixed max)
    char key[32];
    char val[128];
} entry_t;

typedef struct {
    entry_t *entries;
    uint32_t cap;
    uint32_t hand;
} clock_cache_t;

// Monotonic clock in milliseconds (POSIX clock_gettime).
static uint64_t now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000u + (uint64_t)ts.tv_nsec / 1000000u;
}

// Pseudocode: lookup by linear probing on key_hash; compare key bytes.
// On hit: set ref=1, check exp_ms.
// On insert: if full, evict using clock hand.
static uint32_t evict_one(clock_cache_t *c) {
    for (;;) {
        entry_t *e = &c->entries[c->hand];
        uint32_t idx = c->hand;
        c->hand = (c->hand + 1) % c->cap;
        if (!e->used) return idx;                                // free slot
        if (e->exp_ms <= now_ms()) { e->used = 0; return idx; }  // expired: reclaim
        if (e->ref) { e->ref = 0; continue; }                    // second chance
        e->used = 0;                                             // victim
        return idx;
    }
}
Memory-conscious choices:
- Inline small keys/values to avoid per-entry heap allocations and fragmentation.
- Use fixed-size buffers only when you can cap sizes; otherwise store pointers and maintain a separate slab/arena allocator with a byte budget.
- Prefer approximate LRU (clock) to avoid pointer-heavy linked lists.
Byte-budgeted value storage with arenas
If values vary in size, store them in an arena (a large buffer) and keep offsets/lengths in entries. When evicting, you can reclaim space by using a ring buffer arena (append-only with wraparound) and invalidating old offsets when overwritten. This is a practical approach for caches where values are immutable and can be recomputed.
Cross-language techniques for memory-conscious caching
Key normalization to improve hit rate and reduce duplicates
Normalize keys so semantically identical requests map to the same entry. Examples: canonicalize URL query parameter ordering, normalize case where appropriate, round timestamps to buckets, or map equivalent IDs to a canonical form. This increases hit rate and reduces the number of entries needed for the same benefit.
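A sketch of URL-key canonicalization in Python; what counts as "equivalent" (case, fragments, default parameters) is application-specific:
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def canonical_url_key(url: str) -> str:
    # Lowercase scheme/host, sort query parameters, drop the fragment.
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), parts.path, query, ""))

canonical_url_key("https://API.example.com/search?b=2&a=1")
# -> "https://api.example.com/search?a=1&b=2"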
Sharding to reduce lock contention
Instead of one global lock, shard the cache into N independent segments by hashing the key and selecting a shard. Each shard has its own map and eviction policy. This improves throughput in multi-threaded Python (despite the GIL, I/O-heavy apps still benefit), Ruby (depending on runtime), Java, and C.
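A minimal sharded cache sketch in Python; the shard count and per-shard bound are illustrative, and hot keys would still want single-flight around the loader:
import threading
from collections import OrderedDict

class ShardedCache:
    def __init__(self, shards=16, max_entries_per_shard=4096):
        self._shards = [
            {"lock": threading.Lock(), "data": OrderedDict()} for _ in range(shards)
        ]
        self._max = max_entries_per_shard

    def _shard(self, key):
        return self._shards[hash(key) % len(self._shards)]

    def get(self, key, loader):
        shard = self._shard(key)
        with shard["lock"]:                  # contention is limited to one shard
            data = shard["data"]
            if key in data:
                data.move_to_end(key)
                return data[key]
        value = loader(key)                  # load outside the lock
        with shard["lock"]:
            data = shard["data"]
            data[key] = value
            data.move_to_end(key)
            while len(data) > self._max:
                data.popitem(last=False)
        return value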
Admission control: don’t cache everything that misses
Some keys are one-offs and will never be reused. Caching them wastes memory and increases eviction churn. Admission control techniques include:
- Only cache after the second request (two-hit policy); see the sketch after this list.
- Use a small “probation” cache; promote to main cache on repeat access.
- Reject entries above a size threshold (don’t let one huge value evict many useful small ones).
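A sketch of the two-hit policy combined with a size threshold; the wrapped cache's put method and the limits are assumptions for illustration:
from collections import OrderedDict

class TwoHitAdmission:
    def __init__(self, cache, seen_max=65536, max_value_bytes=1_000_000):
        self._cache = cache            # any cache exposing put(key, value) (assumed interface)
        self._seen = OrderedDict()     # probation set: keys that have missed exactly once
        self._seen_max = seen_max      # bound the probation set too, or it becomes its own leak
        self._max_value_bytes = max_value_bytes

    def maybe_admit(self, key, value, size_bytes):
        if size_bytes > self._max_value_bytes:
            return False               # never let one huge value evict many small ones
        if key in self._seen:
            del self._seen[key]
            self._cache.put(key, value)  # second request: admit to the main cache
            return True
        self._seen[key] = None         # first request: remember the key, don't cache the value
        if len(self._seen) > self._seen_max:
            self._seen.popitem(last=False)
        return False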
Refresh-ahead (stale-while-revalidate)
To avoid latency spikes on expiry, serve slightly stale data while refreshing in the background. This pattern is especially useful for hot keys with expensive loaders.
Step-by-step (a sketch follows the list):
- Store value with two timestamps: soft expiry (refresh time) and hard expiry (must not serve beyond).
- On read: if before soft expiry, serve normally.
- If between soft and hard expiry: serve cached value and trigger async refresh (single-flight).
- If beyond hard expiry: block and reload (or fail fast with fallback).
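A sketch of refresh-ahead in Python; spawning a thread per refresh is illustrative only (a real implementation would use a worker pool), and the TTL values are placeholders:
import threading
import time

class RefreshAheadCache:
    def __init__(self, loader, soft_ttl=30.0, hard_ttl=300.0):
        self._loader = loader
        self._soft = soft_ttl
        self._hard = hard_ttl
        self._data = {}             # key -> (value, loaded_at)
        self._refreshing = set()    # keys with an in-flight background refresh (single-flight)
        self._lock = threading.Lock()

    def get(self, key):
        now = time.monotonic()
        with self._lock:
            hit = self._data.get(key)
        if hit is not None:
            value, loaded_at = hit
            age = now - loaded_at
            if age < self._soft:
                return value                 # fresh: serve normally
            if age < self._hard:
                self._refresh_async(key)     # stale but servable: refresh in the background
                return value
        return self._load(key)               # missing or beyond hard expiry: block and reload

    def _load(self, key):
        value = self._loader(key)
        with self._lock:
            self._data[key] = (value, time.monotonic())
        return value

    def _refresh_async(self, key):
        with self._lock:
            if key in self._refreshing:
                return
            self._refreshing.add(key)

        def worker():
            try:
                self._load(key)
            finally:
                with self._lock:
                    self._refreshing.discard(key)

        threading.Thread(target=worker, daemon=True).start()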
Cache invalidation by versioning
Instead of deleting keys, incorporate a version into the key: user:123:v17. When the user changes, bump the version. This avoids complex invalidation logic and reduces race conditions. The cost is that old versions remain until TTL/eviction, so you must keep bounds tight.
Choosing what to cache: derived vs source data
Cache derived results (e.g., computed aggregates, formatted payloads) when recomputation is expensive and the derived data is smaller than the source. Cache source data (e.g., raw records) when it is reused across many derived computations. Memory-conscious design often favors caching the smallest representation that can be expanded quickly.
Practical checklist: implementing a safe, bounded cache
- Define bounds: max entries and/or max weight (bytes).
- Define freshness: TTL, versioning, or event invalidation.
- Prevent stampedes: single-flight per key for expensive loads.
- Choose eviction: LRU for temporal locality, clock for low overhead, LFU/approx LFU for stable hot sets.
- Make payloads compact: avoid storing large object graphs; store compact structs/tuples/bytes.
- Account for large values: size thresholds, weighting, or separate “large object” handling.
- Use monotonic time for TTL decisions.
- Plan for failure: decide whether to cache errors, and for how long (usually short negative caching).
- Test eviction behavior: verify that memory stays within budget under adversarial key patterns.