Why caching is a performance pattern (and why it can backfire)
Caching is the deliberate reuse of previously computed or previously fetched results to reduce repeated work. In polyglot systems, caching is often the fastest “optimization” because it can remove entire code paths: fewer database round-trips, fewer network calls, fewer expensive computations, fewer allocations for repeated transformations. The trade-off is that caches consume memory, can return stale data, and can create hidden latency spikes when they miss or when they evict aggressively.
This chapter focuses on patterns you can apply in Python, Ruby, Java, and C to build caches that are fast, safe, and memory-conscious. The goal is not “cache everything,” but to cache the right things with explicit bounds, predictable eviction, and data structures that minimize overhead.
Core questions to answer before implementing a cache
- What is the key? Make it stable, hashable, and cheap to compute. If key creation is expensive, you may need to cache key derivation too.
- What is the value? Prefer compact representations; avoid storing large object graphs when a compact struct/tuple/byte array would do.
- What is the freshness model? Time-based (TTL), version-based (etag/revision), or event-based invalidation.
- What is the bound? Maximum entries, maximum bytes, or both. Unbounded caches are memory leaks with better marketing.
- What is the concurrency model? Single-threaded, multi-threaded, multi-process, or distributed. This affects locking and stampede control.
- What is the miss behavior? Do you compute synchronously, asynchronously, or return a fallback?
Caching patterns you can reuse across languages
1) Read-through cache (cache-aside with loader)
Read-through caching means callers ask the cache for a key; on miss, the cache loads the value (via a loader function), stores it, and returns it. This centralizes caching logic and makes it harder to forget to populate the cache.
Step-by-step:
- Define a key type and normalization rules (e.g., lowercase, trimmed, canonicalized IDs).
- Implement get(key) to check the cache map.
- On miss, call loader(key).
- Store the result with metadata (timestamp, size, version).
- Return the value.
Common pitfalls: caching failures (e.g., caching exceptions), caching partial results, and stampedes when many threads miss simultaneously.
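To make the steps concrete, here is a minimal single-threaded sketch in Python; the loader callable and the (value, stored_at) entry shape are illustrative choices, and bounds, TTL, and stampede control are layered on in the patterns below.
import time

class ReadThroughCache:
    # Minimal read-through sketch: no TTL, no size bound, not thread-safe.
    def __init__(self, loader):
        self._loader = loader      # loader(key) -> value, called on miss
        self._data = {}            # key -> (value, stored_at)

    def get(self, key):
        hit = self._data.get(key)
        if hit is not None:
            return hit[0]
        value = self._loader(key)  # centralizing the load means callers cannot forget to populate
        self._data[key] = (value, time.time())
        return value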
2) Write-through and write-behind
Write-through updates the cache and the backing store together. It improves read consistency but increases write latency. Write-behind (write-back) updates the cache immediately and flushes to the store asynchronously; it improves write latency but risks data loss on crash and complicates consistency.
Use write-through for correctness-critical data and write-behind for derived or reconstructable data (or when you have durable queues/journals).
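As a rough sketch of the difference (Python; backing_store and its put method are an assumed interface, and real write-behind needs batching, retries, and a durable journal):
import queue
import threading

class WriteThroughStore:
    def __init__(self, backing_store):
        self._store = backing_store
        self._cache = {}

    def put(self, key, value):
        self._store.put(key, value)      # durable write first...
        self._cache[key] = value         # ...then cache: consistent reads, full write latency

class WriteBehindStore:
    def __init__(self, backing_store):
        self._store = backing_store
        self._cache = {}
        self._pending = queue.Queue()
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def put(self, key, value):
        self._cache[key] = value         # fast path: cache only
        self._pending.put((key, value))  # flushed asynchronously; lost on crash unless journaled

    def _flush_loop(self):
        while True:
            key, value = self._pending.get()
            self._store.put(key, value)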
3) TTL + size bound (two-dimensional control)
TTL controls staleness; size bounds control memory. Relying on TTL alone can still blow memory if keys are unbounded. Relying on size alone can keep stale data forever. Combining both gives predictable behavior.
4) Negative caching
Negative caching stores “not found” results (or empty results) for a short TTL to prevent repeated expensive misses (e.g., repeated lookups for absent users). Keep TTL short and be careful with authorization-sensitive queries (a “not found” for one principal may be “found” for another).
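A minimal sketch, assuming the loader returns None for "not found"; the TTL values are arbitrary placeholders:
import time

_NOT_FOUND = object()  # sentinel: distinguishes "cached absence" from "not cached at all"

class NegativeCache:
    def __init__(self, ttl_found=300.0, ttl_missing=5.0):
        self._ttl_found = ttl_found
        self._ttl_missing = ttl_missing  # keep the negative TTL short
        self._data = {}                  # key -> (value_or_sentinel, expires_at)

    def get(self, key, loader):
        now = time.monotonic()
        hit = self._data.get(key)
        if hit is not None and hit[1] > now:
            return None if hit[0] is _NOT_FOUND else hit[0]
        value = loader(key)              # returns None when the key does not exist
        if value is None:
            self._data[key] = (_NOT_FOUND, now + self._ttl_missing)
        else:
            self._data[key] = (value, now + self._ttl_found)
        return value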
5) Cache stampede control (single-flight)
When a hot key expires, many concurrent requests can trigger the same expensive load. Single-flight ensures only one loader runs per key while others wait for the result.
Patterns: per-key locks, futures/promises per key, or “in-flight” markers in the cache.
6) Segmented caches (hot vs warm)
Use a small, fast “hot” cache (e.g., tiny LRU) and a larger “warm” cache (e.g., larger LRU or TTL map). This reduces churn and improves hit rate for truly hot keys while keeping memory bounded.
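One way this can look in Python, with illustrative tier sizes; evicted hot entries are simply dropped here, though they could be demoted back to the warm tier instead:
from collections import OrderedDict

class TwoTierCache:
    def __init__(self, hot_entries=128, warm_entries=8192):
        self._hot = OrderedDict()    # tiny LRU for truly hot keys
        self._warm = OrderedDict()   # larger LRU for everything else
        self._hot_max = hot_entries
        self._warm_max = warm_entries

    def get(self, key, loader):
        if key in self._hot:
            self._hot.move_to_end(key)
            return self._hot[key]
        if key in self._warm:
            value = self._warm.pop(key)                    # repeat access: promote warm -> hot
            self._put(self._hot, key, value, self._hot_max)
            return value
        value = loader(key)
        self._put(self._warm, key, value, self._warm_max)  # new keys start in warm
        return value

    @staticmethod
    def _put(tier, key, value, max_entries):
        tier[key] = value
        tier.move_to_end(key)
        while len(tier) > max_entries:
            tier.popitem(last=False)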
Eviction policies and when to use them
LRU (Least Recently Used)
Good default for workloads with temporal locality. Implementation usually requires a hash map + doubly linked list (or language-provided ordered map). Memory overhead can be significant because each entry needs pointers/links.
LFU (Least Frequently Used)
Better when some keys are consistently hot over long periods. More complex and can be expensive to update frequency counters. Approximate LFU (e.g., TinyLFU) is common in high-performance caches.
FIFO / Random
Simple and low overhead. Random eviction can be surprisingly effective and avoids maintaining recency metadata, which can reduce CPU and memory overhead.
Clock / Second-chance
Approximate LRU with lower overhead. Useful when you need LRU-like behavior without heavy pointer chasing.
Memory-conscious data structures for caches
Store compact values, not object graphs
Cache entries often outlive request-scoped objects. If you store a rich object graph, you keep many objects alive and increase memory overhead. Prefer compact representations:
- Store IDs and small immutable fields instead of full domain objects.
- Store packed bytes (e.g., struct-packed binary, JSON bytes, or protocol buffers) when appropriate.
- Store tuples/structs rather than nested dictionaries/hashes.
Interning and canonicalization
If keys repeat (e.g., same strings), canonicalize them to reduce duplicates. In Java, string interning can help but must be used carefully to avoid filling the intern pool with unbounded data. In Python, consider using small integers or enums instead of repeated strings for internal keys.
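For example, in CPython you can canonicalize and intern repeated key strings; as with Java's intern pool, only do this for bounded, trusted key sets:
import sys

def canonical_key(raw: str) -> str:
    # Normalize, then intern so repeated keys share a single string object (CPython).
    return sys.intern(raw.strip().lower())

k1 = canonical_key("  User-42 ")
k2 = canonical_key("user-42")
assert k1 == k2 and k1 is k2  # one copy in memory; equality checks can short-circuit on identity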
Avoid per-entry overhead where possible
Many cache implementations pay overhead per entry: node objects, wrappers, timestamps, locks. If you have millions of entries, overhead dominates. Consider:
- Storing metadata in parallel arrays (struct-of-arrays) in C.
- Using primitive maps in Java (fastutil, HPPC) when keys/values are primitives.
- Using compact tuples and __slots__ in Python for entry objects if you must allocate them.
- In Ruby, preferring arrays/structs over nested hashes for entry payloads.
Byte-size accounting
Entry count limits are easy but can be misleading when values vary in size. For memory-conscious caches, track approximate byte size per entry and enforce a maximum total bytes budget. Exact size is hard in managed languages; approximate size is usually sufficient if you’re consistent.
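One possible approximation in Python; the UTF-8 assumption and the container fudge factor are arbitrary, and sys.getsizeof reports only shallow size:
import sys

def approx_weight(value):
    # Rough byte estimate for budget accounting; consistency matters more than exactness.
    if isinstance(value, (bytes, bytearray)):
        return len(value)
    if isinstance(value, str):
        return len(value.encode("utf-8"))
    if isinstance(value, (list, tuple)):
        return 64 + sum(approx_weight(v) for v in value)  # 64 = crude container overhead guess
    return sys.getsizeof(value)                           # shallow size only
A cache using this would track the running total and evict while it exceeds the byte budget, as the Java weigher example does later in this chapter.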
Python: practical caching with bounds and stampede control
Pattern: bounded LRU with TTL
Python’s functools.lru_cache is convenient but does not support TTL and can keep large values alive until eviction. For TTL + size, implement a small cache object that stores (value, expires_at) per entry (plus a size estimate if you need a byte budget) and evicts by LRU.
import time, threading, collections

class TTLRUCache:
    def __init__(self, max_entries=1024, ttl_seconds=60):
        self.max_entries = max_entries
        self.ttl = ttl_seconds
        self._lock = threading.Lock()
        self._data = collections.OrderedDict()  # key -> (value, expires_at)

    def get(self, key, loader):
        now = time.monotonic()  # monotonic clock: immune to wall-clock adjustments
        with self._lock:
            item = self._data.get(key)
            if item is not None:
                value, exp = item
                if exp > now:
                    self._data.move_to_end(key)
                    return value
                # expired: drop lazily
                self._data.pop(key, None)
        # Load outside the lock to avoid blocking other keys
        value = loader(key)
        exp = now + self.ttl
        with self._lock:
            self._data[key] = (value, exp)
            self._data.move_to_end(key)
            while len(self._data) > self.max_entries:
                self._data.popitem(last=False)
        return value
Step-by-step notes:
- Use OrderedDict to maintain recency with low overhead.
- Check expiry on read; remove expired entries lazily.
- Load outside the lock to reduce contention, but this can cause stampedes.
Adding single-flight (per-key in-flight futures)
To prevent stampedes, track in-flight loads per key. One thread loads; others wait for the result.
import threading

class SingleFlight:
    def __init__(self):
        self._lock = threading.Lock()
        self._calls = {}  # key -> [Event, result, error]; mutated in place so waiters see the outcome

    def do(self, key, fn):
        with self._lock:
            call = self._calls.get(key)
            if call is None:
                call = [threading.Event(), None, None]
                self._calls[key] = call
                leader = True
            else:
                leader = False
        if leader:
            try:
                call[1] = fn()
            except Exception as e:
                call[2] = e
            finally:
                with self._lock:
                    # cleanup before signaling; waiters hold their own reference to `call`
                    del self._calls[key]
                call[0].set()
        else:
            call[0].wait()  # wait for the leader's result instead of re-running the loader
        if call[2] is not None:
            raise call[2]
        return call[1]
In practice, you combine this with the cache: on a miss, run the loader through single-flight so only one thread loads; every waiter receives the same result, which the caller then stores in the cache.
Memory-conscious entry payloads
If you cache many entries, avoid storing large dicts per entry. Use tuples or small dataclasses with slots=True (Python 3.10+ supports __slots__ in dataclasses) to reduce per-object overhead.
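For illustration, a slotted dataclass entry (field names are arbitrary):
from dataclasses import dataclass

@dataclass(slots=True)  # Python 3.10+: instances get __slots__ and no per-instance __dict__
class Entry:
    value: bytes
    expires_at: float
    weight: int

entry = Entry(value=b"...", expires_at=0.0, weight=3)
# Compared with a dict payload like {"value": ..., "expires_at": ..., "weight": ...},
# the slotted instance avoids a per-entry dict, which adds up across millions of entries.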
Ruby: LRU/TTL and compact payloads
Pattern: LRU/TTL with an insertion-ordered Hash (no explicit linked list)
Ruby’s Hash preserves insertion order, which can be used for simple LRU-like behavior by deleting and reinserting keys. For moderate sizes, this is often sufficient and avoids implementing a linked list.
class TTLCache
  def initialize(max_entries:, ttl_seconds:)
    @max = max_entries
    @ttl = ttl_seconds
    @h = {} # key => [value, expires_at]
  end

  def get(key)
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    v = @h[key]
    if v
      value, exp = v
      if exp > now
        # refresh recency: delete + reinsert
        @h.delete(key)
        @h[key] = [value, exp]
        return value
      else
        @h.delete(key)
      end
    end
    value = yield
    exp = now + @ttl
    @h[key] = [value, exp]
    evict!
    value
  end

  def evict!
    while @h.length > @max
      oldest_key, _ = @h.first
      @h.delete(oldest_key)
    end
  end
end
Step-by-step notes:
- Use monotonic time for TTL to avoid issues with wall-clock changes.
- Delete+reinsert updates recency without extra node objects.
- Eviction by removing @h.first is O(1) for an insertion-ordered Hash.
Memory-conscious Ruby values
- Prefer arrays over hashes for fixed-shape payloads (e.g., [id, status, ts]).
- Freeze immutable cached strings to reduce accidental duplication and to make sharing safer.
- Be careful with symbol creation from untrusted/unbounded input; symbols can behave like interned identifiers and may not be reclaimed promptly in older runtimes.
Java: bounded caches, approximate sizing, and reference-aware patterns
Use a production-grade cache when possible
In Java, a well-tuned cache library (commonly Caffeine) provides high-performance eviction (W-TinyLFU), async loading, refresh-after-write, and size-based eviction with weigher functions. The key memory-conscious feature is that you can bound by entry count or weighted size and let the cache handle concurrency efficiently.
Even if you implement your own, the patterns are the same: explicit bounds, efficient eviction, and stampede control.
Pattern: size-bounded cache with a weigher
When values vary widely, use a weight function to approximate memory cost (e.g., bytes of a byte array, length of a string, number of elements). The cache evicts until total weight is under the limit.
// Pseudocode illustrating the idea; use a cache library for production.
interface Weigher<V> { int weightOf(V v); }

class WeightedCache<K,V> {
    final long maxWeight;
    long currentWeight = 0;
    final java.util.LinkedHashMap<K,Entry<V>> map;
    final Weigher<V> weigher;

    static class Entry<V> { V v; long expNanos; int w; }

    WeightedCache(long maxWeight, Weigher<V> weigher) {
        this.maxWeight = maxWeight;
        this.weigher = weigher;
        this.map = new java.util.LinkedHashMap<>(16, 0.75f, true);
    }

    synchronized V get(K k, java.util.function.Function<K,V> loader, long ttlNanos) {
        long now = System.nanoTime();
        Entry<V> e = map.get(k);
        if (e != null && e.expNanos > now) return e.v;
        if (e != null) removeEntry(k, e);
        V v = loader.apply(k);
        Entry<V> ne = new Entry<>();
        ne.v = v;
        ne.expNanos = now + ttlNanos;
        ne.w = weigher.weightOf(v);
        map.put(k, ne);
        currentWeight += ne.w;
        evict();
        return v;
    }

    void removeEntry(K k, Entry<V> e) {
        map.remove(k);
        currentWeight -= e.w;
    }

    void evict() {
        var it = map.entrySet().iterator();
        while (currentWeight > maxWeight && it.hasNext()) {
            var victim = it.next();
            currentWeight -= victim.getValue().w;
            it.remove();
        }
    }
}
Memory-conscious choices:
- Use LinkedHashMap in access-order mode to implement LRU without separate node objects.
- Store primitive metadata (expiry, weight) alongside the value to avoid wrapper allocations.
- Use a weigher to approximate memory usage and enforce a byte budget.
Reference-aware caching (weak/soft references)
Java offers WeakReference and SoftReference. Weak references are cleared aggressively; soft references are cleared under memory pressure. These can be useful for memoizing derived data where recomputation is acceptable. However, reference-based caches can be unpredictable: eviction depends on GC heuristics, not your policy. Prefer explicit bounds for latency predictability; use weak/soft references only for “nice-to-have” caches.
C: explicit cache structures with tight memory control
Pattern: fixed-capacity hash table + clock eviction
In C, you can build caches with very low overhead by using fixed-size arrays and storing entries in contiguous memory. This improves locality and makes memory usage deterministic. A common approach is:
- Fixed-capacity hash table mapping keys to indices.
- Entry array storing key, value pointer/inline value, expiry, and a reference bit.
- Clock hand for eviction (second-chance): scan entries; if ref bit is set, clear it and continue; if clear, evict.
Step-by-step:
- Choose a maximum number of entries and maximum value size (or store pointers to separately allocated values with a byte budget).
- Define an entry struct with compact fields (use 32-bit where possible).
- Use monotonic time for TTL.
- On access, set the entry’s reference bit.
- On insert when full, run the clock eviction loop until you find a victim.
#include <stdint.h>
#include <string.h>
#include <time.h>

typedef struct {
    uint64_t key_hash;
    uint32_t key_len;
    uint32_t val_len;
    uint64_t exp_ms;
    uint8_t ref;
    uint8_t used;
    // For simplicity: inline small values (fixed max)
    char key[32];
    char val[128];
} entry_t;

typedef struct {
    entry_t *entries;
    uint32_t cap;
    uint32_t hand;
} clock_cache_t;

// Monotonic clock in milliseconds (POSIX clock_gettime).
static uint64_t now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000u + (uint64_t)ts.tv_nsec / 1000000u;
}

// Pseudocode: lookup by linear probing on key_hash; compare key bytes.
// On hit: set ref=1, check exp_ms.
// On insert: if full, evict using clock hand.
static uint32_t evict_one(clock_cache_t *c) {
    for (;;) {
        entry_t *e = &c->entries[c->hand];
        uint32_t idx = c->hand;
        c->hand = (c->hand + 1) % c->cap;
        if (!e->used) return idx;                                // free slot
        if (e->exp_ms <= now_ms()) { e->used = 0; return idx; }  // expired: reclaim
        if (e->ref) { e->ref = 0; continue; }                    // second chance
        e->used = 0;                                             // victim
        return idx;
    }
}
Memory-conscious choices:
- Inline small keys/values to avoid per-entry heap allocations and fragmentation.
- Use fixed-size buffers only when you can cap sizes; otherwise store pointers and maintain a separate slab/arena allocator with a byte budget.
- Prefer approximate LRU (clock) to avoid pointer-heavy linked lists.
Byte-budgeted value storage with arenas
If values vary in size, store them in an arena (a large buffer) and keep offsets/lengths in entries. When evicting, you can reclaim space by using a ring buffer arena (append-only with wraparound) and invalidating old offsets when overwritten. This is a practical approach for caches where values are immutable and can be recomputed.
Cross-language techniques for memory-conscious caching
Key normalization to improve hit rate and reduce duplicates
Normalize keys so semantically identical requests map to the same entry. Examples: canonicalize URL query parameter ordering, normalize case where appropriate, round timestamps to buckets, or map equivalent IDs to a canonical form. This increases hit rate and reduces the number of entries needed for the same benefit.
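A sketch of URL-key canonicalization in Python; what counts as "equivalent" (case, fragments, default parameters) is application-specific:
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def canonical_url_key(url: str) -> str:
    # Lowercase scheme/host, sort query parameters, drop the fragment.
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), parts.path, query, ""))

canonical_url_key("https://API.example.com/search?b=2&a=1")
# -> "https://api.example.com/search?a=1&b=2"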
Sharding to reduce lock contention
Instead of one global lock, shard the cache into N independent segments by hashing the key and selecting a shard. Each shard has its own map and eviction policy. This improves throughput in multi-threaded Python (despite the GIL, I/O-heavy apps still benefit), Ruby (depending on runtime), Java, and C.
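A minimal sharded cache sketch in Python; the shard count and per-shard bound are illustrative, and hot keys would still want single-flight around the loader:
import threading
from collections import OrderedDict

class ShardedCache:
    def __init__(self, shards=16, max_entries_per_shard=4096):
        self._shards = [
            {"lock": threading.Lock(), "data": OrderedDict()} for _ in range(shards)
        ]
        self._max = max_entries_per_shard

    def _shard(self, key):
        return self._shards[hash(key) % len(self._shards)]

    def get(self, key, loader):
        shard = self._shard(key)
        with shard["lock"]:                  # contention is limited to one shard
            data = shard["data"]
            if key in data:
                data.move_to_end(key)
                return data[key]
        value = loader(key)                  # load outside the lock
        with shard["lock"]:
            data = shard["data"]
            data[key] = value
            data.move_to_end(key)
            while len(data) > self._max:
                data.popitem(last=False)
        return value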
Admission control: don’t cache everything that misses
Some keys are one-offs and will never be reused. Caching them wastes memory and increases eviction churn. Admission control techniques include:
- Only cache after the second request (two-hit policy); see the sketch after this list.
- Use a small “probation” cache; promote to main cache on repeat access.
- Reject entries above a size threshold (don’t let one huge value evict many useful small ones).
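A sketch of the two-hit policy combined with a size threshold; the wrapped cache's put method and the limits are assumptions for illustration:
from collections import OrderedDict

class TwoHitAdmission:
    def __init__(self, cache, seen_max=65536, max_value_bytes=1_000_000):
        self._cache = cache            # any cache exposing put(key, value) (assumed interface)
        self._seen = OrderedDict()     # probation set: keys that have missed exactly once
        self._seen_max = seen_max      # bound the probation set too, or it becomes its own leak
        self._max_value_bytes = max_value_bytes

    def maybe_admit(self, key, value, size_bytes):
        if size_bytes > self._max_value_bytes:
            return False               # never let one huge value evict many small ones
        if key in self._seen:
            del self._seen[key]
            self._cache.put(key, value)  # second request: admit to the main cache
            return True
        self._seen[key] = None         # first request: remember the key, don't cache the value
        if len(self._seen) > self._seen_max:
            self._seen.popitem(last=False)
        return False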
Refresh-ahead (stale-while-revalidate)
To avoid latency spikes on expiry, serve slightly stale data while refreshing in the background. This pattern is especially useful for hot keys with expensive loaders.
Step-by-step (a sketch follows the list):
- Store value with two timestamps: soft expiry (refresh time) and hard expiry (must not serve beyond).
- On read: if before soft expiry, serve normally.
- If between soft and hard expiry: serve cached value and trigger async refresh (single-flight).
- If beyond hard expiry: block and reload (or fail fast with fallback).
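A sketch of refresh-ahead in Python; spawning a thread per refresh is illustrative only (a real implementation would use a worker pool), and the TTL values are placeholders:
import threading
import time

class RefreshAheadCache:
    def __init__(self, loader, soft_ttl=30.0, hard_ttl=300.0):
        self._loader = loader
        self._soft = soft_ttl
        self._hard = hard_ttl
        self._data = {}             # key -> (value, loaded_at)
        self._refreshing = set()    # keys with an in-flight background refresh (single-flight)
        self._lock = threading.Lock()

    def get(self, key):
        now = time.monotonic()
        with self._lock:
            hit = self._data.get(key)
        if hit is not None:
            value, loaded_at = hit
            age = now - loaded_at
            if age < self._soft:
                return value                 # fresh: serve normally
            if age < self._hard:
                self._refresh_async(key)     # stale but servable: refresh in the background
                return value
        return self._load(key)               # missing or beyond hard expiry: block and reload

    def _load(self, key):
        value = self._loader(key)
        with self._lock:
            self._data[key] = (value, time.monotonic())
        return value

    def _refresh_async(self, key):
        with self._lock:
            if key in self._refreshing:
                return
            self._refreshing.add(key)

        def worker():
            try:
                self._load(key)
            finally:
                with self._lock:
                    self._refreshing.discard(key)

        threading.Thread(target=worker, daemon=True).start()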
Cache invalidation by versioning
Instead of deleting keys, incorporate a version into the key: user:123:v17. When the user changes, bump the version. This avoids complex invalidation logic and reduces race conditions. The cost is that old versions remain until TTL/eviction, so you must keep bounds tight.
Choosing what to cache: derived vs source data
Cache derived results (e.g., computed aggregates, formatted payloads) when recomputation is expensive and the derived data is smaller than the source. Cache source data (e.g., raw records) when it is reused across many derived computations. Memory-conscious design often favors caching the smallest representation that can be expanded quickly.
Practical checklist: implementing a safe, bounded cache
- Define bounds: max entries and/or max weight (bytes).
- Define freshness: TTL, versioning, or event invalidation.
- Prevent stampedes: single-flight per key for expensive loads.
- Choose eviction: LRU for temporal locality, clock for low overhead, LFU/approx LFU for stable hot sets.
- Make payloads compact: avoid storing large object graphs; store compact structs/tuples/bytes.
- Account for large values: size thresholds, weighting, or separate “large object” handling.
- Use monotonic time for TTL decisions.
- Plan for failure: decide whether to cache errors, and for how long (usually short negative caching).
- Test eviction behavior: verify that memory stays within budget under adversarial key patterns.