The Retrieval Cache Hierarchy for Production RAG (2026 Architecture)

Q: Why is a single response-level Redis cache not enough for production RAG?

A single response cache misses every interior caching opportunity (embedding, BM25 postings, dense ANN graph, cross-encoder rerank) and inherits four specific failure modes that the stratified hierarchy avoids. The first is cache-key collapse under semantic similarity: "what is our refund policy" and "how do refunds work" should plausibly hit the same response, but a string-equality key treats them as distinct, a naive normalised key merges some variants while losing information on adversarial inputs, and a vector-similarity key introduces its own correctness risk, because similar-but-not-identical queries can surface the wrong cached answer. The second is hidden non-determinism in the prompt: hybrid retrieval with reciprocal rank fusion produces different top-k orderings across runs when BM25 and dense scores are close, so the same query at 10:01 and 10:02 can produce prompts in which chunk-C and chunk-A swap places, and a hash-of-prompt cache key misses. The cache then reports a 12 percent hit rate when the real intra-day repeat-query rate is closer to 35 percent. The third is multi-tenant key leakage: the response cache key has to include the tenant scope (otherwise Tenant A’s answer surfaces in Tenant B’s session) and the access-control scope (otherwise a user who lost permission to a document still gets cached answers grounded in it). Response-layer caches built without these scopes are data-exfiltration incidents waiting to be filed. The fourth is invalidation at the wrong granularity: when a single document changes, the response cache for every query that retrieved that document is stale, the dense ANN graph entry for that document's embedding is stale, the BM25 posting lists for terms in that document are stale, and the cross-encoder rerank cache for any (query, candidate) pair involving the document is stale. Single-level invalidation reaches none of these interior tiers. The fix is not to remove the cache but to design it tier by tier to match the retrieval pipeline.
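The second failure mode can be reproduced in a few lines. The sketch below is illustrative only: `prompt_hash_key` and `order_insensitive_key` are hypothetical helper names, and keying on the sorted set of chunk ids is one possible mitigation, not something the pipeline above prescribes. It is only safe if the generated answer is treated as independent of chunk order.

```python
import hashlib

def prompt_hash_key(query: str, chunks: list[str]) -> str:
    """Naive key: hash of the fully assembled prompt, so chunk order matters."""
    prompt = query + "\n" + "\n".join(chunks)
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

def order_insensitive_key(query: str, chunk_ids: list[str]) -> str:
    """Hypothetical alternative: hash the query plus the sorted set of chunk ids."""
    payload = query + "\x1f" + "\x1f".join(sorted(set(chunk_ids)))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

query = "what is our refund policy"
run_1001 = ["chunk-A", "chunk-C", "chunk-B"]  # fused ranking at 10:01
run_1002 = ["chunk-C", "chunk-A", "chunk-B"]  # same chunks, near-tie flipped at 10:02

naive_match = prompt_hash_key(query, run_1001) == prompt_hash_key(query, run_1002)
stable_match = order_insensitive_key(query, run_1001) == order_insensitive_key(query, run_1002)
print(naive_match, stable_match)  # False True
```

The same retrieved set yields a miss under the naive key and a hit under the order-insensitive one, which is how a measured hit rate can sit well below the true repeat-query rate.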

Q: How should I derive the cache key at each tier, and why does the canonical-query-hash discipline matter?

The canonical query hash is computed once at the start of the pipeline and propagated through every tier, instead of re-normalising at each tier (a maintenance trap that produces silently divergent key spaces). The canonicalisation lowercases the query, collapses whitespace, optionally strips leading and trailing punctuation, and explicitly does not strip semantically meaningful punctuation such as question marks. It is a workload-specific calibration decision, not a default, because over-aggressive normalisation (stripping diacritics on multilingual workloads, stripping numbers on numeric workloads, collapsing case on case-sensitive identifiers) produces wrong hits. At Tier 1 the key is sha256(model_id + model_version + canonical_query); the model_version is non-negotiable because re-deploying the embedding model changes every vector in the index, and the cache must invalidate atomically with the index swap. At Tier 4 the key is (query_hash, candidate_id, model_id, model_version), where model_version is the reranker model version; this is the same safety belt as at Tier 1. At Tier 5 the key has to encode every input that affects the answer: canonical query hash, tenant identifier (so Tenant A’s responses do not surface in Tenant B’s session), access-control-list hash computed from the user’s effective permissions at query time (so a user who lost permission does not get a cached answer grounded in a forbidden document), retrieved document revision hashes (so document changes invalidate the cached answer), prompt-template version (so prompt-engineering changes invalidate atomically), and model identifier and version (so model upgrades produce a fresh cache rather than mixing old and new outputs). The discipline pays back in three places: prompt-template iteration does not thrash the cache, model upgrades are safe to roll out and roll back, and multi-tenant isolation survives security review.
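A minimal sketch of the key derivation described above, under stated assumptions. The function names, the `tier` discriminator field, and the JSON encoding are illustrative choices, not part of the original design: the text specifies plain concatenation at Tier 1, and the sketch hashes a JSON object with sorted keys instead so that field boundaries cannot blur. Sorting the document revision hashes is a further assumption, made so that retrieval order does not change the Tier 5 key.

```python
import hashlib
import json
import re

def canonical_query(raw: str) -> str:
    """Lowercase and collapse whitespace; question marks and other punctuation are kept."""
    return re.sub(r"\s+", " ", raw.lower()).strip()

def digest(**fields) -> str:
    """SHA-256 over a JSON encoding with sorted keys (unambiguous field boundaries)."""
    payload = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def tier1_embedding_key(canon: str, model_id: str, model_version: str) -> str:
    return digest(tier=1, model_id=model_id, model_version=model_version, query=canon)

def tier4_rerank_key(query_hash: str, candidate_id: str,
                     model_id: str, model_version: str) -> str:
    return digest(tier=4, query_hash=query_hash, candidate_id=candidate_id,
                  model_id=model_id, model_version=model_version)

def tier5_response_key(query_hash: str, tenant_id: str, acl_hash: str,
                       doc_revs: list[str], template_version: str,
                       model_id: str, model_version: str) -> str:
    return digest(tier=5, query_hash=query_hash, tenant_id=tenant_id,
                  acl_hash=acl_hash, doc_revs=sorted(doc_revs),  # order-insensitive
                  template_version=template_version,
                  model_id=model_id, model_version=model_version)

# Canonicalise once, hash once, then pass the hash to every tier.
canon = canonical_query("  What is   our Refund policy? ")
qhash = digest(query=canon)

key_a = tier5_response_key(qhash, "tenant-a", "acl-1", ["rev-2", "rev-1"],
                           "tmpl-v3", "gen-model", "v1")
key_a_reordered = tier5_response_key(qhash, "tenant-a", "acl-1", ["rev-1", "rev-2"],
                                     "tmpl-v3", "gen-model", "v1")
key_b = tier5_response_key(qhash, "tenant-b", "acl-1", ["rev-1", "rev-2"],
                           "tmpl-v3", "gen-model", "v1")
print(key_a == key_a_reordered, key_a == key_b)  # True False
```

Because every tier receives `qhash` rather than the raw string, a change to the canonicalisation rules happens in exactly one function and cannot leave two tiers with divergent key spaces.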

Q: How do I handle cache invalidation when documents change, and why is TTL alone insufficient?

TTL-only invalidation is the source of most production cache misery: document changes are not reflected until the TTL expires, so with a one-hour TTL an entry cached just before a document changes keeps serving the stale answer for up to an hour, and for about half the TTL on average; on a corpus with daily document churn that window reopens with every change. The correct pattern is event-driven invalidation via a document-change topic. The document ingestion pipeline emits a `document_changed` event on a Kafka or SNS topic when a document is added, updated, or deleted; the event carries the document_id and the new document_revision_hash. Each cache tier subscribes to the topic and invalidates the entries that reference the changed document. Tier 1 (query embedding) ignores the event because query embeddings do not depend on document content. Tier 4 (rerank) deletes every cache entry where the candidate_id equals the changed document_id. Tier 5 (response) deletes every cache entry whose retrieved_doc_revs list includes a stale document_revision_hash. The implementation requires the cache key shape to support efficient invalidation by document_id, which usually means maintaining a reverse index (document_id → list of cache keys that reference it) alongside the primary cache; the reverse index lives in the same Redis cluster or a separate store, and the operational cost is one extra Redis write per cache write and one Redis range query per invalidation event. The production-grade default is event-driven invalidation plus a moderate TTL as a safety belt against missed events: the event-driven path is the primary mechanism, and the TTL is the upper bound on staleness when the event bus drops a message or the consumer falls behind. The freshness-budget alternative (set a low TTL and accept stale results during the TTL window) is simpler and works when the document-change rate is low and the staleness tolerance is high; for high-stakes workloads in medical, legal, or regulatory domains, the hybrid is the only reasonable choice.
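The reverse-index pattern can be sketched without any infrastructure. The example below is an in-memory stand-in, assuming a hypothetical `TierCache` class; in production the two dictionaries would live in Redis or a comparable store, the handler would be driven by the Kafka or SNS consumer, and a TTL would sit alongside as the safety belt. For brevity it invalidates by document_id alone rather than comparing revision hashes.

```python
from collections import defaultdict

class TierCache:
    """In-memory stand-in for one cache tier plus its reverse index (illustrative only)."""

    def __init__(self) -> None:
        self.entries: dict[str, str] = {}                       # cache key -> cached value
        self.by_document: dict[str, set[str]] = defaultdict(set)  # document_id -> cache keys

    def put(self, key: str, value: str, document_ids: list[str]) -> None:
        """Store the entry and record which documents it was grounded in."""
        self.entries[key] = value
        for doc_id in document_ids:
            self.by_document[doc_id].add(key)  # the one extra write per cache write

    def get(self, key: str):
        return self.entries.get(key)

    def on_document_changed(self, event: dict) -> int:
        """Handle a document_changed event; return how many entries were dropped."""
        keys = self.by_document.pop(event["document_id"], set())
        removed = 0
        for key in keys:
            # A key may already be gone if another document's event removed it first.
            if self.entries.pop(key, None) is not None:
                removed += 1
        return removed

cache = TierCache()
cache.put("resp:q1", "answer 1", ["doc-7", "doc-9"])
cache.put("resp:q2", "answer 2", ["doc-9"])
cache.put("resp:q3", "answer 3", ["doc-2"])

event = {"document_id": "doc-9", "document_revision_hash": "rev-b"}
removed = cache.on_document_changed(event)
print(removed, cache.get("resp:q1"), cache.get("resp:q3"))  # 2 None answer 3
```

Entries grounded in `doc-9` disappear immediately, while the unrelated `resp:q3` entry survives, which is the granularity a TTL alone cannot provide.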

Q: What is the realistic cost saving from the cache hierarchy on a customer-support RAG workload at scale?

A worked example at 5M queries per day with a typical hybrid pipeline gives the magnitude. Without caching, the cost breakdown at 2026 prices is approximately: embedding at 5M queries times 8 tokens per query times $0.08 per million tokens equals $3.20 per day; cross-encoder rerank at 5M queries times 100 candidates times $0.0001 per pair on small batched GPU inference equals $50,000 per day; LLM generation at 5M queries times $0.015 per query equals $75,000 per day; total approximately $125,000 per day or $3.75M per month, with BM25 and dense retrieval amortising into modest fixed infrastructure cost. With the cache hierarchy applied at typical hit rates for this workload (75 percent embedding hit rate, 35 percent rerank hit rate, 22 percent response hit rate), the cost becomes: embedding at 5M times 0.25 times 8 times $0.08/M equals $0.80 per day (saving $2.40); rerank at 5M times 100 times 0.65 times $0.0001 equals $32,500 per day (saving $17,500); LLM at 5M times 0.78 times $0.015 equals $58,500 per day (saving $16,500); cache infrastructure cost (Redis Cluster plus observability) approximately $2,000 per day; total approximately $93,001 per day or $2.79M per month. The saving is roughly $960K per month on a $3.75M baseline, a 26 percent reduction, nearly all of it from the rerank and response tiers. The hit rates assumed are conservative for a customer-support workload with concentrated query distribution; high-frequency consumer chat workloads with FAQ-like queries see 50–70 percent response cache hit rates and corresponding cost savings of 40–60 percent. The latency saving is harder to quantify but is the more important benefit on user-facing workloads: a cache hit at Tier 5 returns in under 10 ms versus up to 3 s for cold-path generation, which is the difference between an interactive UI and a wait-with-spinner UI.
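The arithmetic above is easy to re-run with different hit rates. The sketch below uses only the prices and volumes quoted in the worked example; the function and constant names are illustrative, and a 30-day month is assumed, matching the $3.75M baseline figure.

```python
QUERIES_PER_DAY = 5_000_000
TOKENS_PER_QUERY = 8
EMBED_PRICE_PER_M_TOKENS = 0.08   # USD per million tokens
RERANK_CANDIDATES = 100
RERANK_PRICE_PER_PAIR = 0.0001    # USD per (query, candidate) pair
LLM_PRICE_PER_QUERY = 0.015       # USD per generated answer
DAYS_PER_MONTH = 30               # assumption, consistent with $125K/day -> $3.75M/month

def daily_cost(embed_hit: float = 0.0, rerank_hit: float = 0.0,
               response_hit: float = 0.0, cache_infra: float = 0.0) -> float:
    """Daily cost in USD; each tier only pays for its cache misses."""
    embedding = (QUERIES_PER_DAY * TOKENS_PER_QUERY / 1_000_000
                 * EMBED_PRICE_PER_M_TOKENS * (1 - embed_hit))
    rerank = (QUERIES_PER_DAY * RERANK_CANDIDATES
              * RERANK_PRICE_PER_PAIR * (1 - rerank_hit))
    llm = QUERIES_PER_DAY * LLM_PRICE_PER_QUERY * (1 - response_hit)
    return embedding + rerank + llm + cache_infra

baseline = daily_cost()
cached = daily_cost(embed_hit=0.75, rerank_hit=0.35, response_hit=0.22,
                    cache_infra=2_000)
monthly_saving = (baseline - cached) * DAYS_PER_MONTH
reduction = (baseline - cached) / baseline

print(f"{baseline:,.2f} {cached:,.2f} {monthly_saving:,.2f} {reduction:.1%}")
# 125,003.20 93,000.80 960,072.00 25.6%
```

Raising `response_hit` into the 0.5 to 0.7 range shows how strongly the total depends on the response tier, since each response hit avoids the most expensive per-query step.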