KV-Cache Engineering: Paged Attention, Prefix Cache, Disaggregation (2026)

Q: Why is the KV-cache the dominant cost in LLM inference?

The KV-cache stores the per-token, per-layer, per-head key and value tensors that autoregressive attention requires to avoid recomputation. Its size scales linearly with sequence length and with concurrent requests, while model weights are loaded once and shared across all requests. For Llama-3 70B (80 layers, 8 KV heads, head dimension 128) at 32k context in bfloat16, each request occupies 10.7 GB of KV-cache; at a concurrency of 64 with 8k average context, the aggregate cache is ~170 GB, more than two full H100 80GB GPUs' worth of HBM just for cache. Additionally, the decode phase is bandwidth-bound: it reads the entire cache from HBM on every token, so HBM bandwidth, not compute, sets the throughput ceiling. Together these properties make KV-cache management the central determinant of production inference economics, more so than model choice or prompt engineering.
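The sizes quoted above can be reproduced with a few lines of arithmetic. The sketch below assumes the published Llama-3 70B attention configuration (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and 2 bytes per element for bfloat16; the function name is illustrative.

```python
def kv_cache_bytes(tokens: int, layers: int = 80, kv_heads: int = 8,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Bytes of KV-cache needed to hold `tokens` cached positions."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

per_token = kv_cache_bytes(1)             # bytes per cached token
single_32k = kv_cache_bytes(32 * 1024)    # one request at 32k context
fleet = 64 * kv_cache_bytes(8 * 1024)     # 64 requests at 8k average context

print(f"{per_token} B/token, {single_32k / 1e9:.1f} GB, {fleet / 1e9:.0f} GB")
# prints: 327680 B/token, 10.7 GB, 172 GB
```

Because every term is a plain multiplier, halving any one of them (fewer KV heads, a narrower dtype, a shorter context) halves the cache.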

Q: What is paged attention and why did vLLM popularise it?

Paged attention divides the KV-cache into fixed-size blocks (typically 16 or 32 tokens per block) and gives each request a block-table that maps logical sequence positions to physical block indices, much as an operating system maps virtual addresses to physical memory frames. Before paged attention, serving stacks reserved contiguous HBM for each request's full sequence-length budget, wasting 60-80% of the reserved cache memory to fragmentation and over-reservation and capping effective throughput at 2-4× below the GPU's capability. Paged attention removes almost all of this waste (only the last, partially filled block of each request has unused slots), raising aggregate cache utilisation to 80-95% and throughput proportionally. vLLM popularised the technique because it was the first open-source serving stack to implement it and to ship it with kernels (PagedAttention) fast enough that the block-table indirection overhead was negligible. Every serious 2026-era serving stack (vLLM, SGLang, TensorRT-LLM) now implements paged attention as a baseline.
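The block-table idea can be sketched in a few lines. This is a toy allocator for illustration only, not vLLM's implementation: the class and method names are hypothetical, and it tracks block ownership without storing any tensors.

```python
class PagedKVAllocator:
    """Toy allocator mapping each request's token positions to physical blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # ids of free physical blocks
        self.tables = {}                      # request id -> block table (list of ids)
        self.lengths = {}                     # request id -> tokens stored so far

    def append_token(self, req: str):
        """Reserve a slot for one new token; return (physical_block, offset)."""
        table = self.tables.setdefault(req, [])
        n = self.lengths.get(req, 0)
        if n % self.block_size == 0:          # no block yet, or last block is full
            if not self.free:
                raise MemoryError("no free KV blocks; caller must evict or preempt")
            table.append(self.free.pop())
        self.lengths[req] = n + 1
        return table[n // self.block_size], n % self.block_size

    def release(self, req: str) -> None:
        """Return a finished request's blocks to the free pool."""
        self.free.extend(self.tables.pop(req, []))
        self.lengths.pop(req, None)

alloc = PagedKVAllocator(num_blocks=8, block_size=16)
slots = [alloc.append_token("a") for _ in range(17)]  # 17 tokens need 2 blocks
```

Blocks are claimed only when a token actually needs one, so a request wastes at most `block_size - 1` slots, and its physical blocks need not be contiguous.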

Q: How does prefix caching work and what hit rates are achievable?

Prefix caching exploits the fact that the KV-cache for any given prefix is deterministic: the same input tokens always produce the same KV vectors. The serving stack computes a cumulative content hash for each block in the page pool, maintains a global hash-to-physical-block lookup, and, when a new request arrives, walks its prefix block by block, matching against the cache. Matching blocks are reused (the new request's block-table points at existing physical blocks); prefill compute starts at the first non-matching block and covers every token from there to the end of the prompt. For production RAG workloads with fixed system prompts and deterministic retrieval, hit rates of 70-85% are routine and 90%+ is achievable with careful engineering. The wins are large: a 4-6× reduction in prefill compute and a 3-5× improvement in time-to-first-token. The implementation cost is minimal, and the ROI is the highest of any single optimisation available on most LLM serving deployments.
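A minimal sketch of the cumulative-hash lookup follows, assuming 16-token blocks and SHA-256 as the content hash. The class is hypothetical and omits eviction and reference counting, which any real serving stack needs.

```python
import hashlib

BLOCK = 16  # tokens per block (assumed)

def block_hashes(tokens):
    """One hash per full block; each hash covers the entire prefix up to that block."""
    hashes, prev = [], b""
    full = len(tokens) - len(tokens) % BLOCK      # a trailing partial block is not cached
    for i in range(0, full, BLOCK):
        chunk = ",".join(map(str, tokens[i:i + BLOCK])).encode()
        prev = hashlib.sha256(prev + b"|" + chunk).digest()
        hashes.append(prev.hex())
    return hashes

class PrefixCache:
    def __init__(self):
        self.hash_to_block = {}   # cumulative hash -> physical block id
        self.next_block = 0

    def admit(self, tokens):
        """Return (block_table, tokens_reused) and register any new full blocks."""
        table, reused = [], 0
        for h in block_hashes(tokens):
            if h in self.hash_to_block:
                reused += BLOCK               # hit: point at the existing block
            else:
                self.hash_to_block[h] = self.next_block   # miss: new block to prefill
                self.next_block += 1
            table.append(self.hash_to_block[h])
        return table, reused

system_prompt = list(range(40))               # shared 40-token prefix
cache = PrefixCache()
table1, reused1 = cache.admit(system_prompt + [1000] * 10)
table2, reused2 = cache.admit(system_prompt + [2000] * 10)
```

Because each hash chains in the previous one, a block can only match if every block before it matched, so hits always form a contiguous prefix. In the example the second request reuses the first two blocks (32 tokens); the third block mixes shared and request-specific tokens and has to be prefilled.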

Q: Why disaggregate prefill and decode onto separate GPU pools?

Prefill is compute-bound: it processes the entire input prompt in a single forward pass with high arithmetic intensity and reaches >90% GPU utilisation easily. Decode is bandwidth-bound: it generates output one token at a time, each step reading the full KV-cache from HBM with low arithmetic intensity, struggling to reach 30% compute utilisation. When colocated on the same GPU under load, the two phases interfere destructively: prefill spikes block decode requests in the queue (latency goes up), and decode's constant HBM bandwidth pressure starves prefill of throughput. Disaggregating onto separate pools lets each side run at its optimal configuration — prefill on compute-rich GPUs with large prompt batches, decode on HBM-rich GPUs with large sequence batches and continuous batching. The trade-off is that the KV-cache must be transferred between pools (~11 ms over NVLink, ~50 ms over RDMA), which constrains the network topology of the deployment. The throughput gains under load are typically 1.5-2.5× compared to colocated serving.
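Whether the transfer cost is acceptable depends on the size of the cache being moved and the bandwidth of the link. The sketch below shows only the ideal-case arithmetic; it ignores link latency, protocol overhead, and any overlap of transfer with compute. The 2 GB cache and both bandwidth values are illustrative assumptions, not measurements of any particular interconnect.

```python
def kv_transfer_ms(cache_bytes: float, link_gbytes_per_s: float) -> float:
    """Ideal time in milliseconds to move a KV-cache across one link."""
    return cache_bytes / (link_gbytes_per_s * 1e9) * 1e3

cache = 2e9                               # hypothetical 2 GB prompt cache
fast = kv_transfer_ms(cache, 400.0)       # hypothetical intra-node link, about 5 ms
slow = kv_transfer_ms(cache, 50.0)        # hypothetical inter-node link, about 40 ms
```

Transfer time grows linearly with prompt length, so a long-context deployment has to budget it against the time-to-first-token target when choosing where the prefill and decode pools sit.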

Q: What is cross-layer KV sharing and how does it affect serving stacks?

Cross-layer KV sharing structures the model into groups of consecutive transformer layers that share a single KV projection — instead of every layer maintaining its own independent KV-cache, every layer in the group reuses the same cached keys and values. A 36-layer model with 4-layer groups stores KV for only 9 logical groups, a 4× cache reduction. Combined with grouped-query attention (8× KV reduction across heads), the total reduction relative to a vanilla MHA baseline reaches 32×. The serving-stack implications are non-trivial: the paged-attention block-table becomes a two-level structure (each logical layer has a shared-with pointer indicating its group), the attention kernel must dereference the pointer on every read, the eviction granularity becomes the layer group, and the prefix-cache cumulative hash must be computed per group rather than per layer. Stacks that adopted Gemma 4 (the canonical 2026 cross-layer-sharing model) without explicit support lost 30-60% of effective prefix-cache hit rate temporarily. vLLM 0.7 and SGLang 0.4 shipped layer-group-aware prefix caching in 2026 H1; verify support before adopting cross-layer-sharing models in production.
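The reduction arithmetic and the two-level lookup can be sketched as follows. The layer-to-group rule (integer division by the group size) and the dictionary layout are assumptions for illustration, not the layout of any specific model or serving stack.

```python
def kv_group(layer: int, group_size: int = 4) -> int:
    """Index of the shared KV group that `layer` reads from (assumed contiguous groups)."""
    return layer // group_size

num_layers, group_size, gqa_ratio = 36, 4, 8
groups = sorted({kv_group(layer, group_size) for layer in range(num_layers)})

cache_reduction = num_layers / len(groups)     # 36 layers / 9 groups = 4.0
total_vs_mha = cache_reduction * gqa_ratio     # 4 * 8 = 32.0

# Two-level lookup: layer -> group -> that group's block table for one request.
group_block_tables = {g: [] for g in groups}   # hypothetical per-request structure

def blocks_for_layer(layer: int):
    return group_block_tables[kv_group(layer, group_size)]
```

Layers 0 to 3 resolve to the same block-table object, which is why eviction and prefix-cache bookkeeping have to operate on groups rather than on individual layers.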

Q: What is non-uniform tensor parallelism and why does it matter for Laguna XS.2?

Standard tensor parallelism assumes every transformer layer has the same number of attention heads, so each TP rank gets the same number of heads at every layer. Laguna XS.2 deliberately varies the head count per layer (32 in early layers, 64 in middle, 96 in late layers) because the optimal attention width varies with depth; accuracy gains of 2-4 percentage points on long-context reasoning are reported. The serving consequence is that the head-symmetric TP layout no longer applies: under 8-way TP each GPU holds 4 heads in the early layers but 12 in the late layers, so KV-cache size per GPU varies by layer, activation tensor shapes change layer by layer, and the FlashAttention fast paths fall back to slower general-case kernels (20-40% slower per attention op). Production deployments use a layer-grouped TP plan that varies topology with depth, keeping heads divisible across each layer group's GPU set. FlashAttention-3 added a non-uniform-head fast path in mid-2026 that recovers most of the throughput loss; the lesson is that serving stacks must now be treated as first-class engineering surfaces with explicit kernel-version pinning and regression testing across model versions.
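A short sketch of how the per-rank shard changes with depth. The head counts are the ones quoted above; the head dimension of 128, the bfloat16 width, and the assumption that every attention head has its own KV head are hypothetical choices made only to put numbers on the per-rank cache.

```python
def heads_per_rank(heads: int, tp: int) -> int:
    """Heads each TP rank holds; refuses layouts that do not divide evenly."""
    if heads % tp:
        raise ValueError(f"{heads} heads do not divide across {tp} ranks")
    return heads // tp

def kv_bytes_per_token_per_rank(heads: int, tp: int,
                                head_dim: int = 128, dtype_bytes: int = 2) -> int:
    # Assumes one KV head per attention head (no GQA); factor 2 covers K and V.
    return 2 * heads_per_rank(heads, tp) * head_dim * dtype_bytes

layer_heads = {"early": 32, "middle": 64, "late": 96}
plan = {name: heads_per_rank(h, tp=8) for name, h in layer_heads.items()}
kv = {name: kv_bytes_per_token_per_rank(h, tp=8) for name, h in layer_heads.items()}
# plan == {"early": 4, "middle": 8, "late": 12}
```

Under these assumptions a late layer needs three times the per-token cache of an early layer on every rank, so a block pool sized for a uniform layer shape no longer fits.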

Q: How do CSA and HCA reduce KV-cache cost in DeepSeek V4?

Compressed Sequence Attention (CSA) projects the KV-cache for a window of tokens into a lower-dimensional summary and stores only the summary, recovering approximate attention at decode time. Hierarchical Compressed Attention (HCA) does this at multiple scales, fine-grained for recent tokens and coarse-grained for distant tokens, so cache size grows logarithmically with sequence length rather than linearly. Combined, CSA+HCA reduces KV-cache by ~90% and total inference FLOPs by ~73% at 1M-token context, at a cost of only 1-2 percentage points of model quality on retrieval and reasoning benchmarks. The serving math is transformative: the KV-cache of a 670B-parameter V4 model at 1M context needs ~300 GB of HBM (at least 3 H200 GPUs at 141 GB each) rather than ~3 TB (at least 22 H200 GPUs), before counting weights. The kernel landscape is the gating factor: SGLang shipped CSA support in August 2026 and has HCA in development; TensorRT-LLM has CSA, with HCA on its Q4 2026 roadmap; vLLM supports both, but its kernels are 15-25% slower than SGLang's. For V4-class models, the choice of serving stack is now constrained primarily by feature support rather than schedulability.

Q: How does speculative decoding interact with KV-cache compression?

Speculative decoding accelerates the decode phase by running a small draft model that proposes K tokens at once, which the target model then verifies in parallel. The draft model typically shares the target's KV-cache (cache-shared draft heads in EAGLE-3, for example) to avoid doubling cache memory and to keep the draft's predictions in distribution with the target. When the target uses CSA/HCA compression, the cache layout is non-standard and the draft must either use the same compressed cache (requiring matching kernel implementations on both sides) or maintain its own uncompressed cache (doubling memory pressure and defeating the compression). The clean production setup uses CSA in both target and draft with matching kernel paths; SGLang has supported this since Q3 2026, and the gains stack: CSA+HCA cuts cache to ~10% and FLOPs to ~27% of the uncompressed baseline at 1M context, and speculative decoding adds a 1.7-2.3× decode speedup on top. See the speculative decoding article for the full draft-target pairing analysis.
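The propose-then-verify loop can be sketched with toy models. This is the greedy variant of speculative decoding, in which a proposed token is accepted only if it equals the target's own greedy choice. It does not model cache sharing or compression, and the two "models" are hypothetical arithmetic functions standing in for real networks.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # token prefix -> greedy next token

def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       max_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (tokens, target verification passes)."""
    out, passes = list(prompt), 0
    while len(out) - len(prompt) < max_new:
        # 1. The draft proposes k tokens autoregressively.
        ctx, proposal = list(out), []
        for _ in range(k):
            proposal.append(draft(ctx))
            ctx.append(proposal[-1])
        # 2. The target checks all k positions. A real stack does this in one
        #    batched forward pass; this toy calls the target once per position.
        passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            accepted.append(expected)      # always keep the target's own choice
            if tok != expected:
                break                      # discard the rest of the proposal
        else:
            accepted.append(target(out + accepted))  # bonus token: all k matched
        out.extend(accepted)
    return out[:len(prompt) + max_new], passes

# Toy models over a 7-token vocabulary; the draft is wrong after token 5.
def target(ctx: List[int]) -> int:
    return (ctx[-1] * 3 + 1) % 7

def draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 5 else target(ctx)

tokens, passes = speculative_decode(target, draft, prompt=[1], max_new=12)
```

The output is identical to decoding with the target alone; the gain is that `passes` is smaller than the number of generated tokens whenever the draft guesses well.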

Q: What are the most common KV-cache anti-patterns in production?

Five recur. (1) Treating the serving stack as a static commodity — adopting vLLM or TensorRT-LLM at one version and never upgrading, missing 1.5-3× throughput improvements available in each two-quarter window. (2) Colocating prefill and decode under load when disaggregation would deliver 1.5-2.5× more throughput on the same hardware. (3) Not measuring prefix-cache hit rate at the application level — the serving-stack internal metric is the wrong one; application-level instrumentation catches regressions when retrieval determinism breaks or prompt templates introduce non-determinism. (4) Ignoring the kernel-support matrix when adopting a new model — dropping in Gemma 4 or DeepSeek V4 without verifying kernel support gets you the slow general-case kernel path (2-5× slower than the optimised path). (5) Holding KV-cache for inactive sessions in HBM rather than offloading to CPU memory or evicting with content-addressed re-prefill on resume. The common thread is that KV-cache engineering rewards proactive investment — every one of these failure modes is preventable with the right instrumentation and discipline up front.

Q: How much cost improvement is achievable from full-stack KV-cache engineering?

The compounding is dramatic. A naive baseline (Llama-3 70B on a single 8-GPU node with contiguous KV-cache, colocated prefill-decode, no prefix cache, no spec-decode) serves perhaps 800 tokens/sec aggregate at $40/hour, costing $13.89 per million tokens. Adding paged attention raises throughput to 2400 tokens/sec at the same cost: $4.63/million. Adding prefix caching at 75% RAG hit rate cuts effective prefill cost by 4×: ~$2.80/million. Adding speculative decoding cuts decode time by ~2×: ~$1.70/million. Switching to a 2026-generation cross-layer-sharing model with the same parameter budget cuts cache by 3× and raises sustainable concurrency by ~2.5×: ~$0.70/million. The end-to-end improvement is ~20× on the same physical hardware. These numbers are not theoretical: teams that have implemented the full stack report improvements in these ranges. The reason production LLM economics often disappoint is not that the technology is too expensive, but that the serving stack is under-engineered relative to what is achievable today.
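Cost per million tokens follows directly from aggregate throughput and the hourly price of the hardware. A minimal sketch of that conversion, using round illustrative inputs rather than the figures above; the function names are hypothetical.

```python
def usd_per_million_tokens(tokens_per_sec: float, usd_per_hour: float) -> float:
    """Serving cost per million tokens at a sustained aggregate throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return usd_per_hour / tokens_per_hour * 1e6

def improvement(before: float, after: float) -> float:
    """Multiplicative cost improvement between two configurations."""
    return before / after

base = usd_per_million_tokens(1_000, 36.0)    # 3.6M tokens/hour at $36/hour
tuned = usd_per_million_tokens(3_000, 36.0)   # same hardware, 3x the throughput
```

At a fixed hourly price, cost per token is inversely proportional to throughput, which is why each optimisation in the chain multiplies into the next rather than adding to it.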