Prompt Caching Architecture for LLM Apps & Agents

Q: How is prompt caching different from KV caching?

They operate at different layers and solve different problems, even though both involve reusing computed attention state. KV caching is an internal mechanism of the inference server: during a single generation, the model stores the key and value tensors for tokens it has already processed so it does not recompute them as it generates each new token, and serving engines manage this in GPU memory with techniques like paged attention. It is about avoiding redundant computation within one in-flight request and is largely invisible to you as an API caller. Prompt caching, sometimes called prefix caching, is an application- and API-layer feature: the provider persists the processed state of a stable prefix of your prompt across separate requests, so that when a later request begins with the exact same tokens it can skip reprocessing them and charge you a fraction of the normal input-token price. In short, KV caching makes a single request generate tokens efficiently, while prompt caching makes repeated requests that share a prefix cheaper and faster to start. You control prompt caching through prompt structure and provider settings; KV caching is handled by the serving stack.

Q: How much money does prompt caching actually save?

The savings depend on how much of your prompt is a stable, reused prefix and how often it is reused, but they can be very large for the workloads where caching applies. Reading from the cache typically costs a small fraction of the normal input-token price (on the order of ninety percent cheaper for the cached portion on major providers), while writing the cache on the first request costs the same as or slightly more than a normal request. The practical implication is that caching pays off when a long prefix is reused several times within the cache lifetime, which is exactly the situation in agent loops and multi-turn chat where the system prompt, tool definitions, and shared context are re-sent on every call. As a concrete shape: if an agent re-sends forty thousand identical prefix tokens across twenty steps, without caching you pay the full input price on all eight hundred thousand tokens (forty thousand times twenty), whereas with caching you pay the write price once for the forty thousand tokens and a fraction of the input price for each of the nineteen subsequent reads, often cutting total input cost by well over half. The savings shrink to nothing for one-off prompts that are never reused, so the technique is targeted, not universal.
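The arithmetic above can be sketched in a few lines. The multipliers are illustrative assumptions, not any provider's published prices: a cache write is taken as 1.25 times the normal input price and a cache read as 0.10 times. The function name `prefix_cost` is hypothetical, and only the shared prefix is counted, not the per-step suffix or output tokens.

```python
def prefix_cost(prefix_tokens, steps, write_multiplier=1.25, read_multiplier=0.10):
    """Return (uncached, cached) cost of the prefix, in units of one
    normally priced input token. The multipliers are assumed values."""
    # Without caching, every step pays full price for the whole prefix.
    uncached = prefix_tokens * steps
    # With caching, the first step writes the cache and the rest read it.
    cached = prefix_tokens * (write_multiplier + (steps - 1) * read_multiplier)
    return uncached, cached


uncached, cached = prefix_cost(40_000, 20)
print(f"uncached: {uncached:,.0f} token-units")
print(f"cached:   {cached:,.0f} token-units")
print(f"saving:   {1 - cached / uncached:.0%}")

# A prefix that is used only once costs more with caching than without.
single_uncached, single_cached = prefix_cost(40_000, 1)
print(single_cached > single_uncached)
```

Under these assumed multipliers the prefix cost falls by roughly 84 percent over twenty steps, and the single-use case shows why a never-reused prefix is a net loss. Substitute your provider's actual write and read prices before relying on the numbers.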

Q: What should go in the cached part of the prompt?

The guiding principle is to place everything that is stable and reused at the front of the prompt and everything that changes per request at the end, because prompt caching matches an exact token prefix and breaks at the first differing token. Good candidates for the cached prefix, roughly in order, are: the system prompt and persona instructions, which rarely change; tool or function-call definitions, which are often very large and completely static across calls; few-shot examples that you include to steer behaviour; and large shared context such as a manual, a codebase, a policy document, or any reference material that many requests use unchanged. The variable suffix that must stay outside the cached region includes the current user message, freshly retrieved RAG chunks that differ per query, any timestamps, and per-request identifiers. A frequent and costly mistake is interpolating the current date or a request id into the system prompt, because that single change near the top invalidates the cache for everything after it. Structuring the prompt as a fixed prefix plus a small changing tail is what makes caching effective.
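One way to picture this layout is a prompt builder that returns the stable prefix and the volatile tail separately. This is a minimal sketch: `build_prompt` and `fingerprint` are hypothetical helpers, the tool schema is invented for illustration, and comparing strings is only a proxy for what the provider does, which is to match on tokens.

```python
import hashlib
import json


def build_prompt(system_prompt, tools, shared_context, user_message, today, request_id):
    """Split a prompt into a stable prefix and a per-request suffix."""
    # Stable prefix: must come out identical on every request.
    prefix = "\n\n".join([
        system_prompt,
        json.dumps(tools, sort_keys=True),
        shared_context,
    ])
    # Volatile suffix: everything that changes per request goes last.
    suffix = f"Date: {today}\nRequest id: {request_id}\nUser: {user_message}"
    return prefix, suffix


def fingerprint(text):
    """Short hash used to check whether two prefixes are identical."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


SYSTEM = "You are a support assistant for the billing product."
TOOLS = [{"name": "lookup_invoice", "parameters": {"invoice_id": "string"}}]
MANUAL = "Billing manual text goes here."

p1, s1 = build_prompt(SYSTEM, TOOLS, MANUAL, "Where is my invoice?", "2024-05-01", "req-1")
p2, s2 = build_prompt(SYSTEM, TOOLS, MANUAL, "Can I get a refund?", "2024-05-02", "req-2")
print(fingerprint(p1) == fingerprint(p2))  # same prefix, different tails

# The costly mistake: a date interpolated into the system prompt.
bad1, _ = build_prompt(f"Today is 2024-05-01. {SYSTEM}", TOOLS, MANUAL, "Hi", "", "req-3")
bad2, _ = build_prompt(f"Today is 2024-05-02. {SYSTEM}", TOOLS, MANUAL, "Hi", "", "req-4")
print(fingerprint(bad1) == fingerprint(bad2))  # prefixes now differ
```

Logging the prefix fingerprint alongside each request is a cheap way to notice when something volatile has crept into the supposedly fixed region.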

Q: Does prompt caching work with RAG?

It works with RAG, but only for the stable parts of the prompt, and understanding that boundary is essential. In a typical RAG request the prompt contains a system prompt, possibly tool definitions, the retrieved context chunks, and the user question. The retrieved chunks are usually different for every query because retrieval selects them based on the question, so they cannot be cached across queries and should sit in the uncached suffix. What you can cache is any large, stable context that is shared across many queries — for example, if your application repeatedly answers questions about the same single large document, contract, or knowledge base section, you can place that whole document in the cached prefix and append only the per-query question, paying the discounted cached price for the document on every call. So the rule for RAG is to separate the stable shared context, which you cache, from the volatile per-query retrieved chunks, which you do not. When most of your prompt mass is dynamic retrieval, prompt caching helps less, and you may get more from semantic response caching or from reducing how much you retrieve; when a big shared document dominates the prompt, prompt caching is very effective.
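A rough way to judge which side of that line a workload falls on is to compute what fraction of each prompt is stable. The helper below is a hypothetical sketch with made-up token counts. It assumes the stable parts sit at the front of the prompt so that they form a cacheable prefix.

```python
def cacheable_fraction(stable_tokens, retrieved_tokens, question_tokens):
    """Share of the prompt that sits in the stable, cacheable prefix."""
    total = stable_tokens + retrieved_tokens + question_tokens
    return stable_tokens / total if total else 0.0


# Case 1: one large shared document answers every question.
shared_doc = cacheable_fraction(stable_tokens=50_000, retrieved_tokens=0, question_tokens=50)

# Case 2: a short system prompt plus fresh retrieved chunks on every query.
dynamic_rag = cacheable_fraction(stable_tokens=800, retrieved_tokens=6_000, question_tokens=50)

print(f"shared document:   {shared_doc:.1%} cacheable")
print(f"dynamic retrieval: {dynamic_rag:.1%} cacheable")
```

In the first case nearly the whole prompt can be read from cache. In the second only a small slice can, which is the situation where semantic response caching or retrieving less is likely to matter more.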

Q: How long does a prompt cache last?

Prompt caches are short-lived and expire based on inactivity, which shapes when the technique helps. On major providers the default lifetime is commonly around five minutes of inactivity, meaning the cached prefix stays warm and reusable as long as requests keep hitting it within that window, and it is evicted after a sufficiently long idle gap; some providers offer longer-lived cache tiers at additional cost. The consequence is that prompt caching favours bursty, high-frequency traffic that reuses the same prefix repeatedly in a short span, such as an active agent loop, a live chat session, or a batch of related requests, and benefits little from traffic that is spread far apart in time, because each request after a long idle gap pays the full uncached price and re-writes the cache. This is also why caching a prefix that is used only once is wasteful: you incur the cache-write cost and then the entry expires before any cheaper read occurs. When designing for caching, group related requests in time where you can, and treat the TTL as a real constraint when estimating savings rather than assuming the cache persists indefinitely.
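The effect of an inactivity-based TTL on bursty versus sparse traffic can be sketched with a small simulation. This is a toy model under stated assumptions: one prefix, a 300-second TTL, every request refreshing the window, and a request arriving exactly at the TTL boundary counted as a hit. The function name `classify_requests` is hypothetical, and real provider eviction may behave differently.

```python
def classify_requests(timestamps, ttl_seconds=300):
    """Label each request 'write' (cache cold) or 'read' (cache warm)
    under an inactivity TTL. Timestamps are seconds since some start."""
    labels = []
    last_seen = None
    for t in sorted(timestamps):
        warm = last_seen is not None and (t - last_seen) <= ttl_seconds
        labels.append("read" if warm else "write")
        last_seen = t  # every use restarts the inactivity window
    return labels


bursty = classify_requests([0, 30, 60, 90, 120])    # an active agent loop
sparse = classify_requests([0, 600, 1200, 1800])    # one request every ten minutes

print("bursty:", bursty)
print("sparse:", sparse)
```

The bursty sequence pays for one write and then reads from cache, while the sparse sequence pays the write price on every request because each gap exceeds the TTL.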

Q: Is prompt caching the same as semantic caching?

No, they are different techniques that target different opportunities and are best used together. Semantic caching stores the final response to a query and serves it again when a new query is semantically similar, as judged by embedding similarity; it can skip the model call entirely, which is powerful for workloads with many repeated or near-duplicate questions, but it risks returning a stale or subtly mismatched answer if the similarity threshold is too loose. Prompt caching does not store or reuse answers at all; it stores the processed state of an exact prompt prefix so that re-sending that prefix is cheaper and faster, while the model still runs and produces a fresh response for the variable part of the prompt. Because prompt caching is exact-match on the prefix, it never returns a wrong answer the way an over-eager semantic cache can; it simply reduces the cost and latency of the input processing. The two compose well: use semantic caching to avoid calling the model for repeated questions, and prompt caching to make the calls you do make cheaper when they share a large stable prefix. Confusing them leads teams to expect answer-level savings from prompt caching or exactness from semantic caching, which neither provides.
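The threshold risk on the semantic side can be shown with a toy cache. This is a sketch only: `embed` is a bag-of-words stand-in for a real embedding model, `SemanticCache` is a hypothetical class, and the two thresholds are arbitrary values chosen to make the contrast visible.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in for a real embedding model: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Stores whole responses and serves them for similar-enough queries."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response)

    def put(self, query, response):
        self.entries.append((embed(query), response))

    def get(self, query):
        q = embed(query)
        scored = [(cosine(q, emb), resp) for emb, resp in self.entries]
        if not scored:
            return None
        score, response = max(scored, key=lambda pair: pair[0])
        return response if score >= self.threshold else None


stored_q = "what is the refund policy for annual plans"
new_q = "what is the refund policy for monthly plans"

loose = SemanticCache(threshold=0.80)
strict = SemanticCache(threshold=0.95)
for cache in (loose, strict):
    cache.put(stored_q, "Annual plans can be refunded within 30 days.")

print(loose.get(new_q))   # the annual-plan answer, served for a monthly-plan question
print(strict.get(new_q))  # None: a miss, so the model would be called
```

The loose cache returns an answer about the wrong plan, which is the failure mode an exact-prefix prompt cache cannot have, since it never returns an answer at all.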

Q: Where should I implement prompt caching in my architecture?

The provider performs the caching, but the responsibility for structuring prompts to benefit from it should live in a centralised place rather than scattered across every caller, and for multi-model or multi-tenant systems that place is usually the AI gateway or a shared prompt-assembly layer. Centralising prompt construction means the static prefix — system prompt, tool definitions, shared context — is assembled consistently and identically on every request, which is exactly what keeps the cached prefix stable and the hit rate high; if each service hand-builds prompts slightly differently, prefixes diverge and caches miss. A gateway is also the natural point to set and tune cache settings such as breakpoints and TTL tiers, to enforce that volatile content stays in the suffix, and to collect the observability that tells you whether caching is working, namely cache-hit rate and the split between cache-write, cache-read, and uncached tokens. Implementing it centrally also makes it easy to apply caching uniformly across many features and to change policy in one place. For a single simple application you can manage prompt structure inline, but as soon as you have multiple callers or models, push prompt assembly and cache policy into the shared gateway layer.
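The observability piece can be as simple as aggregating per-request token counts at the gateway. The sketch below assumes each response has already been mapped into a hypothetical `Usage` record. The field names are illustrative, since providers report cache usage under their own names, and `summarize` is likewise a made-up helper.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    # Hypothetical field names; map your provider's usage fields onto these.
    cache_write_tokens: int
    cache_read_tokens: int
    uncached_tokens: int


def summarize(usages):
    """Aggregate cache-hit rate and the write / read / uncached token split."""
    writes = sum(u.cache_write_tokens for u in usages)
    reads = sum(u.cache_read_tokens for u in usages)
    uncached = sum(u.uncached_tokens for u in usages)
    total = writes + reads + uncached
    hits = sum(1 for u in usages if u.cache_read_tokens > 0)
    return {
        "requests": len(usages),
        "request_hit_rate": hits / len(usages) if usages else 0.0,
        "cache_write_tokens": writes,
        "cache_read_tokens": reads,
        "uncached_tokens": uncached,
        "read_token_share": reads / total if total else 0.0,
    }


log = [
    Usage(40_000, 0, 200),   # first request writes the prefix
    Usage(0, 40_000, 250),   # hit
    Usage(0, 40_000, 300),   # hit
    Usage(40_000, 0, 150),   # miss: the prefix was written again
]
report = summarize(log)
for key, value in report.items():
    print(f"{key}: {value}")
```

A repeated write with no matching read, like the last record, is the pattern worth alerting on, because it usually means the prefix changed or the entry expired.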

Q: What breaks a prompt cache without you noticing?

The most common silent cache-buster is putting variable content early in the prompt, because caching matches an exact token prefix and invalidates everything after the first differing token. Interpolating the current date, a timestamp, a request id, a user name, or any per-request value into the system prompt or near the top of the prefix will cause a cache miss on essentially every call, even though the prompt looks almost identical to a human. Another subtle one is reordering or editing the cached region between releases — changing the wording of the system prompt, adding or reordering tool definitions, or reformatting the shared context all change the prefix tokens and bust the cache, so you should treat the cached prefix as a versioned, stable contract and be aware that a deploy can reset your hit rate. Non-determinism in how the prompt is serialised, such as a tools list whose order is not stable, has the same effect. Finally, simply letting too much idle time pass lets the entry expire under the TTL. The defence is to keep all volatile values in the uncached suffix, assemble the prefix deterministically, and monitor cache-hit rate so a regression shows up immediately rather than as a quiet cost increase.
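Two of these defences are easy to sketch: serialising tool definitions in a stable order, and locating where two supposedly identical prefixes first differ. Both helpers are hypothetical, the tool dictionaries are invented and assumed to carry a `name` key, and character positions are only a proxy for the token positions the provider actually compares.

```python
import json


def canonical_tools(tools):
    """Serialise tool definitions the same way regardless of input order."""
    ordered = sorted(tools, key=lambda tool: tool["name"])
    return json.dumps(ordered, sort_keys=True, separators=(",", ":"))


def first_divergence(a, b):
    """Index of the first differing character, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))


search = {"name": "search", "description": "Search the docs."}
lookup = {"description": "Look up an order.", "name": "lookup_order"}

# List order and key order no longer affect the serialised prefix.
print(canonical_tools([search, lookup]) == canonical_tools([lookup, search]))

# A date near the top makes the prefixes diverge almost immediately.
yesterday = "Today is 2024-01-01. You are a helpful assistant."
today = "Today is 2024-01-02. You are a helpful assistant."
print(first_divergence(yesterday, today))
```

Running `first_divergence` on the prefixes from two consecutive requests, or from two releases, gives a quick pointer to the value that is busting the cache.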