Context Engineering for Production LLM Agents (2026)

Q: What is the difference between context engineering and prompt engineering?

Prompt engineering is the craft of choosing the words and instructions that make a model behave on a single call — a writing problem. Context engineering is the systems problem of deciding, on every turn of an agent loop, exactly which tokens from a much larger universe of memory, retrieval, tool definitions, and history enter the fixed context window, in what order, and at what fidelity. Prompt engineering optimizes one message; context engineering optimizes the dynamic assembly of the whole window under a token and latency budget, which is closer to cache management and query planning than to copywriting.

Q: Why does adding more context sometimes make an LLM agent worse, not better?

Because attention is finite and unevenly distributed. The "Lost in the Middle" effect describes how models attend most reliably to the start and end of their context and degrade in the middle; newer long-context models soften this but do not eliminate it. Past a workload-specific threshold, each added token dilutes attention, invites distraction, raises latency, and increases input cost in direct proportion. A curated 6-chunk window can beat a 40-chunk one on both accuracy and price, so measure on your own workload rather than assuming more is better. The window is a budget to allocate, not a bucket to fill.

Q: How should I allocate a context window budget across competing needs?

Fix a per-turn budget (for example 32K tokens even on a 200K-capable model) and cap each slice rather than letting any slice grow best-effort: roughly 5–10% for system and policy, 5–15% for only the currently relevant tool schemas, 10–20% for durable memory, 20–35% for reranked retrieved knowledge, 25–40% for recent working history, and a reserved ~10% of headroom for the model output. These are ranges to choose from, not figures to add up: the upper bounds together exceed 100%, so pick one cap per slice such that the caps total the budget. When a slice wants more than its cap, the pipeline reranks harder and drops the tail rather than borrowing from headroom, so the window stays predictable and cacheable.
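A minimal sketch of capped slices, assuming illustrative cap values picked from inside the ranges quoted above. The names `SLICE_CAPS`, `slice_budgets`, and `fit_slice` are hypothetical, and token counts are assumed to be supplied by whatever tokenizer the pipeline already uses.

```python
# Hypothetical caps as fractions of the per-turn budget. Each value sits
# inside the range quoted in the text, and together they sum to 1.0.
SLICE_CAPS = {
    "system": 0.08,
    "tools": 0.10,
    "memory": 0.12,
    "retrieval": 0.25,
    "history": 0.35,
    "headroom": 0.10,
}


def slice_budgets(total_tokens: int, caps: dict[str, float] = SLICE_CAPS) -> dict[str, int]:
    """Turn fractional caps into hard per-slice token caps."""
    if abs(sum(caps.values()) - 1.0) > 1e-9:
        raise ValueError("slice caps must sum to 1.0")
    return {name: round(total_tokens * frac) for name, frac in caps.items()}


def fit_slice(items: list[tuple[float, int, str]], cap_tokens: int) -> list[str]:
    """Keep the highest-scoring items that fit under the cap and drop the tail.

    Each item is (relevance_score, token_count, text). Nothing is borrowed
    from another slice: once the cap would be exceeded, the rest is dropped.
    """
    kept: list[str] = []
    used = 0
    for _score, tokens, text in sorted(items, key=lambda it: it[0], reverse=True):
        if used + tokens > cap_tokens:
            break
        kept.append(text)
        used += tokens
    return kept


budgets = slice_budgets(32_000)
chunks = [(0.9, 3000, "chunk-a"), (0.2, 4000, "chunk-c"), (0.7, 4000, "chunk-b")]
print(budgets["retrieval"], fit_slice(chunks, budgets["retrieval"]))
```

Stopping at the first item that does not fit, instead of skipping it and trying smaller ones, is a deliberate simplification here: it keeps the retained set a strict top-of-ranking prefix, which is easier to reason about when debugging.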

Q: What is context compaction and when should it trigger?

Compaction is what keeps a long-running agent loop from either overflowing the window or forgetting its task. When accumulated history crosses a high-water mark — about 75% of budget, leaving room to run the compaction call itself — the older turns are summarized into a structured recap (original goal, decisions and why, artefacts touched, open threads, stated constraints) and the raw turns are replaced by that recap. Keep the last 3–5 turns verbatim alongside it, because the most recent tool outputs hold the immediate next step. Never simply truncate the oldest messages; that deletes the task goal.
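The trigger logic can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: `count_tokens` is a crude whitespace stand-in for a real tokenizer, and `summarize` is a hypothetical callable that in practice would be a model call returning the structured recap.

```python
from typing import Callable

HIGH_WATER = 0.75   # fraction of the history budget that triggers compaction
KEEP_VERBATIM = 4   # recent turns kept as-is (the text suggests 3-5)


def count_tokens(text: str) -> int:
    # Crude stand-in: whitespace word count. Swap in a real tokenizer.
    return len(text.split())


def maybe_compact(
    turns: list[str],
    budget_tokens: int,
    summarize: Callable[[list[str]], str],
) -> list[str]:
    """Replace older turns with a recap once history crosses the high-water mark."""
    used = sum(count_tokens(t) for t in turns)
    if used < HIGH_WATER * budget_tokens or len(turns) <= KEEP_VERBATIM:
        return turns  # still under the mark, or nothing old enough to compact
    older, recent = turns[:-KEEP_VERBATIM], turns[-KEEP_VERBATIM:]
    recap = summarize(older)  # the recap replaces the raw older turns
    return [recap] + recent


def stub_summarize(older: list[str]) -> str:
    # Placeholder recap that preserves the first turn, where the goal usually lives.
    return f"RECAP: goal={older[0]!r}; {len(older)} earlier turns compacted"


history = ["goal: migrate the billing db"] + [f"turn {i} " + "x " * 10 for i in range(9)]
print(maybe_compact(history, budget_tokens=100, summarize=stub_summarize)[0])
```

Because the recap is produced from the older turns as a whole, the original goal survives in the recap instead of being cut off, which is the failure the truncate-the-oldest approach runs into.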

Q: How does prompt caching change context-engineering design?

Prompt caching (offered by Anthropic, OpenAI, and Google) lets you keep a stable prefix (system prompt, tool definitions, durable memory) cached across turns so you pay full price only for the delta. Cached input tokens can cost roughly an order of magnitude less than fresh ones (for example around $0.30 per million versus $3 per million for a mid-tier model) and skip re-processing latency, though the exact discount, and any surcharge for writing the cache, varies by provider and model. Caches match on an identical leading prefix, so the architectural consequence is decisive: put stable, cacheable content in a contiguous prefix and volatile content (latest turn, fresh retrieval) in a suffix. A window reshuffled every turn never caches and silently multiplies cost.
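A rough worked example of the prefix/suffix split and its cost effect, using the illustrative prices quoted above. The function names are hypothetical, no provider API is called, and the arithmetic ignores any cache-write surcharge or cache expiry, so treat the numbers as an upper bound on the saving.

```python
import hashlib

FRESH_PER_MTOK = 3.00    # USD per million fresh input tokens (figure from the text)
CACHED_PER_MTOK = 0.30   # USD per million cached input tokens (figure from the text)


def assemble(system: str, tools: str, memory: str,
             retrieval: str, latest_turn: str) -> tuple[str, str]:
    """Stable content goes in a contiguous prefix, volatile content in the suffix."""
    prefix = "\n\n".join([system, tools, memory])
    suffix = "\n\n".join([retrieval, latest_turn])
    return prefix, suffix


def prefix_fingerprint(prefix: str) -> str:
    # If this changes between turns, the cached prefix will not be reused.
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]


def input_cost_usd(prefix_tokens: int, suffix_tokens: int, cache_hit: bool) -> float:
    """Input cost for one turn; only the prefix can be billed at the cached rate."""
    prefix_rate = CACHED_PER_MTOK if cache_hit else FRESH_PER_MTOK
    return (prefix_tokens * prefix_rate + suffix_tokens * FRESH_PER_MTOK) / 1_000_000


# A 20,000-token stable prefix plus a 4,000-token volatile suffix:
miss = input_cost_usd(20_000, 4_000, cache_hit=False)  # 24,000 fresh tokens
hit = input_cost_usd(20_000, 4_000, cache_hit=True)    # 20,000 cached + 4,000 fresh
print(f"miss=${miss:.3f} hit=${hit:.3f}")
```

Under these assumed prices the cached turn costs about a quarter of the uncached one. Logging the prefix fingerprint per turn is one cheap way to notice when something volatile has leaked into the prefix.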

Q: What is the boundary between context engineering and agent memory?

Agent memory is the persistent store — episodic, semantic, and procedural — that holds durable facts, decisions, and learned procedures across sessions. Context engineering is the per-turn assembly layer that recalls the relevant subset of that memory into the window by relevance to the current intent. Get the boundary wrong and you either bloat every window with facts the turn does not need, or you lose facts the moment they scroll out of raw history. Memory decides what is rememberable; context engineering decides what is present right now.
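To make the boundary concrete, here is a toy sketch in which the store holds everything rememberable and a separate `recall` step picks what is present for this turn. `MemoryItem`, `relevance`, and `recall` are hypothetical names, and the word-overlap score is only a stand-in for whatever relevance model (typically embeddings plus a reranker) a real pipeline would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryItem:
    kind: str   # "episodic", "semantic", or "procedural"
    text: str


def relevance(intent: str, item: MemoryItem) -> float:
    # Toy score: Jaccard overlap of lowercase words between intent and memory text.
    q = set(intent.lower().split())
    m = set(item.text.lower().split())
    union = q | m
    return len(q & m) / len(union) if union else 0.0


def recall(store: list[MemoryItem], intent: str,
           k: int = 2, min_score: float = 0.1) -> list[MemoryItem]:
    """Return at most k items relevant to the current intent; the rest stay in the store."""
    scored = [(relevance(intent, item), item) for item in store]
    scored = [pair for pair in scored if pair[0] >= min_score]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _score, item in scored[:k]]


store = [
    MemoryItem("semantic", "user prefers postgres for the billing database"),
    MemoryItem("episodic", "last week the deploy to staging failed"),
    MemoryItem("procedural", "to rotate api keys run the rotate script"),
]
for item in recall(store, "which database should billing use"):
    print(item.kind, "->", item.text)
```

The `min_score` floor matters as much as `k`: without it, an unrelated turn would still pull in the top-k memories and bloat the window with facts it does not need.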

Q: How do I measure whether my context pipeline is working?

Context assembly is ordinary code that runs before the model call, so for fixed inputs it can be made deterministic and is directly testable. Instrument token-budget adherence (p50/p95 assembled window size versus the cap), context precision and recall against a labelled set (did the window contain what was needed without burying it in noise), prompt cache hit rate (below ~60% on a multi-turn agent signals over-aggressive reordering), compaction fidelity (a probe such as "what was the original goal?" must still pass after compaction), and cost and time-to-first-token per turn. Treat a context-precision regression as a release blocker, exactly like a failing test.
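Three of these metrics can be sketched in a few lines. The definitions below are assumptions made for illustration: percentiles use the nearest-rank method, precision and recall are computed over chunk IDs from a labelled set, and cache hit rate is taken as the share of input tokens served from cache. All function names are hypothetical.

```python
def percentile(values: list[int], pct: int) -> int:
    """Nearest-rank percentile with integer arithmetic (pct is a whole number, 1-100)."""
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # ceiling division
    return ordered[rank - 1]


def budget_adherence(window_sizes: list[int], cap: int) -> dict[str, float]:
    """p50/p95 of assembled window size, plus the share of turns over the cap."""
    over = sum(1 for size in window_sizes if size > cap)
    return {
        "p50": percentile(window_sizes, 50),
        "p95": percentile(window_sizes, 95),
        "over_cap_rate": over / len(window_sizes),
    }


def precision_recall(window_ids: set[str], needed_ids: set[str]) -> tuple[float, float]:
    """Precision: share of the window that was needed. Recall: share of needed items present."""
    hits = len(window_ids & needed_ids)
    precision = hits / len(window_ids) if window_ids else 0.0
    recall = hits / len(needed_ids) if needed_ids else 1.0
    return precision, recall


def cache_hit_rate(turns: list[tuple[int, int]]) -> float:
    """turns holds (cached_input_tokens, total_input_tokens) per turn."""
    cached = sum(c for c, _total in turns)
    total = sum(t for _c, t in turns)
    return cached / total if total else 0.0


sizes = [1000 * i for i in range(1, 21)]
print(budget_adherence(sizes, cap=18_000))
print(precision_recall({"a", "b", "c", "d"}, {"a", "b", "e"}))
print(cache_hit_rate([(18_000, 24_000), (0, 24_000)]))
```

Wiring `precision_recall` into the test suite with a fixed labelled set is what lets a context-precision regression fail a build the same way a unit test would.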

Q: What are the most common context-engineering anti-patterns?

The recurring failures are: the kitchen-sink window that stuffs everything "just in case" (slower, costlier, less accurate); truncate-the-oldest, which deletes the task goal instead of compacting; reordering the window every turn, which defeats prompt caching and multiplies cost; pasting raw multi-thousand-token tool outputs instead of projecting the fields the agent needs; abstractive compaction with no faithfulness gate, which hallucinates a decision the user never made; sharing one window across sub-agents so a research agent's scratchpad pollutes the orchestrator; and carrying all tool schemas every turn instead of selecting the relevant handful.
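As one example of avoiding the raw-tool-output anti-pattern, the sketch below projects only the fields the agent needs before the result enters the window. The `project` helper and the dotted-path convention are assumptions made for illustration, and the sample payload is invented.

```python
import json


def project(tool_output: dict, fields: list[str]) -> dict:
    """Keep only the requested fields, addressed by dotted path (e.g. 'author.login')."""
    slim = {}
    for path in fields:
        node = tool_output
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                node = None  # path is absent; skip it instead of raising
                break
            node = node[key]
        if node is not None:
            slim[path] = node
    return slim


# Invented payload standing in for a multi-thousand-token tool result.
raw = {
    "id": 42,
    "status": "open",
    "author": {"login": "sk", "avatar_url": "https://example.com/a.png"},
    "body": "lorem ipsum " * 500,
}
slim = project(raw, ["id", "status", "author.login"])
print(slim, len(json.dumps(raw)), "->", len(json.dumps(slim)), "chars")
```

Keeping the field list next to each tool definition makes the projection reviewable: adding a field to the window becomes an explicit change instead of a side effect of the tool's response shape.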

By Satyam Kumar · June 23, 2026 · 11 min read

Satyam Kumar

Founder & AI Architect, AppScale LLP

AI and cloud architect. Helps build systems that scale to millions of users.
