Game AI Architecture: Procedural Quests and LLM NPC Dialogue (Budget Models, 2026)

Q: Why does game AI break the assistant-AI playbook in 2026, and what changes architecturally?

Three structural assumptions break in ways that compound. Latency budget: a 60-fps game has 16.67ms per frame and a 30-fps console game has 33.3ms; the time-to-first-token for an NPC dialogue exchange has to be inside the 200-400ms budget players perceive as natural for spoken dialogue with real-time animation, against the assistant-AI assumption that sub-second is responsive enough. The frontier-model API round-trip from a console in Warsaw to a US-East data centre is regularly outside this budget; the architecture has to put inference where the budget supports it (on-device or near-edge). Cost budget: the publisher's sustainable per-session budget is bounded in tens of euro-cents across a hundred-hour playthrough — two-to-three orders of magnitude tighter than assistant-AI economics; cost-per-session is a first-class engineering metric. Adversarial surface: the player population includes the speedrun community, the modding community, the streaming community whose audience reward includes any moment the system fails entertainingly, and the adversarial-prompting community whose explicit intent is to jailbreak the model on stream. The architecture cannot trust the player input or the player's system-of-record (save file, platform telemetry, network packet); the prompt-injection defence stack is part of the game architecture, not an enterprise add-on. The architecture that survives these constraints is multi-tier inference, state-machine-augmented dialogue, skeleton-authored quests, multi-layer defence stack, and per-session cost engineering.

Q: How do the three model tiers (on-device, edge, cloud) split responsibility and what fine-tuning does each require?

Tier 1 on-device small model: 1-3B parameter model quantised and fine-tuned for the game's dialogue style, running on the player's GPU (consoles ship inference-capable GPUs since current generation; PC players have inference-capable discrete graphics). Latency sub-150ms for 30-50 token responses; cost approximately zero (the player owns the hardware). Handles routine NPC dialogue: greetings, contextual reactions, ambient barks, repeated quest-stage reminders. The fine-tuning is the key engineering investment — vanilla small models produce dialogue that breaks immersion, well-fine-tuned small models produce dialogue indistinguishable from the writers' output for routine cases. Tier 2 edge mid-size model: 7-13B parameter model running on edge inference nodes geographically close to player population (Warsaw for Central European, Frankfurt for Western European, Virginia for US-East, Tokyo for Japan, São Paulo for Brazil). Latency 150-400ms; cost fractions of a euro-cent per call. Handles dialogue requiring more sophistication than the small model — key NPC responses to complex player input, quest conversations requiring narrative continuity, companion-character interactions requiring memory of recent events. The edge deployment is the engineering commitment: model-serving optimisation (vLLM, TensorRT-LLM, SGLang), capacity planning for player-population peak load. Tier 3 cloud frontier model: Anthropic Claude, OpenAI GPT, Google Gemini, or self-hosted equivalent via cloud API. Latency 500ms-2s; cost 1-10 euro-cents per call. Handles rare conversations the lower tiers cannot — flagship NPC dramatically-pivotal responses, player-driven creative prompts the writer team scoped for frontier-model handling. The deliberate-thinking animation covers the latency.

Q: How does the LLM augment the classical dialogue state machine without taking over from the writers' room?

The classical state machine remains the canonical structure (writers' authority over narrative beats preserved, QA tractable, localisation pipeline workable), with the LLM at three integration points. Per-state response generation: the LLM generates surface dialogue for a given state given the state's narrative intent, the NPC's personality, the player's recent context, and the writer-authored prompt with 5-10 style examples per NPC drawn from the writers' room canon; the post-generation guardrail checks for canon-violation, tone-violation, and safety-violation. Player-input interpretation: the player's natural-language input is interpreted by the LLM into the state-transition vocabulary the state machine recognises, turning the system from one-of-N dialogue choices into open-ended player input while maintaining state-machine-friendly interpretation; runs on the on-device small model fine-tuned on the player-input corpus. Procedural in-fill of state content: the state machine has slots scoped for procedural in-fill (NPC mentions a recent event, references a player choice, adapts to observed playstyle); the LLM in-fills within writer-defined templates that ensure the in-fill stays within the narrative envelope. The pattern preserves production discipline while delivering the dialogue capability the LLM enables; the player experiences the NPCs as more alive without the writers' room losing authority over the narrative.

Q: How does the procedural quest system use LLM in-fill without the LLM generating the quest skeleton?

The quest skeleton (objective structure, spatial layout, reward sequence, narrative beats) is authored or procedurally generated by a system the writers' room governs; the LLM does not generate the skeleton. The skeleton-authority constraint preserves the design discipline (difficulty curve, pacing, reward economy, narrative integration) and keeps QA tractable; teams that let the LLM generate the skeleton produce quests whose engagement metrics decline against the authored baseline as the LLM's narrative-arc weakness compounds across quest length. The in-fill template is the engineering artefact: each quest skeleton type has a template defining slots (quest name, description, NPC dialogue at each node, in-game documents, dialogue alternates), constraints per slot (canon, character-limit, localisation, safety), and writer-supplied few-shot examples that anchor the model's style. Template engineering is the heart of the architecture; teams that under-invest in templates produce quests whose surface content is the LLM's default style rather than the game's authorial voice — player perception becomes "AI slop" rather than "personalised content". The narrative-coherence eval is the gate: the eval pipeline samples the corpus and assesses narrative coherence, canon respect, difficulty curve, and player-engagement projection; the rejection-rate metric is the design-system's feedback loop. The composition with the dialogue system: quest NPC dialogue is generated by the dialogue system's tier-router, narrative continuity is maintained by agent-memory pattern, safety is enforced by the guardrail stack.

Q: What does the multi-layer content-safety and prompt-injection defence stack look like for a game?

The defence stack is wider than assistant-AI: it has to enforce the union of platform-policy filters (Sony, Microsoft, Nintendo, Valve content standards), regional regulatory constraints (Germany USK, Australia OFLC, Brazil ClassInd, China NPPA), and the publisher's own brand-and-IP standards. Six-layer adaptation. Pre-LLM input filter normalises player input and screens for known injection patterns (system-prompt-override attempts, known-jailbreak elicitations, meta-instruction injection through the dialogue system); the on-device small model runs the screening with a millisecond budget. Pre-LLM intent classifier categorises input intent (genuine in-game dialogue, mechanical exploit attempt, social-engineering attempt, off-topic chatter) and routes to appropriate handler. Prompt-template hardening uses delimiters and role separation to keep player input distinguishable from system instruction in the model's context. Post-LLM output classifier checks response against canon, tone, safety, and platform-policy constraints. Output firewall filters response for egress constraints (no PII leakage, no policy-violating content, no canon-breaking dialogue). Save-file and network-packet attack surface: save files and multiplayer network packets are player-controllable surfaces historically treated as trusted by the game state machine; the LLM-augmented architecture treats them as untrusted input screened by the same pre-LLM defence stack. Calibration is the player-experience tradeoff — too conservative produces unresponsive NPCs, too liberal permits jailbreak content the streaming community will showcase within launch week; the eval pipeline runs against the adversarial-input corpus the QA team curates.

Q: How is the per-session cost budget engineered, and what does budget-aware degradation look like?

The publisher establishes the per-session budget (typical: a few euro-cents to tens of euro-cents per hour of play, depending on the title's economics — a premium narrative-driven title can sustain higher than a free-to-play title); the architecture enforces the budget through the tier-router, the cache, and budget-aware degradation. The router's budget-awareness is the primary cost control: the router maintains running spend per session and shifts tier selection toward lower-cost tiers as session-spend approaches budget; the player's experience is the gradual shift from frontier-model-quality dialogue early in the session to small-model dialogue late, calibrated to be imperceptible at the population level. Player-class differentiation pairs with budget-awareness (a free-tier player has tighter budget than a paid-DLC player; competitive-multiplayer differs from single-player), governed by the publisher's monetisation policy. The cache hit-rate optimisation is the secondary cost control: hit rate above 30% for routine dialogue and above 60% for high-frequency player inputs (greetings, ambient barks, common quest reminders) is achievable with cache engineering — semantic indexing, NPC-and-scene-aware caching, periodic cache refresh as dialogue style evolves. Cache hit metric reviewed weekly during live-service period; cache hit rate decay is the leading indicator of dialogue style drift. Budget-aware degradation handles the tail case where session-spend hits budget: degradation moves dialogue to small-model-only tier with writer-authored fallback templates as the surface; player experience is reduced but not broken; degradation is logged and the player-cohort whose sessions reach the budget is reviewed (the cohort indicates either a budget under-set, a cache under-tuned, or a player behaviour the design did not anticipate).

Q: What does the cache architecture look like, and how do you sustain a high hit rate as dialogue style evolves through live-service updates?

The cache layer in front of the router handles the high-fraction repeat case. Player dialogue inputs cluster sharply — the top thousand inputs cover a meaningful fraction of total inputs across the player population — and the dialogue-response semantic cache returns the previously-generated response without invoking the model. The cache is a vector store indexed on the player input embedding plus the NPC identity and the scene context (the same input from the same player in different scenes can warrant different responses); the lookup runs in single-digit milliseconds. Cache hit rate above 30% is achievable for routine NPC dialogue with a well-tuned cache; above 60% for high-frequency cases (greetings, ambient barks, common quest reminders); cost savings are commercially meaningful at these rates. Sustaining the hit rate through live-service updates requires three disciplines. The cache-refresh policy: when the dialogue style evolves (new NPC writers join, the writers' room iterates the style guide, the localisation team refreshes a regional voice), the affected cache partitions are invalidated and rewarmed with the new generation; the rewarming is staged across the player population to avoid the inference-load spike. The cache-eviction policy: stale entries are evicted on the cache size budget, with the recency-and-frequency weighting tuned for the dialogue access pattern. The cache-quality monitoring: the cache hit rate is reviewed weekly with the per-NPC and per-scene breakdown; the rate decay indicates either dialogue style drift, player-input distribution shift, or cache-engineering regression — each requires a different intervention. The cache is not a write-once asset; it is a continuously-managed component of the live-service operation.

Q: How does the Polish game-development scene's context (CD Projekt Red, Techland, 11 bit studios, People Can Fly, Bloober Team) shape the architecture choices?

The Polish game-development scene operates with several context-specific factors that shape architecture choices for the studios in the cluster. The talent base: Warsaw, Krakow, Wroclaw, and Gdańsk have deep concentrations of game engineering talent including the inference-engineering, ML, and live-service operations capabilities the LLM-driven architecture requires; the talent availability supports Stage 3 ambitions that other geographies struggle to staff. The publisher economics: the leading Polish studios have a strong premium-narrative-title heritage (CD Projekt Red's Witcher and Cyberpunk franchises, Techland's Dying Light, 11 bit's This War of Mine and Frostpunk, Bloober Team's Silent Hill 2 remake) where the per-session cost budget can sustain LLM-driven dialogue and quest systems; the architecture choices here serve the premium-narrative segment differently from the free-to-play segment. The infrastructure proximity: AWS Frankfurt, Azure West Europe, Google Cloud Warsaw, and the Polish national cloud capacity provide low-latency edge inference for the European player population; the Tier 2 edge architecture has good infrastructure options. The regulatory context: Polish data-protection law (UODO, the Polish DPA) and the EU AI Act framework apply; the dialogue-system data flows (player input, NPC response, telemetry) are designed to comply with both. The cultural-export angle: Polish-developed titles are major cultural exports; the Polish-language dialogue quality (with the small model fine-tuned for Polish phonology, grammar, and idiom) is a differentiation the studios can lean into. The community ecosystem: GIC Poznań, Digital Dragons in Krakow, and the broader European games community provide forums for the architecture patterns to be contributed back. The studios that operate this architecture are positioned to set the next-generation reference.