Blog

Engineering Insights

In-depth analyses of AI systems, cloud architecture, distributed systems, and engineering leadership.

Reasoning LLM Models in Production: o-Series, DeepSeek-R1, Claude Extended Thinking — Architecture, Routing, and Cost (2026)
ai-architecture · 1 min read

Reasoning models — o-series, DeepSeek-R1, Claude extended thinking, Gemini deep-think, Qwen QwQ — are the most consequential 2026 LLM development and the most over-applied. Naively routing all traffic through them costs 5-50x more than necessary, adds 5-60 seconds of latency before any visible output, and on a meaningful subset of workloads produces worse answers than the same model family without reasoning. This article is a practitioner reference: what reasoning models actually are (hidden chain-of-thought tokens), the hidden-token cost model with long-tail distributions, when reasoning helps vs hurts vs adds no value, the tiered routing pattern with verifier escalation that delivers frontier quality at workhorse prices, the streamed-reasoning UX patterns, and the eval disciplines that catch regressions on model upgrades.

May 23, 2026
1M-Token Context Windows in Production: Long-Context LLM Architecture vs RAG vs Hybrid (2026)
ai-architecture · 1 min read

1M-token context windows are real, useful, and economically viable in 2026 — but only as one tool among several in a hybrid architecture. This article is a practitioner reference for production architecture decisions around long-context: cost math at 1M (dominated by prefill and KV-cache, not output), the well-documented quality failures (needle-in-haystack vs in-distribution distractors, lost-in-the-middle, attention sink, position-bias drift), the decision framework for choosing long-context vs RAG vs hybrid, and the operational instrumentation that catches regressions before customers do. The teams that treat long context as a RAG replacement spend 5-20x more per query than they need to; the teams that treat it as the synthesis layer on top of a retrieval-narrowed working set get the best of both worlds.

May 23, 2026
KV-Cache Engineering for LLM Inference: Paged Attention, Prefix Cache, and Prefill/Decode Disaggregation (2026)
ai-architecture · 1 min read

The single biggest determinant of whether an LLM feature is economically viable in production is not which model you chose — it is how the KV-cache is managed across the serving stack. This article is a practitioner reference for KV-cache engineering in 2026: paged attention as baseline, prefix caching for RAG hit rates of 70-85%, prefill-decode disaggregation for high-throughput deployments, and three 2026 production case studies (Gemma 4 cross-layer KV sharing in vLLM, Laguna XS.2 non-uniform tensor parallelism, DeepSeek V4 CSA+HCA reducing 1M-context cache by 90%). The compounding effect of the full stack: $1.39 → $0.07 per million tokens on the same physical hardware.

May 22, 2026
Two-Phase Commit Alternatives: Saga vs TCC vs Outbox vs Reservation-Then-Commit — A Decision Matrix for Distributed Transactions (2026)
microservices-patterns · 1 min read

Two-phase commit is the wrong default for distributed microservices: it blocks on coordinator failure, requires XA on every participant, holds locks across third-party-API latency, and breaks under network partitions. This article is a decision matrix for the five patterns that have replaced it — Saga, Try-Confirm-Cancel, Outbox, Reservation-Then-Commit, and Idempotency-with-Retry — including a deep treatment of TCC (the alternative most often misimplemented), the four-layer composition stack that production systems actually use, anti-patterns that recreate the 2PC problems, a multi-vendor travel-booking case study, and the trade-off summary that takes a workload description and points at the right combination.

May 22, 2026
OWASP LLM Top 10 (2025/2026): Architecture-Level Mitigations Mapped to Each Risk
cyber-security-patterns · 1 min read

The OWASP Top 10 for LLM Applications gave the industry a shared vocabulary for LLM risk in 2023 and a sharper, incident-evidenced revision in 2025. What it never gave — and was never intended to give — is the architecture that actually contains those risks in production. This article walks the 2025 list, and for each of the ten risks specifies the architectural mitigation that contains it, the topological layer where the mitigation lives (gateway, retrieval, model serving, agent loop, output handling, observability), the telemetry that proves it is working, and the common failure modes when teams treat the list as a checklist rather than a pipeline-architecture specification. New entries in the 2025 revision — System Prompt Leakage (LLM07), Vector and Embedding Weaknesses (LLM08), Unbounded Consumption (LLM10) — are addressed with the same per-risk architectural placement. The article closes with eight anti-patterns and a five-stage maturity ladder from "we read the list" to a continuous adversarial-testing capability that exercises the pipeline daily.

May 21, 2026
The Model Router Pattern: Cost-, Quality-, and Latency-Aware Routing Across LLM Providers (2026)
ai-architecture · 1 min read

The static model choice has become the most expensive architectural decision in a 2026 LLM system. The provider market now spans a 60-200x price spread end to end, the quality gap on the dominant production workload is below the noise floor of A/B telemetry, and the latency profile of frontier models has widened — not narrowed. The Model Router Pattern is the architectural answer: a routing layer that, for every inbound request, chooses the cheapest model on the provider mix whose expected quality and latency clear the workload-specific gate, with a deterministic fallback when the routing decision turns out wrong. This article specifies the three routing axes (cost, quality, latency) and why single-axis routing always degrades, the four router primitives (cascade, classifier, learned, embedding-similarity) and when each is right, the 2026 landscape (Martian, NotDiamond, RouteLLM, Bedrock Intelligent Prompt Routing, OpenRouter Auto, Portkey, LiteLLM), the four-component calibration loop, the cost math with worked per-million-token economics, eight anti-patterns, and the five-stage maturity ladder from "we picked GPT" to a router that contributes a measurable line item to gross margin.
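
The routing rule described above (cheapest model whose expected quality and latency clear the gate, with a deterministic fallback) can be sketched in a few lines. This is a minimal illustration, not the article's implementation: `ModelProfile`, `route`, `answer`, the `call` and `accept` callbacks, and every number in the catalogue are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class ModelProfile:
    name: str
    cost_per_mtok: float     # USD per million tokens (made-up figures)
    expected_quality: float  # 0..1 score from offline evals (hypothetical)
    p95_latency_s: float     # p95 latency in seconds (hypothetical)


def route(models: Sequence[ModelProfile], min_quality: float,
          max_latency_s: float, fallback: ModelProfile) -> ModelProfile:
    """Pick the cheapest model that clears both gates, else the fallback."""
    eligible = [m for m in models
                if m.expected_quality >= min_quality
                and m.p95_latency_s <= max_latency_s]
    return min(eligible, key=lambda m: m.cost_per_mtok) if eligible else fallback


def answer(prompt: str, models: Sequence[ModelProfile], min_quality: float,
           max_latency_s: float, fallback: ModelProfile,
           call: Callable[[ModelProfile, str], str],
           accept: Callable[[str], bool]) -> Tuple[str, str]:
    """Route, call, and escalate once to the fallback if the reply is rejected."""
    chosen = route(models, min_quality, max_latency_s, fallback)
    reply = call(chosen, prompt)
    if chosen != fallback and not accept(reply):
        chosen = fallback
        reply = call(chosen, prompt)
    return chosen.name, reply


# Illustrative catalogue; names and numbers are placeholders, not real prices.
SMALL = ModelProfile("small", 0.15, 0.78, 0.8)
MEDIUM = ModelProfile("medium", 1.00, 0.88, 1.5)
FRONTIER = ModelProfile("frontier", 15.00, 0.95, 6.0)
CATALOGUE = [SMALL, MEDIUM, FRONTIER]

print(route(CATALOGUE, min_quality=0.85, max_latency_s=2.0, fallback=FRONTIER).name)
```

The sketch filters on quality and latency before minimising cost, so no single axis decides alone. A production router would replace the static `expected_quality` field with a per-request estimate.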

May 21, 2026
Agentic RAG Architecture: Self-Query, Plan-Execute-Replan, Tool-Augmented Retrieval, and the Validation Loop (2026)
ai-architecture · 1 min read

The RAG pipeline that won 2024 — one question, one dense-vector lookup, one LLM call grounded on the top-k — is not the system that ships to production in 2026. The replacement is not bigger embeddings or a better re-ranker; it is the retrieval loop. Agentic RAG composes four architectural primitives: self-query decomposition that turns a multi-part question into a structured plan, plan-execute-replan with explicit iteration budgets that bound the loop, tool-augmented retrieval with a schema-driven router that chooses between dense indexes / SQL / graph / web search, and a validation loop with a sufficiency critic (gate to terminate-or-replan) and a faithfulness critic (deterministic gate before emission). Together they produce a bounded, observable, auditable retrieval agent. This article is the architecture-first playbook: what each primitive does, how they compose, the four failure modes specific to agentic RAG, eight anti-patterns that account for most production incidents, and the five-stage maturity ladder from classical-RAG-with-LLM-wrapper to full audit-grade deployment.

May 20, 2026
Speculative Decoding in Production LLM Inference: EAGLE-3, Medusa, vLLM, and the 3× Throughput Math (2026)
ai-architecture · 1 min read

The single largest under-used lever in production LLM inference in 2026 is speculative decoding. A correctly tuned vLLM deployment with EAGLE-3 or Medusa heads delivers 2.5–3.2× throughput on the same hardware for the same model with bit-exact outputs. The arithmetic: with α=0.8 acceptance, K=5 speculation length, and draft/target cost ratio c=0.08, the speedup formula (1 − α^(K+1)) / ((1 − α) × (K × c + 1)) lands around 2.6× and reaches roughly 3× once α climbs to about 0.85. Most production deployments have not adopted it, not because the technique is exotic but because the operational subtleties (draft-model selection, acceptance-rate decay on long contexts, batch interaction effects, and the cases where naive speculation actively loses) are not well understood. This article is the production playbook: what speculative decoding actually does to the autoregressive loop, the EAGLE / Medusa / Lookahead / n-gram family, the vLLM integration surface, the four workload shapes where speculation wins or loses, the long-context failure mode that catches teams off-guard, eight anti-patterns, and a five-stage maturity ladder.
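
The speedup formula quoted above is easy to evaluate directly, which helps when checking how sensitive the result is to the acceptance rate. The sketch below is only an evaluation of that formula. The function name `speculative_speedup` and the sweep values are illustrative assumptions, and real deployments would measure α rather than assume it.

```python
def speculative_speedup(alpha: float, k: int, c: float) -> float:
    """Evaluate (1 - alpha^(K+1)) / ((1 - alpha) * (K*c + 1)).

    alpha: per-token acceptance rate of the draft, in [0, 1)
    k:     speculation length (draft tokens proposed per step)
    c:     draft/target cost ratio per token
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    # Expected tokens emitted per verification step (geometric series).
    expected_tokens = (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
    # Cost of one step relative to a plain target-model decode step.
    relative_cost = k * c + 1.0
    return expected_tokens / relative_cost


# Sweep the acceptance rate at the K and c quoted in the text.
for alpha in (0.70, 0.80, 0.85, 0.90):
    speedup = speculative_speedup(alpha, k=5, c=0.08)
    print(f"alpha={alpha:.2f}  speedup={speedup:.2f}x")
```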

May 20, 2026
Hybrid Search and Re-ranking in Production RAG: BM25, Dense Vectors, Cross-encoders, and Everything In Between (2026)
ai-architecture · 1 min read

The single biggest reason production RAG systems return confident wrong answers is not the LLM, the prompt, or the chunking — it is the retriever returning the wrong documents into the top-k. Dense-vector-only retrieval gives 70% recall on conceptual queries and 30% on exact-term queries — and a better embedding model does not fix it because the failure mode is structural. The architecture the field has converged on in 2026: sparse retriever (BM25 or SPLADE) + dense retriever (bi-encoder embeddings) running in parallel, fused via RRF or weighted-α, cross-encoder re-ranker over the top-50 candidates, MMR diversification, ACL/freshness pre-filter, query understanding in front. This article is the deep-dive on what each primitive is doing, why each fails, the latency budget, eight anti-patterns, and the five-stage maturity ladder from single-retriever to calibrated-fusion-with-online-feedback.
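
The RRF fusion step named above can be sketched as follows. This is a minimal illustration of reciprocal rank fusion over two ranked lists, one sparse and one dense. The function name `rrf_fuse`, the document ids, and the constant `k=60` (a commonly used default, not a value taken from the article) are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60,
             top_n: int = 50) -> List[str]:
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


# Hypothetical retriever outputs, best hit first.
bm25_hits = ["d3", "d1", "d7"]
dense_hits = ["d1", "d4", "d3"]

fused = rrf_fuse([bm25_hits, dense_hits])
print(fused)  # candidates to hand to the cross-encoder re-ranker
```

RRF uses only ranks, so it needs no score normalisation between BM25 and cosine similarity. Weighted-α fusion, by contrast, requires both score scales to be calibrated first.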

May 19, 2026
Modules vs Vertical Slices: Macro vs Micro Architecture in the Modular Monolith (2026)
microservices-patterns · 1 min read

The argument "Clean Architecture vs Vertical Slice Architecture" is a category error — the two operate on different axes. A module is a macro-architectural decision about bounded contexts, public contracts, data ownership and communication style. A vertical slice is a micro-architectural decision about feature folder organisation inside a module. The killer property of a real modular monolith is that the two axes are independent: heterogeneous internals (Clean Architecture in one module, vertical slices in another, transaction scripts in a third) live safely behind homogeneous module boundaries enforced by project references, ArchUnit rules, and schema grants. This article is the technical deep-dive: the five enforceable module properties, the four slice properties, the cross-module communication spectrum from in-process method calls to outbox-backed event buses, the per-module internal-style decision matrix, multi-layered boundary enforcement, eight anti-patterns, and the five-stage maturity ladder from layered monolith to deliberate modular-monolith target state.

May 19, 2026
Agentic AI Debugging: When the Loop Doesn't Stop (2026)
ai-architecture · 1 min read

The single most expensive failure mode of an agentic system is not the agent producing the wrong answer; it is the agent producing no answer while burning through tool calls, context, and provider budget in a tight loop the runtime did not detect. This article covers six failure modes (infinite tool-call loop, plan-execute oscillation, sub-agent recursion, context thrash, hallucinated arguments, silent budget burn), six detection signals (step cap, semantic similarity, cost slope, identical call, delegation depth, context utilisation), five containment primitives (hard step-cap, budget kill-switch, tool-call dedupe, plan-diff guard, supervisor halt), a state machine with running/watching/throttled/halted states, a seven-field RCA template, eight anti-patterns, and a five-stage maturity ladder. This is how runaway loops become bounded incidents.
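
Two of the primitives listed above, the hard step-cap and tool-call dedupe, are simple enough to sketch together. This is an illustrative guard under stated assumptions, not the article's design: the class name `LoopGuard`, its thresholds, and the returned reason strings are all hypothetical.

```python
import json
from typing import Any, Dict, Optional, Tuple


class LoopGuard:
    """Halts an agent run on a hard step cap or on repeated identical tool calls."""

    def __init__(self, max_steps: int = 20, max_identical_calls: int = 2) -> None:
        self.max_steps = max_steps
        self.max_identical_calls = max_identical_calls
        self.steps = 0
        self.seen: Dict[Tuple[str, str], int] = {}

    def check(self, tool: str, args: Dict[str, Any]) -> Optional[str]:
        """Return a halt reason, or None if the call may proceed."""
        self.steps += 1
        if self.steps > self.max_steps:
            return "step_cap"
        # Canonicalise arguments so key order does not hide a repeated call.
        key = (tool, json.dumps(args, sort_keys=True))
        self.seen[key] = self.seen.get(key, 0) + 1
        if self.seen[key] > self.max_identical_calls:
            return "identical_call"
        return None


guard = LoopGuard(max_steps=10, max_identical_calls=2)
for _ in range(3):
    reason = guard.check("search", {"q": "refund policy", "top_k": 5})
print(reason)  # the third identical call trips the dedupe signal
```

The guard lives outside the model, so it holds even when the agent's own planning is the thing that has failed.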

May 18, 2026
Evaluation-Driven Development: Replacing TDD for LLM Systems (2026)
ai-architecture · 1 min read

Test-driven development does not survive the transition to LLM systems — the assertion cannot be strict-equality, the correct output is a distribution, the red-green-refactor loop has no green, and the assertion itself is fallible. Evaluation-driven development is the discipline that replaces TDD: the same shape of "write the assertion before the implementation, ratchet it as the implementation improves, gate every change on the verdict", but with eval sets instead of unit tests, distribution verdicts instead of boolean pass-fail, calibrated LLM judges instead of strict equality, and a ratcheted baseline instead of a fixed expected output. This article is the methodology, the eval-set hygiene (golden, regression, adversarial, drift), the four eval layers (unit, scenario, shadow, canary), the LLM-as-judge calibration practice, the CI integration, 8 anti-patterns, and the 5-stage maturity ladder.
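
The "ratcheted baseline" idea above can be made concrete with a small gate function. This is a sketch under assumptions: the name `ratchet_gate`, the use of a single pass rate as the verdict, and the `tolerance` noise margin are illustrative choices, not details from the article.

```python
from typing import Tuple


def ratchet_gate(pass_rate: float, baseline: float,
                 tolerance: float = 0.01) -> Tuple[bool, float]:
    """Gate a change on an eval-set pass rate and ratchet the baseline upward.

    A change passes if it does not fall more than `tolerance` below the
    baseline. The baseline only ever moves up, and only on a passing run.
    """
    passed = pass_rate >= baseline - tolerance
    new_baseline = max(baseline, pass_rate) if passed else baseline
    return passed, new_baseline


baseline = 0.85
for candidate_rate in (0.90, 0.80, 0.895):
    passed, baseline = ratchet_gate(candidate_rate, baseline)
    print(f"pass_rate={candidate_rate:.3f} passed={passed} baseline={baseline:.3f}")
```

The tolerance absorbs judge and sampling noise. The baseline still never moves down, so a slow regression cannot become the accepted norm.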

May 18, 2026
LLMjacking 2026: How Attackers Hijack Your Bedrock and OpenAI Quota — and the Seven-Layer Defence That Stops the $84,000 Weekend
ai-architecture · 1 min read

A finance team walked into the office on a Monday morning in early 2025 and found an $84,000 invoice for the previous 48 hours. The application had not been defaced; no customer data had been exfiltrated; the dashboards were green. The bill was the breach. This is LLMjacking: the unauthorised hijack of cloud-hosted LLM resources for compute monetisation, an AI-security failure mode that does not look like a security incident until the invoice arrives. The seven-layer defence-in-depth stack is the architectural response: workload identity replacing static keys, hard quota at the gateway, model-level RBAC, network isolation, behavioural analytics, automated kill switch, and continuous credential hygiene. The article provides an AWS-native reference architecture with Azure and GCP equivalents, an attack-lifecycle map from initial access to weekend burn, eight anti-patterns to retire, a five-stage maturity ladder, and a Monday-morning 24h / 7d / 30d action checklist that materially reduces exposure by Friday.

May 16, 2026
AI Compliance Architecture: One Control Plane for EU AI Act, GDPR, DPDP, HIPAA, and APPI (2026)
ai-architecture · 1 min read

A reference control-plane architecture for AI systems that have to satisfy multiple regulatory regimes at once. Covers inventory, policy, release gates, runtime controls, and the evidence fabric that connects them.

May 15, 2026
Air-Gapped AI Architecture: Offline LLM Systems for Regulated and Classified Environments (2026)
ai-architecture · 1 min read

A reference architecture for offline LLM systems in air-gapped environments. Covers signed update flows, local registries, offline retrieval, observability, security controls, and the real cost profile of air-gapped AI.

May 15, 2026
Multi-Tenant RAG Isolation: The 7 Attack Vectors and the Architecture That Closes Them (2026)
ai-architecture · 1 min read

Multi-tenant RAG has a security model that does not exist in single-tenant RAG and is not covered by generic SaaS multi-tenant discipline. The 2024–2025 incident record now has enough cross-tenant RAG leakage cases to classify the failure modes, and the result is a seven-vector taxonomy: cross-tenant retrieval leakage, embedding-space collisions, metadata-filter bypass, shared-index poisoning, re-ranker leakage, eval-set contamination, response-cache cross-talk. This article is the seven vectors with their mechanism and architectural defence, the per-tenant namespace pattern that closes them at every data surface, the eight anti-patterns that produce the bad outcomes, and the maturity ladder from Stage 0 (single shared everything) to Stage 4 (continuously-validated isolation).

May 14, 2026
Cost Engineering for LLM Features: From $100k to $1M Monthly Spend (2026)
ai-architecture · 1 min read

The $100k to $1M monthly LLM-spend transition is the architecturally serious crossing in the life of an LLM product. The teams that handle it well treat cost as a first-class architectural property — instrumented, budgeted, gated, attributed, and tuned — and they build the five-layer stack of budget gate, semantic cache, dynamic router, prompt compactor, and inference layer with an attribution feedback loop wrapped around it. This article is the architecture, the order to build it in, the 10k-RPM unit-economics drill-down that produces a 64% reduction through composed savings, the unglamorous levers (prefill/decode separation, KV-cache reuse, speculative decoding, batch endpoints, output-length discipline), the spot/reserved/on-demand procurement mix, 8 anti-patterns that produce the bad spend curve, and the 5-stage maturity ladder.

May 14, 2026
Build a Multi-Agent AI System with LangGraph + MCP + A2A: Beginner-Friendly End-to-End Tutorial (2026)
ai-architecture · 1 min read

A full beginner-friendly walk-through of building a four-agent AI system on a laptop with no GPU and a free LLM. We use LangGraph for orchestration (state, nodes, edges, conditional edges, checkpointing, human-in-the-loop with interrupt), MCP for tool access (the official filesystem server via stdio), and A2A for cross-process agent calls (agent card at /.well-known/agent-card.json, JSON-RPC message lifecycle). The four agents form a Learning Accelerator — a Curriculum Planner, an Explainer that reads local notes via MCP, a Quiz Generator exposed as an A2A server, and a Progress Coach supervisor that orchestrates the rest with SQLite checkpointing. Provider switch covers Gemini 2.0 Flash (free, default), Groq (free, fast) and OpenAI (cents per run). Langfuse for traces, DeepEval for LLM-as-judge regression tests. Every file is shown in full inline; no companion repo needed.

May 13, 2026
Prompt Injection Defence in Depth (2026): Six Layers from Input Sanitisation to Output Firewall
ai-architecture · 1 min read

Prompt injection in 2026 is no longer a research curiosity; it is the day-one architectural assumption. The six-layer defence-in-depth stack is the engineering response: input sanitisation and normalisation, intent classifier and injection detector, prompt-template hardening with delimiters and role separation, tool-use authorisation policy outside the prompt, output classifier and secondary review LLM, and output firewall for egress filtering and action-effect simulation. This article walks each layer with its threat model, engineering surface, and operational discipline; the build-order rationale; and the composition with category-aware guardrails, agent circuit breakers, observability, and incident response. It closes with eight anti-patterns to retire, a five-stage maturity ladder, and an honest summary of where the field sits in early 2026.

May 13, 2026
Agritech AI Architecture: Pasture Vision, Livestock Behaviour Models, and Low-Bandwidth Edge (NZ Reference, 2026)
ai-architecture · 1 min read

New Zealand agritech in 2026 lands the AI architecture conversation hardest on the constraints mainstream cloud-AI tutorials assume away: solar-powered devices on the cow's collar, intermittent cellular and satellite connectivity, the welfare envelope that takes precedence over production, and the data co-governance arrangement under the Algorithm Charter and Te Tiriti o Waitangi. This article walks the engineering deliverables for an agritech AI architecture in 2026: edge-first inference with welfare envelope on-device, multispectral pasture-vision with fixed-tower-drone-satellite fusion, behaviour-model training with the labelling discipline as the value-creating activity, store-and-forward synchronisation with explicit conflict resolution, federated learning across farms, and Te Tiriti and Algorithm Charter compliance engineered into the architecture. The material is NZ-anchored to Halter, Fonterra, Gallagher, LIC, and AgResearch, and globally portable. It closes with eight anti-patterns and a five-stage maturity ladder.

May 13, 2026

Stay one step ahead

Weekly in-depth analyses of AI systems, cloud architecture, distributed systems, and engineering leadership. Join more than 5,000 engineers.