AI-Native CI/CD for LLM Features 2026: 5 Gates, Eval, Canary

Q: How is CI/CD for LLM features different from MLOps CI/CD for trained models?

MLOps CI/CD ships a model artefact: the thing under deployment is a trained or fine-tuned model file, and the pipeline runs training, validation against a held-out set, model-registry promotion, canary or shadow scoring of the new model against the old, and progressive rollout. CI/CD for LLM features ships a configuration change (a prompt, a chain definition, a retriever parameter, a tool definition) against a model that is typically a third-party API and is itself outside the pipeline. The two are complementary, not substitutes: a fine-tuned model goes through MLOps to be promoted to production, then a prompt that targets that model goes through LLM-feature CI/CD to be canaried against the previous prompt. Mature AI organisations run both pipelines in parallel and compose them at the deployment layer. The dominant mistake in 2026 is using the MLOps pipeline as a proxy for prompt-change CI/CD, which is structurally wrong because the artefact, the cost profile, the test data, and the rollback mechanism all differ.

Q: What is a realistic cost budget for running eval gates on every PR?

For a feature with a golden set of 200 inputs, a regression set of 50 inputs, and judge-LLM scoring on both at 2026 GPT-4-class pricing, the offline eval gates (Gate 2 and Gate 3) cost roughly $2 to $10 per pipeline run depending on prompt length and judge complexity. The shadow eval gate (Gate 4) on 1,000 production traces costs roughly $5 to $50 per run, which is why it is typically opt-in via PR label rather than run on every commit. The full per-PR cost in a healthy implementation is therefore on the order of $10 to $60. The mistake teams make is to run the full pipeline on every commit on a busy repo, which can produce thousands of dollars of CI cost in a quarter without proportional value. The fix is to run a sample on each commit, run the full set on PR open and on label, and cache aggressively, keyed on the prompt hash, so that no-op pushes do not re-run the eval. Hard cost caps per CI job are non-negotiable to prevent runaway loops from causing budget incidents.
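The caching and cost-cap ideas above can be sketched as follows. This is a minimal illustration, not a reference implementation: the names `eval_cache_key`, `CostMeter`, and `run_eval_cached` are hypothetical, the cache is a plain dictionary standing in for whatever store the CI system provides, and the key is assumed to cover the model and eval-set version as well as the prompt so that changing either invalidates the cached result.

```python
import hashlib
import json


def eval_cache_key(prompt: str, model: str, eval_set_version: str) -> str:
    """Stable key: the same prompt, model, and eval set always hash to the same value."""
    payload = json.dumps(
        {"prompt": prompt, "model": model, "eval_set": eval_set_version},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CostCapExceeded(RuntimeError):
    """Raised when a CI job would spend more than its hard cap."""


class CostMeter:
    """Tracks spend for one CI job and refuses charges that would exceed the cap."""

    def __init__(self, cap_usd: float) -> None:
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def charge(self, amount_usd: float) -> None:
        if self.spent_usd + amount_usd > self.cap_usd:
            raise CostCapExceeded(
                f"charge of ${amount_usd:.2f} would exceed cap of ${self.cap_usd:.2f}"
            )
        self.spent_usd += amount_usd


def run_eval_cached(cache: dict, key: str, run_eval):
    """Return (result, was_cached); run_eval is only called on a cache miss."""
    if key in cache:
        return cache[key], True
    result = run_eval()
    cache[key] = result
    return result, False
```

In this sketch a push that leaves the prompt untouched produces the same key and skips the eval entirely, while the meter turns a runaway loop into a failed job rather than a budget incident.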

Q: How do you decide what goes into the golden set versus the regression set?

The golden set captures the intended capability surface of the feature: for each capability the feature is supposed to support, there should be 5 to 30 representative inputs covering the typical case, common variants, and known edge cases, each scored against an expected output property. The regression set captures past failures: every production incident that traced back to a prompt or chain regression contributes a permanent entry consisting of the input that triggered the incident, the wrong output, and the corrected output. The golden set evolves quarterly as the feature surface changes; the regression set grows monotonically because past failures are forever interesting. The two sets are also gated differently. The golden-set gate uses a configurable threshold (e.g. no more than 2% regression), which allows shipping changes that improve average behaviour at minor local cost. The regression-set gate is hard pass-fail on every entry, because the cost of a known incident recurring is higher than the cost of blocking a PR. Most organisations also maintain a separate adversarial sub-set for prompt-injection and jailbreak resistance with its own (typically tighter) threshold.
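A minimal sketch of the two gate policies, under stated assumptions: the 2% threshold is read here as an absolute drop in golden-set pass rate, and the names `GateResult`, `golden_gate`, and `regression_gate` are illustrative rather than taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    passed: bool
    reason: str


def golden_gate(
    baseline_pass_rate: float,
    candidate_pass_rate: float,
    max_regression: float = 0.02,
) -> GateResult:
    """Threshold gate: tolerate a small drop in pass rate, block a large one."""
    drop = baseline_pass_rate - candidate_pass_rate
    # Small tolerance so a drop of exactly max_regression is not rejected
    # because of floating-point rounding.
    if drop > max_regression + 1e-9:
        return GateResult(False, f"golden-set pass rate dropped by {drop:.1%}")
    return GateResult(True, "golden-set regression within threshold")


def regression_gate(results: dict) -> GateResult:
    """Hard gate: a single failing past-incident case blocks the change.

    `results` maps a case id to True (pass) or False (fail).
    """
    failed = sorted(case_id for case_id, ok in results.items() if not ok)
    if failed:
        return GateResult(False, "known incidents recurred: " + ", ".join(failed))
    return GateResult(True, "all regression cases pass")
```

A candidate that moves the golden-set pass rate from 90% to 89% passes the first gate, but a single failing incident case still blocks the merge through the second.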

Q: Should we use the same model as judge as the model in production?

No, and this is one of the most common mistakes in 2026 LLM CI/CD implementations. Self-judging produces systematic over-rating because the judge shares the same biases, blind spots, and failure modes as the model under evaluation — it is structurally unable to detect a class of errors it would itself produce, it over-rates verbose responses because verbose responses look thorough to the judge, and it creates a self-referential loop that masks real regressions while reporting confident-looking scores. Best practice is to use a different model family as judge (Claude grading GPT, GPT grading Claude, an open-weight judge for both), use multiple judges with required agreement on critical evaluations, calibrate judges quarterly against a small human-labelled sample to detect judge drift, and budget judge cost as a first-class line item because it can equal or exceed the cost of running the candidate prompt itself. The judge is part of the test infrastructure and deserves the same architectural attention as any other piece of the pipeline.
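Two of the practices above, required agreement between judges and periodic calibration against human labels, can be sketched in a few lines. The function names and the 0.85 agreement floor are assumptions chosen for illustration; a real calibration would likely use a chance-corrected statistic and a larger sample.

```python
from typing import Optional, Sequence


def unanimous_verdict(verdicts: dict) -> Optional[bool]:
    """Return the shared verdict when every judge agrees, else None.

    `verdicts` maps a judge name to True (pass) or False (fail). A None
    result is meant to be escalated to a human rather than auto-resolved.
    """
    if not verdicts:
        raise ValueError("at least one judge verdict is required")
    values = set(verdicts.values())
    if len(values) == 1:
        return values.pop()
    return None


def judge_agreement_rate(
    judge_labels: Sequence[bool], human_labels: Sequence[bool]
) -> float:
    """Fraction of the human-labelled sample on which the judge agrees."""
    if not human_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label sequences must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def judge_has_drifted(
    judge_labels: Sequence[bool],
    human_labels: Sequence[bool],
    min_agreement: float = 0.85,
) -> bool:
    """Flag the judge for review when agreement falls below the chosen floor."""
    return judge_agreement_rate(judge_labels, human_labels) < min_agreement
```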

Q: How fast does the kill switch need to be in a canary rollout?

Under thirty seconds end-to-end, from breach detection to traffic shifted away from the canary. The threshold matters because the number of requests exposed to a regression grows linearly with kill-switch latency: a 1% canary on a feature serving 10,000 RPM sends 100 requests per minute to the regressed prompt for as long as the kill switch takes to act. At thirty seconds that is about 50 requests; at ten minutes it is about 1,000, and the figure scales up with every later canary stage. The implementation is a feature flag whose value is the prompt-registry version pointer; reversion is a flag flip read by every request, not a code redeploy that requires container restarts and pod cycling. Teams that implement the canary mechanism but forget to optimise the revert path discover during their first real incident that they have built a slow alarm rather than a kill switch. The architecture decision that enables the fast revert is decoupling the prompt artefact from the application code at the runtime layer; without that decoupling, no amount of automation makes the kill switch meet the latency budget.
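The exposure arithmetic and the flag-as-version-pointer design can be illustrated together. This is a sketch under assumptions: `PromptFlag` is a hypothetical in-memory stand-in for a real feature-flag service, and the per-request `bucket` is assumed to be a value in [0, 1) derived from something stable such as a hash of the user id.

```python
from dataclasses import dataclass


def exposed_requests(
    total_rpm: float, canary_fraction: float, latency_seconds: float
) -> float:
    """Requests served by the canary between breach detection and traffic shift."""
    return total_rpm * canary_fraction * latency_seconds / 60.0


@dataclass
class PromptFlag:
    """In-memory stand-in for a flag holding prompt-registry version pointers."""

    stable_version: str
    canary_version: str
    canary_fraction: float
    canary_enabled: bool = True

    def version_for(self, bucket: float) -> str:
        """Pick the prompt version for one request; evaluated on every request."""
        if self.canary_enabled and bucket < self.canary_fraction:
            return self.canary_version
        return self.stable_version

    def kill(self) -> None:
        """Revert by flipping the flag; no redeploy or restart is involved."""
        self.canary_enabled = False
```

With the figures from the text, `exposed_requests(10_000, 0.01, 30)` comes to about 50 requests and `exposed_requests(10_000, 0.01, 600)` to about 1,000. Because `version_for` is read on every request, calling `kill()` takes effect on the next request without touching the application deployment.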

Q: When should we add Gate 4 (shadow eval on production traces)?

When three conditions are simultaneously met: the feature has accumulated at least a few weeks of production traffic with sufficient volume that the trace stream represents the real input distribution, the team has established a privacy-respecting trace storage layer with PII stripping and a documented retention policy, and the offline gates (Gate 2 and Gate 3) are routinely passing prompt changes that nonetheless cause incidents in production. The third condition is the trigger — Gate 4 exists to catch the long-tail regressions that the curated golden set misses by construction because the curated set captures the imagined input distribution while the production stream captures the real one. Gate 4 is the most expensive gate and the most architecturally substantial to build (trace storage, PII handling, replay tooling, paired comparison reporting), which is why it is typically the third gate added rather than the first. Most production-serious teams reach Gate 4 within nine to twelve months of their first LLM feature shipping; teams under regulatory pressure typically reach it sooner because the audit trail Gate 4 produces is itself a compliance artefact.
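The paired comparison report mentioned above can be sketched as a replay loop. Everything here is illustrative: `baseline_fn`, `candidate_fn`, and `prefer_fn` are hypothetical callables standing in for the old prompt, the new prompt, and a pairwise judge, and the traces are assumed to have been PII-stripped before they reach this code.

```python
def paired_comparison(traces, baseline_fn, candidate_fn, prefer_fn, max_samples=5):
    """Replay stored inputs through both prompt versions and tally judge preferences.

    prefer_fn(trace, old_output, new_output) must return one of
    "baseline", "candidate", or "tie".
    """
    counts = {"baseline": 0, "candidate": 0, "tie": 0}
    candidate_losses = []
    for trace in traces:
        old_output = baseline_fn(trace)
        new_output = candidate_fn(trace)
        verdict = prefer_fn(trace, old_output, new_output)
        if verdict not in counts:
            raise ValueError(f"unexpected verdict: {verdict!r}")
        counts[verdict] += 1
        # Keep a few side-by-side samples where the candidate lost, for review.
        if verdict == "baseline" and len(candidate_losses) < max_samples:
            candidate_losses.append(
                {"input": trace, "baseline": old_output, "candidate": new_output}
            )
    return {"counts": counts, "candidate_losses": candidate_losses}
```

The sampled losses are the part a reviewer actually reads; the counts alone say how often the candidate was worse, not how.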

Q: How does prompt diff review differ from code diff review?

A code diff review is dominated by reading line-level changes and judging their correctness against the surrounding code; a prompt diff review is dominated by reading semantic changes (reordering, rewording, added examples, changed variable interpolation) and judging their effect on the output distribution against the surrounding eval evidence. The PR surface for a prompt change should include the textual diff but also the lint report, the eval delta broken down by capability sub-set, the cost and latency deltas, the shadow-eval comparison with sampled side-by-side disagreements, the cost projection at production volume, the change in output length distribution, the rollout plan, and the named human reviewers required. A naive line-diff view of a prompt change without any of this scaffolding will be approved on autopilot because the reviewer cannot tell what the change actually does. Mature teams build a prompt-review bot that posts the eval and cost reports as PR comments automatically, links to the trace-replay surface, and blocks merge until at least one human reviewer has explicitly approved the prompt diff with a separate signal beyond the standard PR approval. The diff itself benefits from a structured-diff tool that surfaces semantic changes rather than raw line additions and deletions, which is increasingly available as a category in 2026 tooling.

Q: What is the minimum viable version of this pipeline for a small team?

Stage 1 of the maturity ladder: a single golden set of perhaps 30 to 50 hand-curated inputs, an eval that runs on merge using Promptfoo or an equivalent, the eval result posted to Slack or as a PR comment, and a documented manual rollback procedure that consists of pinning the prompt registry to the previous version. This is the cheapest possible automation that delivers real value — it surfaces regressions within hours rather than days and creates the institutional habit of reading eval reports before considering a prompt change shipped. From Stage 1, the next investment is usually Gate 1 (lint) and the regression set discipline rather than jumping to canary, because the leverage is highest where the cost is lowest. A team of three to ten engineers can reach Stage 2 (eval on PR with gating) within a quarter of focused investment without a dedicated AI platform engineer; reaching Stage 3 typically requires a dedicated platform investment because of the trace-storage and privacy work. The mistake to avoid is attempting to jump from Stage 0 to Stage 4 in a single quarter and shipping nothing useful in the meantime; the ladder is the ladder for a reason.
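For orientation, the shape of a Stage 1 eval is small enough to sketch directly. This is not how Promptfoo works internally; it is a hand-rolled stand-in in which each golden case carries a hypothetical `must_contain` property, `generate` is whatever function calls the model, and the report string is what would be posted to Slack or as a PR comment.

```python
def run_golden_set(cases, generate):
    """Run every golden case and record which ones miss their expected property.

    Each case is a dict with "id", "input", and "must_contain" keys.
    """
    failures = []
    for case in cases:
        output = generate(case["input"])
        if case["must_contain"].lower() not in output.lower():
            failures.append(case["id"])
    return {
        "total": len(cases),
        "passed": len(cases) - len(failures),
        "failures": failures,
    }


def format_report(result):
    """Render a short plain-text summary suitable for a chat message or PR comment."""
    lines = [f"Golden set: {result['passed']}/{result['total']} passed"]
    if result["failures"]:
        lines.append("Failed cases: " + ", ".join(result["failures"]))
    return "\n".join(lines)
```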

Q: Who owns this pipeline in the engineering organisation?

The design of the pipeline is owned by the AI architect (the role described in the companion article on day-to-day AI architect responsibilities), the implementation is owned by the AI platform engineering team in larger organisations or by the founding engineer doubling as platform engineer in startups, and the operation of each gate against a specific prompt change is owned by the application team shipping that change. The boundary that confuses organisations is ownership of the eval set. The application team owns the contents of the golden and regression sets for their feature, the AI architect owns the cross-cutting evaluation standards (which judges, what threshold policies, what cost caps), and the platform team owns the harness that runs the evals. The AI architect is accountable for the pipeline existing and for its cross-team consistency; the platform team is accountable for it running reliably and within budget; the application teams are accountable for the gates passing for their features. The split mirrors the classical platform-architect / platform-engineering / product-engineering split and the same boundary disciplines apply.

Q: How does this pipeline handle multi-step agent and chain changes versus single-prompt changes?

The same five gates apply but the eval surface widens substantially. A multi-step agent change can affect outcomes through any combination of changed prompts, changed tool definitions, changed planner logic, changed memory policy, and changed routing. The eval set must therefore include end-to-end traces (input to final response) rather than only single-prompt inputs and outputs, and the judge must be able to score multi-step outcomes including intermediate tool calls and their results. The cost per eval grows roughly linearly with the depth of the agent loop, which is why agent changes typically run smaller eval sets per gate but more frequently and use higher-quality judges. Gate 4 (shadow eval on production traces) is particularly valuable for agent changes because the long tail of agent failures (runaway loops, tool-call cycles, premature termination) is the part the curated eval set is least able to capture. Canary rollout for agent changes also requires longer dwell time at each canary stage because agent failures are often time-extended (a runaway loop manifests over minutes, not over a single request), and the kill-switch criteria must include detection of these temporal failure modes in addition to the per-request quality and cost metrics.
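Two of the temporal failure modes named above, runaway loops and tool-call cycles, lend themselves to a simple trace check. The sketch below is an assumption-laden illustration: the function name, the step budget of 20, and the repeat limit of 3 are all hypothetical, a trace is represented as a list of `(tool_name, serialised_args)` tuples, and premature termination is not covered because detecting it needs task-specific knowledge of what a complete run looks like.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def detect_temporal_failure(
    tool_calls: Sequence[Tuple[str, str]],
    max_steps: int = 20,
    repeat_limit: int = 3,
) -> Optional[str]:
    """Classify one agent trace as a runaway loop, a tool-call cycle, or neither.

    Returns "runaway_loop" when the trace exceeds the step budget,
    "tool_call_cycle" when any identical call repeats more than
    repeat_limit times, and None otherwise.
    """
    if len(tool_calls) > max_steps:
        return "runaway_loop"
    counts = Counter(tool_calls)
    if any(n > repeat_limit for n in counts.values()):
        return "tool_call_cycle"
    return None
```

A canary monitor could run a check of this kind over completed and in-flight traces and feed the count of flagged traces into the kill-switch criteria alongside the per-request metrics.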