A/B Testing & Online Experimentation for LLM Features

Q: Why are offline evals not enough for LLM features?

Offline evaluations are necessary but fundamentally limited: they measure a model or prompt against a frozen dataset under artificial conditions, which cannot capture what happens with real users on live traffic. An offline eval can tell you a variant scores higher on your curated test set, but it is blind to several things that decide whether the change is actually good in production. It does not see real cost and latency under concurrent load, it does not reflect the long tail of genuine user queries that your test set never anticipated, and it cannot detect that a change which helps the average case quietly degrades an important segment such as non-English users or a specific workflow. Worst of all, a higher offline score can create false confidence that leads teams to ship regressions. The correct mental model is that an offline eval produces a hypothesis (this change might be better), and only a controlled online experiment, which randomises real users and measures their real outcomes plus cost and latency guardrails, can confirm or refute it. Use offline evals as a fast pre-deploy gate and online experiments as the basis for the ship decision.

Q: What is an OEC and what are guardrail metrics?

The OEC, or Overall Evaluation Criterion, is the single primary metric an experiment is designed to move, chosen to represent the true goal of the feature — for example task completion rate, helpful-vote rate, issue resolution rate, or a downstream conversion. Committing to one OEC before launching forces clarity about what success means and prevents the common trap of fishing through dozens of metrics after the fact until one looks positive. Guardrail metrics are the constraints that must not regress even if the OEC improves: typical guardrails for an LLM feature are p95 and p99 latency, cost per request, error and refusal rate, and safety or policy-violation flags. They exist because LLM changes routinely trade these dimensions against each other — a prompt that produces better answers may triple cost or add hundreds of milliseconds of latency, which can make it a net loss despite a higher quality score. A change ships only when the OEC improves with statistical significance and every guardrail stays within its acceptable bound, which is exactly the multi-dimensional judgement a single eval number cannot express.
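
The ship rule described above can be sketched as a small decision function. This is a minimal illustration, not a prescribed implementation: the `Guardrail` class, the `should_ship` function, and the thresholds are all hypothetical names and values, and every guardrail is assumed to be a lower-is-better metric compared with a simple relative bound rather than a proper non-inferiority test.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Guardrail:
    name: str
    control: float                # metric value observed in control
    treatment: float              # metric value observed in treatment
    max_relative_increase: float  # e.g. 0.05 tolerates up to +5%

    def passes(self) -> bool:
        # Assumption: lower is better (latency, cost, error rate).
        return self.treatment <= self.control * (1 + self.max_relative_increase)


def should_ship(oec_lift: float, oec_p_value: float,
                guardrails: list[Guardrail],
                alpha: float = 0.05) -> tuple[bool, list[str]]:
    """Return (ship?, reasons the change was blocked)."""
    reasons = []
    if oec_lift <= 0:
        reasons.append("OEC did not improve")
    elif oec_p_value >= alpha:
        reasons.append("OEC lift not statistically significant")
    # A single regressed guardrail blocks the ship, whatever the OEC says.
    reasons += [f"guardrail regressed: {g.name}" for g in guardrails if not g.passes()]
    return (not reasons, reasons)


if __name__ == "__main__":
    decision = should_ship(
        oec_lift=0.03,
        oec_p_value=0.01,
        guardrails=[
            Guardrail("p95_latency_ms", control=800, treatment=820, max_relative_increase=0.05),
            Guardrail("cost_per_request", control=1.0, treatment=3.0, max_relative_increase=0.10),
        ],
    )
    print(decision)
```

In the usage example the OEC improves significantly, yet the change is blocked because cost per request tripled, which is the trade-off a single quality score would hide.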

Q: What is statistical power and why do LLM experiments need so much traffic?

Statistical power is the probability that an experiment will detect a real effect of a given size when one truly exists, and the convention is to design for 80 percent power. An underpowered experiment, one with too few samples, will frequently fail to reach significance even when the treatment genuinely helps, so real improvements get discarded as noise. The required sample size grows as the effect you want to detect gets smaller (roughly with the inverse square of the effect size, so halving the detectable effect about quadruples the traffic) and as your metric gets noisier. For a binary metric and a fixed relative lift, it also grows as the baseline rate falls, because rare events carry less information per user. LLM experiments often need more traffic than teams expect for two reasons: the metrics that matter, such as helpful-vote rate or task completion, tend to have high variance, and the effects you care about are often small relative deltas like a two percent lift, which require large samples to distinguish from chance. The practical discipline is to estimate the required sample size before launching, using your baseline conversion rate, the minimum detectable effect you consider worth shipping, a significance level usually set at 0.05, and 80 percent power, and then to run the experiment to that horizon rather than stopping when results first look good.
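
The pre-launch estimate can be sketched with the standard normal-approximation formula for comparing two proportions. The function name `sample_size_per_arm` and its parameters are illustrative, and the sketch assumes a binary metric, a two-sided test, and equal traffic in both arms.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline: float, relative_mde: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed per arm to detect a relative lift on a binary metric."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)          # rate under the hoped-for lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


if __name__ == "__main__":
    # A 2% relative lift needs far more traffic than a 10% one.
    print(sample_size_per_arm(baseline=0.30, relative_mde=0.02))
    print(sample_size_per_arm(baseline=0.30, relative_mde=0.10))
```

With a 30 percent baseline, a two percent relative lift works out to something on the order of 90,000 users per arm under these assumptions, which is why small deltas on noisy metrics are expensive to confirm.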

Q: What is the peeking problem and how do I avoid it?

Peeking is the practice of repeatedly checking a fixed-horizon experiment as data accumulates and stopping as soon as the result crosses the significance threshold, and it is one of the most common ways teams reach confident but wrong conclusions. The reason it is dangerous is statistical: a standard fixed-horizon p-value is only valid if you look once at the predetermined sample size. If you check the test many times and stop at the first moment it appears significant, you dramatically inflate the false-positive rate, because random fluctuations will occasionally cross the threshold even when there is no real effect, and continuous monitoring all but guarantees you eventually catch one of those fluctuations. There are two robust fixes. The first is to pre-commit to a sample size or run duration before launching and only evaluate significance once you reach it, ignoring the interim wiggles. The second, more flexible option is to use sequential testing methods — always-valid p-values or confidence sequences — which are explicitly designed to remain statistically valid no matter how often you look, so you can monitor an experiment continuously and stop early safely when the evidence is genuinely strong.
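
The inflation is easy to see in a simulated A/A test, where both arms share the same true rate and every "significant" result is a false positive. The sketch below is illustrative only: `simulate_aa`, its default sizes, and the pooled z-test are assumptions chosen for the demonstration, not a recommended analysis pipeline.

```python
import math
import random
from statistics import NormalDist


def z_test_p_value(conv_a: int, conv_b: int, n: int) -> float:
    """Two-sided pooled z-test for two arms of n users each."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 1.0  # no variation observed, so no evidence of a difference
    z = (conv_b - conv_a) / n / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_aa(n_sims=400, n_per_arm=1000, looks=10, rate=0.3, alpha=0.05, seed=0):
    """Return (false-positive rate when peeking, rate with one final look)."""
    rng = random.Random(seed)
    step = n_per_arm // looks
    peeking_hits = fixed_hits = 0
    for _ in range(n_sims):
        conv_a = conv_b = 0
        ever_significant = False
        p = 1.0
        for look in range(1, looks + 1):
            for _user in range(step):
                conv_a += rng.random() < rate
                conv_b += rng.random() < rate
            p = z_test_p_value(conv_a, conv_b, look * step)
            ever_significant = ever_significant or p < alpha
        peeking_hits += ever_significant  # would have stopped at some look
        fixed_hits += p < alpha           # only the final look counts
    return peeking_hits / n_sims, fixed_hits / n_sims


if __name__ == "__main__":
    peeking, fixed = simulate_aa()
    print(f"false positives with peeking: {peeking:.1%}, fixed horizon: {fixed:.1%}")
```

Because the final look is one of the interim looks, the peeking rate can never be lower than the fixed-horizon rate; in practice it comes out several times higher than the nominal 5 percent.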

Q: When should I use interleaving instead of an A/B test?

Interleaving is the better choice when you are comparing two ranking or retrieval systems, such as two different vector-search configurations or two rerankers, because it is far more sensitive than a conventional A/B test for that specific question. In a classic A/B test you split users into two groups and show each group one system, then compare aggregate metrics; the comparison is between-subjects and therefore noisy, requiring substantial traffic. Interleaving instead mixes the results from both variants into a single ranked list shown to the same user within one query, then attributes the user's clicks or selections to whichever variant contributed each chosen result. Because every user effectively evaluates both systems simultaneously on the same query, the comparison is within-subjects and removes much of the between-user variance, which is why interleaving commonly reaches a reliable verdict with one to two orders of magnitude less traffic than an equivalent A/B test. The trade-off is that interleaving is specialised to ranking comparisons and does not generalise to arbitrary outcome metrics like downstream conversion, so use it for retrieval and ranking changes and fall back to A/B testing for end-to-end behavioural outcomes.
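
One widely used scheme is team-draft interleaving, sketched below under simplifying assumptions: results are identified by string ids, clicks are the only signal, and the names `team_draft_interleave` and `credit_clicks` are illustrative rather than taken from any library.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, k, rng):
    """Build one mixed list of up to k results and record which side supplied each."""
    rankings = {"A": ranking_a, "B": ranking_b}
    interleaved, team = [], {}
    count = {"A": 0, "B": 0}
    while len(interleaved) < k:
        # The side with fewer picks goes next; ties are broken by a coin flip.
        if count["A"] != count["B"]:
            order = ["A", "B"] if count["A"] < count["B"] else ["B", "A"]
        else:
            order = rng.sample(["A", "B"], 2)
        for side in order:
            # Each side contributes its highest-ranked result not yet shown.
            pick = next((d for d in rankings[side] if d not in team), None)
            if pick is not None:
                interleaved.append(pick)
                team[pick] = side
                count[side] += 1
                break
        else:
            break  # both rankings are exhausted
    return interleaved, team


def credit_clicks(team, clicked):
    """Attribute each clicked result to the side that contributed it."""
    wins = {"A": 0, "B": 0}
    for doc in clicked:
        if doc in team:
            wins[team[doc]] += 1
    return wins


if __name__ == "__main__":
    rng = random.Random(0)
    mixed, team = team_draft_interleave(
        ["d1", "d2", "d3", "d4"], ["d3", "d1", "d5", "d6"], k=4, rng=rng
    )
    print(mixed, credit_clicks(team, clicked=[mixed[0]]))
```

Aggregating the per-query wins across many queries gives the preference between the two systems; the coin flip on ties keeps either side from systematically owning the top slot.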

Q: What is the right randomisation unit for an LLM experiment?

The randomisation unit is the entity you bucket into control or treatment, and choosing it correctly is essential for a valid experiment. The most common correct choice is the user (or a stable anonymous id), because a user should have a consistent experience across their session and because user-level outcomes are what you usually care about. The cardinal mistake is randomising per request, which assigns the same user to different variants on different calls; this contaminates the comparison, breaks the independence assumptions the statistics rely on, and makes user-level outcome metrics meaningless because a single user contributes to both arms. In some cases a coarser unit is required: if your feature operates at the level of an organisation or workspace and users within it share state or can see each other's results, you should randomise by org to avoid interference between treatment and control users in the same group. The implementation that guarantees consistency is deterministic hash-based assignment — hash the chosen unit's id together with the experiment id to assign a bucket — so the same unit always lands in the same variant for the life of the experiment, regardless of how many requests it makes.
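
A minimal sketch of deterministic hash-based assignment follows. The function name `assign_variant`, the choice of SHA-256, and the equal split across variants are assumptions made for illustration; the unit id can be a user id, an anonymous id, or an org id depending on the randomisation unit you chose.

```python
import hashlib


def assign_variant(unit_id: str, experiment_id: str,
                   variants=("control", "treatment")) -> str:
    """Map a unit to a variant, stably for the life of the experiment."""
    # Including the experiment id decorrelates assignments across experiments.
    key = f"{experiment_id}:{unit_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


if __name__ == "__main__":
    # The same user gets the same answer on every request.
    print(assign_variant("user-123", "new-prompt-v2"))
    print(assign_variant("user-123", "new-prompt-v2"))
```

Because the assignment is a pure function of the two ids, no lookup table is needed and any service that handles a request can compute the bucket independently.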

Q: Can I use an LLM-as-judge as my experiment metric?

You can use an LLM-as-judge as one input to an experiment, but it should rarely be the sole metric, and you must guard against its biases. An LLM judge is attractive because it scales: it can score thousands of responses for qualities like helpfulness or correctness far more cheaply than human raters. However, it has well-documented failure modes — it can favour longer or more confident answers regardless of correctness, it can be inconsistent across runs, and if the same model family generated and is judging the responses there is a real risk of self-preference bias that systematically flatters one variant. The sound approach is to treat the LLM judge as a proxy that must be validated against ground truth: periodically check its scores against human judgements on a sample to confirm it agrees, and pair it with metrics grounded in real user behaviour such as helpful votes, task completion, or downstream conversion, plus the cost and latency guardrails. In short, an LLM judge can accelerate measurement and is useful for offline iteration, but a production ship decision should rest on real user outcomes and guardrails, with the judge as corroborating evidence rather than the verdict.
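
One way to run that validation is to compare judge and human labels on the same sample using Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below assumes categorical labels (for example 1 for "helpful" and 0 for "not helpful"); the function name and the sample data are hypothetical.

```python
def cohens_kappa(judge_labels, human_labels) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    n = len(judge_labels)
    if n == 0 or n != len(human_labels):
        raise ValueError("need two non-empty label lists of equal length")
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    labels = set(judge_labels) | set(human_labels)
    # Agreement expected if both raters labelled independently at their own base rates.
    expected = sum(
        (judge_labels.count(label) / n) * (human_labels.count(label) / n)
        for label in labels
    )
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    judge = [1, 1, 0, 1, 0, 1, 1, 0]
    human = [1, 1, 0, 0, 0, 1, 1, 1]
    print(f"kappa = {cohens_kappa(judge, human):.2f}")
```

A kappa near 1 suggests the judge tracks human raters closely, while a value near 0 means its agreement is no better than chance; what counts as acceptable is a judgement call for your team.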

Q: When should I use a multi-armed bandit instead of an A/B test?

A multi-armed bandit is the better tool when you have many variants to compare and the cost of showing users a worse variant during the test is significant, because a bandit adaptively shifts traffic toward the better-performing variants as evidence accumulates rather than holding fixed allocations to the end. A classic A/B (or A/B/n) test keeps each variant at a fixed share of traffic for the whole run, which is ideal when your goal is a clean, unbiased causal estimate of how each variant compares, but it is wasteful when you are exploring a large set of candidate prompts or models because a lot of traffic keeps flowing to clearly inferior options. A bandit minimises this regret by exploiting what it has learned while still exploring, which is well suited to high-volume surfaces where you mainly want to maximise the cumulative outcome and care less about a precise effect size for every arm. The trade-offs are that bandits give less clean causal estimates and confidence intervals than fixed A/B tests, and they assume the environment is reasonably stationary. A common pattern is to use bandits for ongoing optimisation of many variants on high-traffic surfaces, and fixed A/B tests when you need a trustworthy causal measurement of a specific change before committing to it.
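
Thompson sampling is one common bandit algorithm that shows the adaptive behaviour described above. The sketch assumes a binary reward (such as a helpful vote), a stationary environment, and a uniform Beta(1, 1) prior; the class name, arm names, and success rates are made up for the simulation.

```python
import random


class ThompsonBandit:
    """Beta-Bernoulli Thompson sampling over a fixed set of arms."""

    def __init__(self, arms, rng):
        self.rng = rng
        # Beta(1, 1) prior per arm, stored as [successes + 1, failures + 1].
        self.params = {arm: [1, 1] for arm in arms}

    def choose(self):
        # Draw a plausible success rate for each arm and play the best draw.
        draws = {arm: self.rng.betavariate(a, b) for arm, (a, b) in self.params.items()}
        return max(draws, key=draws.get)

    def update(self, arm, success):
        self.params[arm][0 if success else 1] += 1


def simulate(true_rates, rounds, seed=0):
    """Run the bandit against simulated users and count pulls per arm."""
    rng = random.Random(seed)
    bandit = ThompsonBandit(list(true_rates), rng)
    pulls = {arm: 0 for arm in true_rates}
    for _ in range(rounds):
        arm = bandit.choose()
        bandit.update(arm, rng.random() < true_rates[arm])
        pulls[arm] += 1
    return pulls


if __name__ == "__main__":
    print(simulate({"prompt_a": 0.2, "prompt_b": 0.5}, rounds=2000))
```

Early on the arms are sampled roughly evenly; as evidence accumulates, most traffic drifts to the stronger arm, which is the regret reduction a fixed-allocation test gives up in exchange for cleaner estimates.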