Evaluation-Driven Development (EDD) 2026: The TDD Replacement for LLM Systems

Q: Why does TDD break for LLM-powered features, and what specifically needs to change in the development discipline?

TDD breaks structurally in four ways, and the four compound.

First, the strict-equality assertion that TDD depends on cannot be written for an LLM call. Two invocations of the same prompt can produce two different outputs, both of which may be correct. Setting temperature to zero does not fix this, because different model versions and runtime conditions still produce different outputs.

Second, the "correct" output is a distribution rather than a single value. An LLM feature has a range of acceptable outputs for any given input (multiple correct rewrites, multiple correct summaries, multiple correct answers to an open-ended question), so the eval measures whether the output falls inside that range, not whether it equals a specific value.

Third, the red-green-refactor loop has no green. An EDD eval produces a score on a continuous scale, and the question is whether that score improved or regressed relative to the prior version. There is no binary pass-fail, only "better than the last version on the metric the team has agreed to optimise for, without regressing on the metrics the team has agreed to protect".

Fourth, the assertion itself is fallible. An LLM-as-judge can misclassify, a similarity metric can reward verbose outputs over concise ones, and a rubric can be ambiguous in exactly the cases the team cares most about. The assertion therefore has to be calibrated against human judgement periodically, a discipline that has no analogue in TDD.

What changes is the inner loop: it reshapes around eval sets instead of unit tests, distribution verdicts instead of boolean pass-fail, calibrated judges instead of strict equality, and a ratcheted baseline instead of a fixed expected output. The shape of the discipline survives the transition: write the assertion before the implementation, ratchet the assertion as the implementation improves, and gate every change on the assertion's verdict. Only the artefact changes, from unit tests to eval sets.
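To make the reshaped inner loop concrete, here is a minimal Python sketch of the four pieces named above: a range-style score instead of strict equality, a mean score over repeated samples, a gate against a ratcheted baseline, and a judge-versus-human agreement check. Everything in it is an illustrative assumption rather than a prescribed EDD API: the names `EvalCase`, `score_output`, `run_eval`, `gate`, `ratchet` and `judge_agreement`, the keyword-based rubric, the three samples per case and the `tolerance` of 0.02 are all hypothetical. A real eval would likely swap the keyword rubric for a calibrated judge.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class EvalCase:
    """One eval-set entry. required_terms is a hypothetical, simplistic rubric."""
    prompt: str
    required_terms: Tuple[str, ...]


def score_output(case: EvalCase, output: str) -> float:
    """Score in [0, 1]: fraction of rubric terms present (case-insensitive).

    Many differently worded outputs can score 1.0, so this checks that the
    output falls inside an acceptable range rather than equals one string.
    """
    text = output.lower()
    hits = sum(1 for term in case.required_terms if term.lower() in text)
    return hits / len(case.required_terms)


def run_eval(generate: Callable[[str], str],
             cases: Sequence[EvalCase],
             samples: int = 3) -> float:
    """Mean score over every case, sampling each prompt several times."""
    scores = [score_output(case, generate(case.prompt))
              for case in cases
              for _ in range(samples)]
    return mean(scores)


def gate(candidate: Dict[str, float],
         baseline: Dict[str, float],
         optimise: str,
         protect: List[str],
         tolerance: float = 0.02) -> bool:
    """Accept a change only if the optimised metric improved and no
    protected metric dropped more than `tolerance` below the baseline."""
    improved = candidate[optimise] > baseline[optimise]
    protected_ok = all(candidate[m] >= baseline[m] - tolerance for m in protect)
    return improved and protected_ok


def ratchet(baseline: Dict[str, float],
            candidate: Dict[str, float]) -> Dict[str, float]:
    """New baseline after an accepted change: each metric keeps its
    high-water mark, so the bar only ever moves up."""
    return {m: max(baseline[m], candidate.get(m, baseline[m])) for m in baseline}


def judge_agreement(judge_labels: Sequence[bool],
                    human_labels: Sequence[bool]) -> float:
    """Calibration check: share of items where the automated judge's
    verdict matches the human label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)
```

In this sketch a change would be merged only when `gate(...)` returns true, after which `ratchet(...)` produces the baseline for the next change. Keeping the per-metric maximum, instead of simply copying the candidate's scores, is a deliberate choice: it stops protected metrics from drifting downward by one tolerance step per release. `judge_agreement` would be run periodically on a human-labelled sample, with a low agreement rate signalling that the judge or rubric needs revisiting before its scores are trusted.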