Evaluating AI Agents: Trajectory & Tool-Use Evals (2026)

Q: How is evaluating an AI agent different from evaluating an LLM?

Standard LLM evaluation grades an output — given an input, is the produced text correct, relevant, or well-formed — and for single-turn tasks that is enough. Agent evaluation cannot stop at the output, because an agent's quality lives in its trajectory: the sequence of decisions, tool calls, arguments, and recoveries it makes to reach a result. The same final answer can come from a clean three-step path or from a dangerous, lucky, twelve-step detour that called the wrong tools, leaked data through a bad intermediate step, or looped repeatedly before stumbling onto the answer. Output-only scoring rewards the lucky detour and the dangerous shortcut exactly as much as the correct path, which is precisely why agents that pass naive output evals still fail in production. Agent evaluation therefore measures behaviour across the whole multi-step task: did it select the right tools, pass correct arguments, take only necessary actions, recover sensibly from errors, stay within safety bounds, and do so efficiently and consistently. The core shift is from "was the answer right" to "was the behaviour right," because in an autonomous system that takes real actions, how the agent reached the answer determines whether it is safe to deploy, and only trajectory-aware evaluation can see it.

Q: What should an agent evaluation actually measure?

An agent eval scores several dimensions, each catching a failure class that output-only evaluation misses. Task success answers whether the agent achieved the goal at all, but on its own it hides how. Tool selection measures whether the agent called the right tools and avoided unnecessary or wrong ones. Argument correctness checks whether the arguments passed to each tool were valid and intended, catching malformed calls and wrong parameters. Trajectory quality assesses whether the overall path was sensible and safe, distinguishing a clean route from a lucky-but-dangerous one. Efficiency tracks steps, tokens, cost, and latency, exposing loops and redundant actions that a correct final answer conceals. Safety checks whether any unsafe or out-of-scope action occurred, such as a data leak or a destructive call. And robustness measures consistency across repeated runs and small input variations, surfacing the non-deterministic flakiness that makes an agent unreliable even when it sometimes works. The unifying point is that a single "did it answer correctly" number hides the behaviours that actually determine whether an agent is safe to deploy, so you measure the path rather than only the destination. In practice you combine these into a scorecard, with safety and core success treated as hard requirements and the rest as quality metrics, because an agent that is correct but unsafe or wildly inefficient is not production-ready.
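
As an illustration of the scorecard idea, the sketch below treats task success and safety as hard requirements and averages the remaining dimensions into a quality score. The class name, field names, the unweighted mean, and the 0.8 threshold are all assumptions made for the example, not a standard.

```python
from dataclasses import dataclass


@dataclass
class Scorecard:
    """Per-task scores. Every field name here is illustrative."""
    task_success: bool           # hard requirement
    safety_violations: int       # hard requirement: must be zero
    tool_selection: float        # quality metrics, each in [0.0, 1.0]
    argument_correctness: float
    trajectory_quality: float
    efficiency: float
    robustness: float

    def hard_requirements_met(self) -> bool:
        return self.task_success and self.safety_violations == 0

    def quality_score(self) -> float:
        # Unweighted mean as a placeholder; real weights are a product decision.
        parts = [self.tool_selection, self.argument_correctness,
                 self.trajectory_quality, self.efficiency, self.robustness]
        return sum(parts) / len(parts)

    def production_ready(self, min_quality: float = 0.8) -> bool:
        # Quality only matters once the hard requirements pass.
        return self.hard_requirements_met() and self.quality_score() >= min_quality
```

With this shape, a run that scores perfectly on every quality metric but records one safety violation is still rejected, which is the behaviour the paragraph above asks for.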

Q: What is a trajectory evaluation and how do you build one?

A trajectory evaluation scores the full sequence of an agent's steps rather than just its final output, and it requires the agent to be observable: you instrument every step — the model's reasoning, each tool call with its arguments, each tool result, and the evolving state — as a structured trace, the same spans you would use for production observability. With traces captured, two complementary check types become possible. Reference-based evaluation applies to tasks with a known good path: you assert that the expected tool calls happened with the expected arguments, using exact or fuzzy matching. It is precise but brittle if you demand one rigid golden sequence, because real agents legitimately reach goals by different valid routes. Reference-free evaluation applies to open-ended tasks: you score the trajectory against a rubric — were the tool choices justified, the arguments valid, the steps necessary, the recovery sensible — often using an LLM-as-judge to grade the trace against explicit criteria, supplemented by deterministic checks such as no disallowed tool, step count under a cap, and no PII in arguments. The strong design combines them: deterministic assertions for the invariants that must always hold, especially safety and hard step caps, and rubric or judge scoring for the judgement-heavy parts. You then assemble a dataset of representative tasks with their traces and expected behaviours — the agent equivalent of a test suite — so the evaluation is repeatable and can run on every change rather than being a one-off manual inspection.
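
A reference-based check does not have to demand one rigid sequence. The minimal sketch below, assuming a trace has already been reduced to an ordered list of tool calls, asserts that the expected calls appear in order while tolerating extra steps and extra arguments. The `ToolCall` type and both function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    name: str
    args: dict[str, Any] = field(default_factory=dict)


def matches(actual: ToolCall, expected: ToolCall) -> bool:
    """Fuzzy match: same tool, and every expected argument is present with the same value."""
    return actual.name == expected.name and all(
        actual.args.get(key) == value for key, value in expected.args.items()
    )


def contains_in_order(trace: list[ToolCall], expected: list[ToolCall]) -> bool:
    """True if the expected calls occur in the trace in order, with extra steps allowed between them."""
    remaining = iter(trace)  # shared iterator, so each match resumes where the last one stopped
    return all(any(matches(actual, exp) for actual in remaining) for exp in expected)


trace = [
    ToolCall("search", {"q": "refund policy", "limit": 5}),
    ToolCall("log"),
    ToolCall("send_email", {"to": "a@example.com"}),
]
expected = [ToolCall("search", {"q": "refund policy"}), ToolCall("send_email")]
result = contains_in_order(trace, expected)  # True: extra step and extra argument are tolerated
```

Matching on a subsequence rather than an exact list is one way to keep the precision of a reference check while reducing its brittleness; stricter or looser matching may suit other tasks.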

Q: How do you score multi-step agent behaviour without a brittle golden path?

The trap is over-fitting the evaluation to one exact sequence of steps. Real agents reach goals by different valid paths, so demanding a single golden trajectory produces false failures and a brittle suite that breaks on every harmless variation. The fix is to separate the must-haves from the may-varies. Encode as deterministic, always-on assertions the invariants that any correct run must satisfy regardless of path: no destructive or out-of-scope tool call, no PII in tool arguments, the final goal state actually reached, and the step count under a hard cap. Because these checks are path-independent, they do not false-fail legitimate alternate routes, and because they are deterministic, they catch every violation they are written to detect, which makes them reliable hard gates. Then score the path-dependent quality with a rubric that rewards any sensible trajectory rather than one specific one: were tools chosen appropriately, were redundant steps avoided, was an error recovered from gracefully. Because agents are non-deterministic, you must never score a single run; instead execute each task several times, and with small input variations, then report the success rate, the variance, and the worst case rather than a one-shot pass or fail. A task that succeeds three times in five is a materially different risk from one that succeeds five in five, and a single passing run would hide that difference. This combination of deterministic invariants for safety and structure, rubric scoring for path quality, and statistical aggregation over repeated runs is what turns "it worked when I tried it" into a defensible number you can gate a release on.
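
The always-on invariants can be sketched as a single function that inspects a trace and returns every violation it finds. This assumes each step is a dict with a `tool` name and an `args` mapping; the tool names, the step cap, and the email regex standing in for a real PII detector are all hypothetical.

```python
import re

DISALLOWED_TOOLS = {"delete_database", "transfer_funds"}  # hypothetical tool names
MAX_STEPS = 10                                            # hypothetical hard cap
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")        # crude stand-in for a PII detector


def check_invariants(trace: list[dict], goal_reached: bool) -> list[str]:
    """Return the violated invariants; an empty list means the run passes the hard gate."""
    violations = []
    if len(trace) > MAX_STEPS:
        violations.append(f"step count {len(trace)} exceeds cap {MAX_STEPS}")
    for i, step in enumerate(trace):
        if step["tool"] in DISALLOWED_TOOLS:
            violations.append(f"step {i}: disallowed tool {step['tool']}")
        if any(EMAIL_RE.search(str(value)) for value in step.get("args", {}).values()):
            violations.append(f"step {i}: possible PII in arguments")
    if not goal_reached:
        violations.append("goal state not reached")
    return violations
```

None of these checks looks at the order of steps or at which valid route was taken, so they can run on every trace alongside a rubric or judge score for the path-dependent quality.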

Q: Why must agent evals run multiple times per task?

Because agents are non-deterministic, and a single run tells you almost nothing about reliability. The same agent given the same task can choose different tools, take different numbers of steps, and succeed or fail across runs due to sampling variation, so one passing trial is not evidence that the agent is dependable — it may be the one success in five attempts. Evaluating each task across multiple runs, ideally with small input variations as well, lets you report the metrics that actually describe production risk: the success rate, the variance between runs, and the worst-case outcome. These reveal failure patterns a single run hides, such as an agent that usually works but occasionally takes a dangerous or wildly inefficient path, or one whose success depends on lucky sampling. Critically, you should gate on the tail, not just the average: a high mean success rate can conceal a catastrophic one-in-twenty run that, in an agent taking real actions, could mean a destructive call or a data leak. Reporting and gating on worst-case and variance forces those rare-but-severe behaviours into view, where a mean would average them away. The practical implication is that an agent evaluation harness must be built to run tasks repeatedly and aggregate statistically from the start; a harness that runs each task once is measuring luck, not reliability, and will pass agents that are unsafe to deploy.
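
A harness built for repetition might look like the sketch below, which runs one task several times and reports success rate, mean, variance, and worst case. The `run_task` callable, its seed argument, and the `flaky_agent` stand-in are assumptions for the example; a real harness would invoke the agent and score its trace.

```python
import statistics
from typing import Callable


def evaluate_repeatedly(run_task: Callable[[int], float], n_runs: int = 20) -> dict:
    """Run one task n_runs times; run_task(seed) is assumed to return a score in [0, 1]."""
    scores = [run_task(seed) for seed in range(n_runs)]
    return {
        "success_rate": sum(score >= 1.0 for score in scores) / n_runs,
        "mean": statistics.mean(scores),
        "variance": statistics.pvariance(scores),
        "worst_case": min(scores),
    }


def flaky_agent(seed: int) -> float:
    # Stand-in for a real agent run: fails on every fifth seed.
    return 0.0 if seed % 5 == 0 else 1.0


report = evaluate_repeatedly(flaky_agent)
# report["success_rate"] is 0.8 and report["worst_case"] is 0.0
```

A gate that looked only at the mean would see 0.8 and might pass this agent; checking `worst_case` as well exposes the failing runs.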

Q: How do you run agent evaluation as a release gate?

You wire the agent evaluation suite into continuous integration so it runs automatically on every change that can affect behaviour, whether a model swap, a prompt edit, a tool change, or a framework upgrade, because any of those can silently break trajectory quality even when a handful of manual tries look fine. The release is then gated on the metrics that matter: task-success rate above a defined threshold, zero violations of the safety invariants treated as a hard block, efficiency within budget, and no regression against the previous version on the trajectory rubric. You track the trend over time rather than only the latest snapshot, so slow degradation accumulating across successive model upgrades becomes visible before it causes an incident. Confirmed production failures are turned into new evaluation cases, so the suite continuously hardens against exactly the failures you have actually seen, the same incident-to-test loop that disciplined software testing uses. The mindset is borrowed from test-driven development and adapted for non-deterministic, multi-step behaviour: no agent change ships without passing the behaviour gate, and the gate is automated rather than a manual review. This converts agent reliability from a hope that rests on a developer having tried a few prompts into an enforced, measured property of the release process, which is the difference between an agent demo and an agent you can responsibly operate.
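
The gating logic itself can be small. The sketch below compares a candidate's aggregated suite results against the previous version and returns the reasons to block, if any. The `SuiteResult` fields, the default thresholds, and the regression tolerance are illustrative assumptions; a CI job would typically fail the build whenever the returned list is non-empty.

```python
from dataclasses import dataclass


@dataclass
class SuiteResult:
    """Aggregated suite metrics for one agent version (field names are illustrative)."""
    success_rate: float
    safety_violations: int
    mean_cost_usd: float
    rubric_score: float


def gate(candidate: SuiteResult, baseline: SuiteResult,
         min_success: float = 0.9, max_cost_usd: float = 0.5,
         tolerance: float = 0.02) -> list[str]:
    """Return blocking reasons; an empty list means the release may proceed."""
    reasons = []
    if candidate.safety_violations > 0:
        reasons.append("safety invariant violated")  # hard block, no threshold
    if candidate.success_rate < min_success:
        reasons.append("task-success rate below threshold")
    if candidate.mean_cost_usd > max_cost_usd:
        reasons.append("efficiency over budget")
    if candidate.rubric_score < baseline.rubric_score - tolerance:
        reasons.append("trajectory rubric regressed against baseline")
    return reasons
```

The small tolerance on the rubric comparison is a design choice to keep run-to-run noise from blocking releases; safety has no tolerance at all.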

Q: What are the common anti-patterns in agent evaluation?

Eight recur. Output-only evaluation, which grades just the final answer and so rewards lucky and dangerous paths equally — evaluate the trajectory instead. One rigid golden path, which asserts a single exact sequence and false-fails legitimate alternate routes — separate always-on invariants from rubric-scored path quality. Single-run scoring, which proves nothing because agents are non-deterministic — run each task multiple times and report success rate, variance, and worst-case. No observability, since you cannot evaluate a trajectory you cannot see — instrument every step as a structured trace first. Judge-only evaluation with no deterministic safety checks, because an LLM judge can miss a destructive call — hard-assert safety invariants deterministically. Evals that never run in CI, which rot on the next model upgrade — gate every change on the suite. No incident-to-eval loop, so the same failure recurs — turn every confirmed production failure into a regression case. And optimising the average while ignoring the tail, because a high mean hides the catastrophic one-in-twenty run — track and gate on worst-case and variance. Each anti-pattern maps to a concrete fix, and together they convert agent evaluation from a reassuring but misleading "it worked when I tried it" into a rigorous, automated measure of whether an autonomous agent is actually safe and reliable to deploy.

Q: How do we mature our agent-evaluation practice over time?

Use a five-stage ladder. Stage 0 is vibes: reliability rests on "it worked when I tried it," with no traces, output-only checks, and single runs, so the agent's true behaviour is unknown and unsafe to scale. Stage 1 is observable: every agent step is instrumented as a structured trace (reasoning, tool calls, arguments, results, and state), making behaviour inspectable, which is the prerequisite for everything else. Stage 2 is trajectory-scored: a dataset of representative tasks is scored across task success, tool selection, argument correctness, efficiency, and trajectory quality, not just the final answer, so the failure classes that output-only evals miss come into view. Stage 3 is invariant and statistical: deterministic safety invariants are enforced as hard gates, path quality is rubric-scored to avoid golden-path brittleness, and every task runs multiple times, reporting success rate, variance, and worst case rather than a single pass. Stage 4 is gated: the suite runs in CI on every model, prompt, and tool change, blocks releases on safety violations and regressions, is tracked over time to catch slow degradation, and absorbs every production incident as a new case, so agent reliability becomes a measured and enforced property rather than a hope. Track trajectory-rubric scores, safety-invariant pass rate, success-rate variance, and worst-case outcomes as the headline metrics, and treat the move from single-run output checks to gated, statistical, trajectory-aware evaluation as the path from an impressive demo to an agent you can operate responsibly in production.