Foundations of Agent Evaluation
Core concepts, challenges, and frameworks for evaluating AI agents.
Compounding Errors in Multi-Step Tasks
When an agent executes a sequence of n steps, each succeeding independently with probability p, the overall success probability decays exponentially as p^n, making long-horizon task evaluation fundamentally different from single-step evaluation.
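As a rough illustration of the p^n decay, the sketch below computes end-to-end success for a few task lengths. The function name `overall_success` and the sample values are assumptions chosen for illustration, and the independence assumption is taken directly from the summary above.

```python
def overall_success(p: float, n: int) -> float:
    """Probability that all n steps succeed, assuming each step
    succeeds independently with the same probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    return p ** n


if __name__ == "__main__":
    # Illustrative per-step reliability; not a measured figure.
    for steps in (1, 5, 10, 20):
        print(steps, round(overall_success(0.95, steps), 3))
```

Under these assumptions, a per-step success rate of 0.95 gives roughly 0.60 over 10 steps and roughly 0.36 over 20, which is why a strong single-step score says little about long-horizon performance.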
Evaluation Dimensions Taxonomy
A systematic framework for the full space of agent evaluation dimensions – accuracy, cost, latency, safety, reliability, tool use, planning quality, and security – because single-metric evaluation is almost always misleading.
Evaluation-Driven Development
The most effective agent development methodology starts with a small set of real failure cases, builds evaluations around them, iterates the agent against those evaluations, and continuously expands the eval suite from production incidents – yet 29.5% of teams run no evaluations at all.
Multiple Valid Solutions
Agents solving open-ended tasks produce legitimately different solutions, making reference-based evaluation fundamentally inadequate and requiring solution-agnostic methods like test-based verification, constraint checking, and LLM-as-judge.
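A minimal sketch of constraint checking, assuming a hypothetical task in which the agent must order pipeline steps so that every prerequisite runs first. The task names, the `deps` mapping, and the `is_valid_order` helper are all invented for illustration. The point is that two different orderings both satisfy the constraints, while an exact-match comparison against a single reference would reject one of them.

```python
from typing import Dict, List, Set


def is_valid_order(order: List[str], deps: Dict[str, Set[str]]) -> bool:
    """Return True if `order` lists every task in `deps` exactly once
    and places each task after all of its prerequisites."""
    position = {task: i for i, task in enumerate(order)}
    # Reject duplicates, missing tasks, and unknown tasks.
    if len(position) != len(order) or set(position) != set(deps):
        return False
    return all(
        pre in position and position[pre] < position[task]
        for task, prerequisites in deps.items()
        for pre in prerequisites
    )


# Hypothetical dependency graph: task -> tasks that must run before it.
deps = {
    "fetch": set(),
    "parse": {"fetch"},
    "lint": {"fetch"},
    "report": {"parse", "lint"},
}

reference = ["fetch", "parse", "lint", "report"]
alternative = ["fetch", "lint", "parse", "report"]

# Both orderings satisfy the constraints, yet only one equals the reference.
print(is_valid_order(reference, deps), is_valid_order(alternative, deps))
print(alternative == reference)
```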
Outcome vs. Process Evaluation
Agent evaluation must weigh what the agent accomplished (outcome) against how it accomplished it (process), because either dimension alone can be dangerously misleading.
The Non-Determinism Problem
Agent evaluation must account for inherent randomness from LLM sampling, stochastic tool responses, and environment variability – requiring multiple runs, confidence intervals, and specialized metrics like pass^k to produce reliable results.
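The sketch below shows one common way to turn repeated runs into reliability numbers. It assumes pass^k means the probability that all k independent trials of a task succeed, and estimates it from n recorded trials with c successes as C(c, k) / C(n, k). The pass@k estimator (at least one success in k trials) and a Wilson score interval for the raw success rate are included for contrast. The function names and sample counts are illustrative assumptions, not part of any particular benchmark's API.

```python
from math import comb, sqrt
from typing import Tuple


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate of pass^k: chance that k trials drawn without
    replacement from the n recorded trials all succeeded."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate of pass@k: chance that at least one of k drawn trials succeeded."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def wilson_interval(c: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a success rate of c out of n
    (z = 1.96 gives roughly 95% coverage)."""
    if n <= 0 or not 0 <= c <= n:
        raise ValueError("require n > 0 and 0 <= c <= n")
    p_hat = c / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Illustrative counts: 8 successes in 10 runs of the same task.
print(pass_hat_k(10, 8, 2), pass_at_k(10, 8, 2), wilson_interval(8, 10))
```

With these sample counts, pass^2 comes out well below the raw 80% success rate while pass@2 comes out well above it, so which metric is reported changes the picture of reliability considerably.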
Why Agent Evaluation Is Hard
Evaluating AI agents is fundamentally harder than evaluating language models or traditional software because agents operate in open-ended environments with non-deterministic behavior, multi-step compounding errors, and multiple valid solution paths.