Foundations of Agent Evaluation

Core concepts, challenges, and frameworks for evaluating AI agents.

Compounding Errors in Multi-Step Tasks

When an agent executes a sequence of n steps, each with independent success probability p, the overall success probability is p^n, which decays exponentially in n. At p = 0.95, for example, a 20-step task succeeds only about 36% of the time. This makes long-horizon task evaluation fundamentally different from single-step evaluation.
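
To make the decay concrete, here is a minimal sketch that evaluates p^n for a few horizon lengths. It assumes steps fail independently and share the same per-step probability, which is the simplification stated above; the function name `overall_success` is illustrative only.

```python
def overall_success(p: float, n: int) -> float:
    """Probability that all n independent steps succeed, each with probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    return p ** n


# Assumed per-step success rate of 95%, chosen only for illustration.
for n in (1, 5, 10, 20, 50):
    print(f"n={n:>2}  overall success = {overall_success(0.95, n):.3f}")
```

Even a per-step rate that looks high in isolation falls below 60% by ten steps and below 10% by fifty, which is why step-level accuracy alone says little about long-horizon reliability.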

Evaluation Dimensions Taxonomy

A systematic framework for the full space of agent evaluation dimensions – accuracy, cost, latency, safety, reliability, tool use, planning quality, and security – because single-metric evaluation is almost always misleading.

Evaluation-Driven Development

The most effective agent development methodology starts with a small set of real failure cases, builds evaluations around them, iterates the agent against those evaluations, and continuously expands the eval suite from production incidents – yet 29.5% of teams run no evaluations at all.

Multiple Valid Solutions

Agents solving open-ended tasks produce legitimately different solutions, making reference-based evaluation fundamentally inadequate and requiring solution-agnostic methods like test-based verification, constraint checking, and LLM-as-judge.
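
As a rough illustration of test-based verification, the sketch below checks properties of the output rather than comparing it to one reference answer. The sorting task, the `verify_sort_solution` helper, and both candidate solutions are hypothetical stand-ins for whatever an agent might produce.

```python
from collections import Counter
from typing import Callable, Sequence


def verify_sort_solution(
    solution: Callable[[list[int]], list[int]],
    cases: Sequence[list[int]],
) -> bool:
    """Accept any solution whose output is ordered and a permutation of the input."""
    for case in cases:
        output = solution(list(case))  # pass a copy so the case is not mutated
        ordered = all(a <= b for a, b in zip(output, output[1:]))
        same_items = Counter(output) == Counter(case)
        if not (ordered and same_items):
            return False
    return True


def insertion_sort(items: list[int]) -> list[int]:
    """A second, structurally different solution to the same task."""
    result: list[int] = []
    for item in items:
        i = len(result)
        while i > 0 and result[i - 1] > item:
            i -= 1
        result.insert(i, item)
    return result


cases = [[], [1], [3, 1, 2], [2, 2, 1]]
print(verify_sort_solution(sorted, cases))              # built-in sort
print(verify_sort_solution(insertion_sort, cases))      # different implementation
print(verify_sort_solution(lambda xs: xs[:1], cases))   # drops elements
```

Both correct implementations pass despite sharing no code, while the truncating one is rejected. A textual comparison against a single reference solution would not make that distinction.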

Outcome vs. Process Evaluation

Agent evaluation must weigh what the agent accomplished (outcome) against how it accomplished it (process), because either dimension alone can be dangerously misleading.

The Non-Determinism Problem

Agent evaluation must account for inherent randomness from LLM sampling, stochastic tool responses, and environment variability – requiring multiple runs, confidence intervals, and specialized metrics like pass^k (the probability that all k independent attempts at a task succeed) to produce reliable results.
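
One common way to estimate pass^k from repeated runs is sketched below. It assumes each task was attempted `num_trials` times independently with `num_successes` successes, and uses the combinatorial estimator C(c, k) / C(n, k); the function names and the sample counts are illustrative assumptions, not taken from any particular benchmark harness.

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate the chance that k trials drawn without replacement all succeed."""
    if not 0 <= num_successes <= num_trials:
        raise ValueError("num_successes must be between 0 and num_trials")
    if not 1 <= k <= num_trials:
        raise ValueError("k must be between 1 and num_trials")
    # comb returns 0 when num_successes < k, so the estimate is 0 in that case.
    return comb(num_successes, k) / comb(num_trials, k)


def mean_pass_hat_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-task estimate over (num_trials, num_successes) pairs."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_hat_k(n, c, k) for n, c in results) / len(results)


# Hypothetical results: three tasks, eight trials each.
results = [(8, 8), (8, 6), (8, 3)]
for k in (1, 2, 4):
    print(f"pass^{k} = {mean_pass_hat_k(results, k):.3f}")
```

With k = 1 the estimate reduces to the plain success rate. As k grows it penalizes tasks the agent solves only some of the time, which is the consistency signal a single-run score hides.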

Why Agent Evaluation Is Hard

Evaluating AI agents is fundamentally harder than evaluating language models or traditional software because agents operate in open-ended environments with non-deterministic behavior, multi-step compounding errors, and multiple valid solution paths.