Trajectory & Process Analysis
Analyzing agent reasoning chains and decision processes.
Comparative Trajectory Analysis
Systematic methods for comparing agent trajectories across versions, configurations, or models to diagnose performance differences and identify regression points.
Error Recovery Evaluation
A framework for measuring how effectively agents detect, diagnose, and recover from failures encountered during task execution.
Planning Quality Assessment
Evaluating the quality of an agent’s plans before execution begins, measuring completeness, feasibility, efficiency, and robustness as predictors of downstream success.
Process Reward Models
Specialized models trained to score individual steps in an agent’s trajectory, enabling automated fine-grained evaluation of reasoning and execution quality.
Specification Gaming Detection
Methods for identifying when agents achieve stated objectives through unintended means that satisfy the evaluation metric without fulfilling the evaluator’s true intent.
Tool Use Correctness
A framework for evaluating the full lifecycle of agent tool usage, from selection through parameterization, execution, and result interpretation.
Trajectory Quality Metrics
Quantitative metrics that evaluate the quality of an agent’s step-by-step execution path, not just whether it reached the goal.