Statistical Methods
Statistical rigor, confidence intervals, and significance testing.
Confidence Intervals for Agent Metrics
Confidence intervals turn bare point estimates like “72% success rate” into informative statements like “72% ± 4.2 percentage points (95% CI),” making uncertainty explicit and comparisons honest.
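As a rough illustration of how an interval for a pass/fail metric might be computed, here is a minimal sketch using the Wilson score interval. The function name `wilson_interval` and the example counts are assumptions made for illustration, and Wilson is only one of several reasonable interval choices.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")

    p_hat = successes / trials
    denom = 1 + z ** 2 / trials
    # The Wilson center is pulled slightly toward 0.5 relative to p_hat.
    center = (p_hat + z ** 2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z ** 2 / (4 * trials ** 2)) / denom
    )
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Hypothetical data: 72 successes out of 100 evaluation runs.
    low, high = wilson_interval(72, 100)
    print(f"72/100 successes: 95% CI = [{low:.3f}, {high:.3f}]")
```

Unlike the symmetric "estimate ± margin" form, the Wilson interval is not centered exactly on the observed rate, which tends to make it better behaved for small samples or rates near 0% or 100%.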
Effect Size and Practical Significance
Statistical significance tells you whether a difference is unlikely to be explained by chance alone; effect size and practical significance tell you whether it matters – a distinction that prevents wasted deployments and missed opportunities.
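One way to put a number on "whether it matters" for two success rates is Cohen's h. The sketch below is illustrative only: the function name `cohens_h` and the example rates are assumptions, and the conventional 0.2 / 0.5 / 0.8 thresholds for small / medium / large effects are rules of thumb, not a substitute for a domain-specific judgment about practical significance.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h effect size for the difference between two proportions."""
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie in [0, 1]")
    # The arcsine transform stabilizes the variance of a proportion.
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


if __name__ == "__main__":
    # Hypothetical success rates for a candidate agent and a baseline agent.
    h = cohens_h(0.72, 0.70)
    print(f"Cohen's h = {h:.3f}")
```

With enough runs, even a difference this small can reach statistical significance, which is why the effect size is worth reporting alongside the p-value.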
Meta-Evaluation
Meta-evaluation evaluates the evaluation itself – measuring whether your benchmark suite actually discriminates between good and bad agents and has not become a stale, gameable target.
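A very simple discrimination check, sketched here under the assumption that results are stored as pass/fail outcomes per agent and task, is to look for tasks on which every agent gets the same outcome. Such tasks cannot separate good agents from bad ones. The function name `non_discriminating_tasks` and the data layout are hypothetical.

```python
def non_discriminating_tasks(results: dict[str, dict[str, bool]]) -> list[str]:
    """Return tasks on which all agents have the same pass/fail outcome.

    `results` maps agent name -> {task name -> passed}. Only tasks attempted
    by every agent are considered.
    """
    agents = list(results)
    if len(agents) < 2:
        raise ValueError("need at least two agents to assess discrimination")

    shared_tasks = set.intersection(*(set(results[a]) for a in agents))
    flat = []
    for task in sorted(shared_tasks):
        outcomes = {results[a][task] for a in agents}
        if len(outcomes) == 1:  # everyone passed or everyone failed
            flat.append(task)
    return flat


if __name__ == "__main__":
    # Hypothetical results for two agents on three tasks.
    results = {
        "agent_a": {"t1": True, "t2": True, "t3": False},
        "agent_b": {"t1": True, "t2": False, "t3": False},
    }
    print(non_discriminating_tasks(results))
```

A growing share of all-pass tasks over successive agent versions is one possible sign that a suite is saturating. This check says nothing about gaming or contamination, which need separate evidence.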
Regression Detection Statistics
Regression detection uses hypothesis testing and sequential analysis to distinguish genuine performance drops from natural variance, balancing fast detection against false alarms.
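The hypothesis-testing half of this can be sketched with a one-sided two-proportion z-test comparing a baseline run against a new run. This is a minimal illustration, not a full detection pipeline: the function name `regression_p_value` and the example counts are assumptions, and the test is valid only for a single fixed-size comparison.

```python
import math
from statistics import NormalDist


def regression_p_value(base_pass: int, base_n: int, new_pass: int, new_n: int) -> float:
    """One-sided p-value for H1: the new success rate is lower than baseline.

    Uses the pooled two-proportion z-test (normal approximation).
    """
    if base_n <= 0 or new_n <= 0:
        raise ValueError("sample sizes must be positive")

    p_base = base_pass / base_n
    p_new = new_pass / new_n
    pooled = (base_pass + new_pass) / (base_n + new_n)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / base_n + 1 / new_n))
    if std_err == 0:
        # All runs passed or all runs failed in both samples: no evidence of a drop.
        return 1.0

    z = (p_base - p_new) / std_err
    return 1 - NormalDist().cdf(z)


if __name__ == "__main__":
    # Hypothetical counts: baseline 80/100, new build 60/100.
    print(f"p = {regression_p_value(80, 100, 60, 100):.4f}")
```

Rerunning this test after every new batch of results inflates the false-alarm rate, which is the problem sequential methods are designed to control.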
Sample Size and Power Analysis
Power analysis determines how many evaluation runs you need to draw statistically valid conclusions about agent performance, balancing rigor against cost.
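For a comparison of two success rates, a common textbook approximation gives the number of runs per agent directly. The sketch below assumes a two-sided test with the normal approximation and unpooled variances; the function name `runs_per_arm` and the default `alpha` and `power` values are illustrative choices rather than recommendations.

```python
import math
from statistics import NormalDist


def runs_per_arm(p_baseline: float, p_candidate: float,
                 alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate runs needed per agent to detect p_baseline vs p_candidate."""
    if p_baseline == p_candidate:
        raise ValueError("the two rates must differ")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = p_baseline * (1 - p_baseline) + p_candidate * (1 - p_candidate)
    delta = p_baseline - p_candidate
    return math.ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)


if __name__ == "__main__":
    # Hypothetical goal: detect an improvement from 70% to 75% success.
    print(runs_per_arm(0.70, 0.75))
```

Because the required sample size scales with the inverse square of the difference, halving the smallest difference you care about roughly quadruples the cost.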
Stratified Evaluation Design
Stratified evaluation replaces misleading single aggregate scores with performance profiles across task dimensions, revealing patterns like “excellent at easy tasks, catastrophic at hard ones” that flat averages hide.
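A minimal sketch of the idea, assuming each result is recorded as a (stratum, passed) pair: group the outcomes by stratum and report one success rate per group. The function name `success_by_stratum` and the "easy"/"hard" labels are hypothetical.

```python
from collections import defaultdict


def success_by_stratum(records: list[tuple[str, bool]]) -> dict[str, float]:
    """Compute the success rate within each stratum."""
    totals = defaultdict(lambda: [0, 0])  # stratum -> [passes, runs]
    for stratum, passed in records:
        totals[stratum][0] += int(passed)
        totals[stratum][1] += 1
    return {stratum: passes / runs for stratum, (passes, runs) in totals.items()}


if __name__ == "__main__":
    # Hypothetical results: 8 easy tasks all passed, 2 hard tasks all failed.
    records = [("easy", True)] * 8 + [("hard", False)] * 2
    aggregate = sum(passed for _, passed in records) / len(records)
    print(f"aggregate: {aggregate:.0%}")
    print(success_by_stratum(records))
```

In this toy data the aggregate is 80%, while the per-stratum profile is 100% on easy tasks and 0% on hard ones.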
Variance Decomposition
Variance decomposition identifies whether evaluation noise comes from model sampling, environment instability, task difficulty spread, or evaluator inconsistency – and tells you which source to fix first.
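The simplest version of this separates just two sources: spread between tasks and run-to-run noise within a task. The sketch below assumes a balanced design (the same number of repeated runs per task) and uses population variances, so the two components sum exactly to the total. The function name `decompose_variance` and the data layout are assumptions; separating environment or evaluator effects would need additional factors in the design.

```python
from statistics import fmean, pvariance


def decompose_variance(scores_by_task: dict[str, list[float]]) -> dict[str, float]:
    """Split total score variance into between-task and within-task parts."""
    run_counts = {len(runs) for runs in scores_by_task.values()}
    if len(run_counts) != 1 or 0 in run_counts:
        raise ValueError("each task needs the same, nonzero number of runs")

    task_means = [fmean(runs) for runs in scores_by_task.values()]
    between = pvariance(task_means)  # spread of task difficulty
    within = fmean([pvariance(runs) for runs in scores_by_task.values()])  # run-to-run noise
    return {"between_task": between, "within_task": within, "total": between + within}


if __name__ == "__main__":
    # Hypothetical pass/fail scores: four repeated runs on each of two tasks.
    scores = {"task_a": [1, 1, 0, 0], "task_b": [1, 1, 1, 1]}
    print(decompose_variance(scores))
```

If the within-task component dominates, more repeated runs per task will help most; if the between-task component dominates, adding or rebalancing tasks is likely the better investment.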