Statistical Methods

Statistical rigor, confidence intervals, and significance testing.

Confidence Intervals for Agent Metrics

Confidence intervals turn bare point estimates like “72% success rate” into informative statements like “72% ± 4.2% (95% CI),” making uncertainty explicit and comparisons honest.
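As a rough illustration, the sketch below computes a normal-approximation (Wald) interval for a success rate. The function name `wald_interval` and the counts (317 successes in 440 runs) are hypothetical, chosen only so the result lands near the 72% figure quoted above; they are not taken from any real evaluation.

```python
import math
from statistics import NormalDist


def wald_interval(successes: int, trials: int, confidence: float = 0.95):
    """Return (rate, half_width) using the normal approximation."""
    rate = successes / trials
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * math.sqrt(rate * (1 - rate) / trials)
    return rate, half_width


# Hypothetical evaluation: 317 successes out of 440 runs.
rate, half_width = wald_interval(317, 440)
print(f"{rate:.1%} ± {half_width:.1%} (95% CI)")
```

The Wald interval becomes unreliable for small run counts or rates near 0% or 100%; the Wilson score interval is a common alternative in those cases.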

Effect Size and Practical Significance

Statistical significance tells you whether a difference is unlikely to be chance alone; effect size and practical significance tell you whether it matters – a distinction that prevents wasted deployments and missed opportunities.
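To make the distinction concrete, here is a minimal sketch that pairs a two-proportion z-test with Cohen's h, one common effect-size measure for proportions. The success rates and run counts are hypothetical, and the conventional reading of h below 0.2 as "small" is a rule of thumb rather than a threshold from this document.

```python
import math
from statistics import NormalDist


def two_proportion_p_value(p1, p2, n1, n2):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def cohens_h(p1, p2):
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p2)) - 2 * math.asin(math.sqrt(p1))


# Hypothetical: 72% vs 73% success, 50,000 runs per agent.
p_value = two_proportion_p_value(0.72, 0.73, 50_000, 50_000)
h = cohens_h(0.72, 0.73)
print(f"p-value = {p_value:.4f}, Cohen's h = {h:.3f}")
```

With this many runs the one-point gap is statistically significant, yet the effect size is tiny, so whether it justifies a deployment is a separate judgment.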

Meta-Evaluation

Meta-evaluation evaluates the evaluation itself – measuring whether your benchmark suite actually discriminates between good and bad agents and has not become a stale, gameable target.
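One simple way to probe discrimination is to check, per task, how much pass rates differ across agents of known differing quality. The sketch below flags tasks with almost no spread; the task names, pass rates, and the `min_spread` cutoff are all hypothetical placeholders.

```python
from statistics import pstdev

# Hypothetical pass rates per task, one entry per agent (weak to strong).
pass_rates = {
    "task_1": [0.98, 0.99, 1.00],  # nearly everyone passes
    "task_2": [0.20, 0.55, 0.85],  # separates the agents
    "task_3": [0.00, 0.01, 0.00],  # nearly everyone fails
}


def non_discriminating(pass_rates, min_spread=0.05):
    """Return tasks whose pass rates barely vary across agents."""
    return [
        task
        for task, rates in pass_rates.items()
        if pstdev(rates) < min_spread
    ]


flagged = non_discriminating(pass_rates)
print(flagged)
```

Tasks flagged this way contribute little to ranking agents and are candidates for retirement or replacement.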

Regression Detection Statistics

Regression detection uses hypothesis testing and sequential analysis to distinguish genuine performance drops from natural variance, balancing fast detection against false alarms.
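As one possible instance of sequential analysis, the sketch below applies Wald's sequential probability ratio test (SPRT) to a stream of pass/fail outcomes. The baseline rate, the regressed rate worth detecting, and the error rates `alpha` and `beta` are assumed values for illustration only.

```python
import math


def sprt(outcomes, p_baseline=0.72, p_regressed=0.62, alpha=0.05, beta=0.20):
    """Wald SPRT on a stream of 0/1 task outcomes.

    Returns (decision, runs_used), where decision is "regression",
    "no regression", or "continue" if the data ran out first.
    """
    upper = math.log((1 - beta) / alpha)  # crossing above: accept regression
    lower = math.log(beta / (1 - alpha))  # crossing below: accept baseline
    llr = 0.0  # running log-likelihood ratio, regressed vs baseline
    used = 0
    for used, success in enumerate(outcomes, start=1):
        if success:
            llr += math.log(p_regressed / p_baseline)
        else:
            llr += math.log((1 - p_regressed) / (1 - p_baseline))
        if llr >= upper:
            return "regression", used
        if llr <= lower:
            return "no regression", used
    return "continue", used


print(sprt([0] * 20))  # a run of failures
print(sprt([1] * 20))  # a run of successes
```

The test stops as soon as the evidence is strong enough in either direction, which is how it trades detection speed against false alarms: tightening `alpha` lowers false alarms at the cost of more runs before a decision.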

Sample Size and Power Analysis

Power analysis determines how many evaluation runs you need to draw statistically valid conclusions about agent performance, balancing rigor against cost.
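A minimal sketch of the calculation, assuming a two-sided comparison of two success rates with the standard normal-approximation formula. The function name `runs_per_agent` and the example rates are hypothetical.

```python
import math
from statistics import NormalDist


def runs_per_agent(p_a, p_b, alpha=0.05, power=0.80):
    """Approximate runs needed per agent to detect p_a vs p_b."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_power = NormalDist().inv_cdf(power)          # desired power
    variance = p_a * (1 - p_a) + p_b * (1 - p_b)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p_a - p_b) ** 2)


# Hypothetical: detect a 5-point gap versus a 2.5-point gap.
n_large_gap = runs_per_agent(0.72, 0.77)
n_small_gap = runs_per_agent(0.72, 0.745)
print(n_large_gap, n_small_gap)
```

Required runs scale with the inverse square of the gap, so halving the difference you want to detect roughly quadruples the cost.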

Stratified Evaluation Design

Stratified evaluation replaces misleading single aggregate scores with performance profiles across task dimensions, revealing patterns like “excellent at easy tasks, catastrophic at hard ones” that flat averages hide.
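The sketch below shows the idea on made-up data: a single aggregate score next to a per-stratum profile. The strata names ("easy", "hard") and the outcome counts are hypothetical.

```python
from collections import defaultdict

# Hypothetical results: (difficulty stratum, 1 = success / 0 = failure).
results = (
    [("easy", 1)] * 57 + [("easy", 0)] * 3
    + [("hard", 1)] * 6 + [("hard", 0)] * 34
)


def success_profile(results):
    """Success rate per stratum."""
    counts = defaultdict(lambda: [0, 0])  # stratum -> [successes, runs]
    for stratum, outcome in results:
        counts[stratum][0] += outcome
        counts[stratum][1] += 1
    return {s: succ / runs for s, (succ, runs) in counts.items()}


overall = sum(outcome for _, outcome in results) / len(results)
profile = success_profile(results)
print(f"overall: {overall:.0%}")
for stratum, rate in profile.items():
    print(f"{stratum}: {rate:.0%}")
```

Here the aggregate looks moderate while the profile shows near-perfect performance on easy tasks and frequent failure on hard ones.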

Variance Decomposition

Variance decomposition identifies whether evaluation noise comes from model sampling, environment instability, task difficulty spread, or evaluator inconsistency – and tells you which source to fix first.
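As a simplified sketch, the code below splits total score variance into a between-task part (difficulty spread) and a within-task part (run-to-run sampling noise) using the law of total variance. It assumes the same number of runs per task and covers only two of the four sources named above; separating environment instability and evaluator inconsistency would need those factors recorded too. All scores are hypothetical.

```python
from statistics import mean, pvariance

# Hypothetical scores: task -> repeated runs with different seeds.
scores = {
    "task_a": [0.9, 0.8, 1.0, 0.9],
    "task_b": [0.2, 0.6, 0.1, 0.7],
    "task_c": [0.5, 0.5, 0.6, 0.4],
}

# Within-task: average variance across repeated runs of the same task.
within = mean([pvariance(runs) for runs in scores.values()])
# Between-task: variance of the per-task mean scores.
between = pvariance([mean(runs) for runs in scores.values()])
# Total variance over every individual run.
total = pvariance([s for runs in scores.values() for s in runs])

print(f"within-task share:  {within / total:.0%}")
print(f"between-task share: {between / total:.0%}")
```

Whichever share dominates suggests where to spend effort first: more runs per task if within-task noise is large, or a more balanced task set if between-task spread is.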