Brain Drip
Module 03: 8 concepts

Automated Evaluation Methods

LLM-as-judge, rubric-based scoring, and automated metrics.

01

Agent-as-Judge

Agent-as-Judge extends LLM-as-Judge by giving the evaluator its own tools, multi-step reasoning, and environment access, so it can examine entire agent trajectories rather than only final outputs.

02

Code Execution-Based Evaluation

Code execution-based evaluation uses automated test suites as objective oracles for assessing coding agent output. It provides reproducible, scalable correctness verification, but it is limited by incomplete test coverage and by agents that game the tests.
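
As a minimal sketch of the idea, the snippet below scores a candidate function by the fraction of test cases it passes. The names pass_rate, agent_abs, and ABS_CASES are hypothetical and chosen only for illustration; a real harness would typically also sandbox execution and enforce timeouts.

```python
from typing import Callable, Iterable


def pass_rate(candidate: Callable[..., object],
              cases: Iterable[tuple[tuple, object]]) -> float:
    """Return the fraction of (args, expected) cases the candidate passes.

    Any exception raised by the candidate counts as a failed case.
    """
    cases = list(cases)
    if not cases:
        return 0.0
    passed = 0
    for args, expected in cases:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash is treated as a failure, not an evaluator error
    return passed / len(cases)


# Hypothetical agent output: an absolute-value function with a bug.
def agent_abs(x):
    return x if x >= 0 else x  # bug: negatives should return -x


# Hypothetical test suite acting as the oracle.
ABS_CASES = [((3,), 3), ((0,), 0), ((-2,), 2), ((-7,), 7)]

full_score = pass_rate(agent_abs, ABS_CASES)      # covers negative inputs
weak_score = pass_rate(agent_abs, ABS_CASES[:2])  # only non-negative inputs
```

Here full_score is 0.5 while weak_score is 1.0: the same buggy function looks perfect under the incomplete suite, which is the test-completeness limitation in miniature.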

03

Environment-State Evaluation

Environment-state evaluation assesses agent performance by checking the state of the world after the agent acts, verifying that the environment reflects the intended outcome regardless of the specific path the agent took.
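
A small sketch of a state check, assuming the environment can be modeled as a dictionary that maps paths to contents. The helpers apply_actions and state_matches, and the action format, are illustrative assumptions rather than part of any particular framework.

```python
def apply_actions(env: dict, actions: list) -> dict:
    """Replay a trajectory of (op, key, value) actions on a copy of the environment."""
    env = dict(env)
    for op, key, value in actions:
        if op == "set":
            env[key] = value
        elif op == "delete":
            env.pop(key, None)
    return env


def state_matches(env: dict, expected: dict) -> bool:
    """Check the final state only. An expected value of None means the key must be absent."""
    return all(env.get(key) == value for key, value in expected.items())


initial = {"inbox/report.txt": "draft"}

# Goal: the report ends up in archive/ and no longer exists in inbox/.
goal = {"archive/report.txt": "draft", "inbox/report.txt": None}

direct_path = [
    ("set", "archive/report.txt", "draft"),
    ("delete", "inbox/report.txt", None),
]
roundabout_path = [
    ("set", "tmp/report.txt", "draft"),
    ("delete", "inbox/report.txt", None),
    ("set", "archive/report.txt", "draft"),
    ("delete", "tmp/report.txt", None),
]
copy_only_path = [("set", "archive/report.txt", "draft")]  # forgets to remove the original

direct_ok = state_matches(apply_actions(initial, direct_path), goal)
roundabout_ok = state_matches(apply_actions(initial, roundabout_path), goal)
copy_only_ok = state_matches(apply_actions(initial, copy_only_path), goal)
```

Both the direct and the roundabout trajectory pass because only the final state is inspected; the copy-only trajectory fails because the original file is still present.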

04

Evaluation Pipeline Architecture

Evaluation pipeline architecture is the end-to-end engineering of systems that orchestrate task loading, environment provisioning, agent execution, output collection, scoring, and result aggregation into a reliable, scalable evaluation infrastructure.
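
The stages named above can be sketched as one loop, assuming tasks are plain dictionaries and the agent and scorer are ordinary callables. Every name here (run_pipeline, toy_agent, exact_match, the task fields) is hypothetical, and a production pipeline would add isolation, retries, and parallelism.

```python
from statistics import mean


def run_pipeline(tasks, agent, scorer):
    """Run each task through provisioning, execution, scoring, then aggregate."""
    results = []
    for task in tasks:                                   # task loading
        env = dict(task.get("initial_env", {}))          # environment provisioning (fresh copy per task)
        try:
            output = agent(task["prompt"], env)          # agent execution and output collection
            score = scorer(task, output, env)            # scoring
            error = None
        except Exception as exc:
            # One failing task should not abort the whole run.
            output, score, error = None, 0.0, repr(exc)
        results.append({"task_id": task["id"], "output": output,
                        "score": score, "error": error})
    mean_score = mean(r["score"] for r in results) if results else 0.0
    return {"results": results, "mean_score": mean_score}  # result aggregation


# Hypothetical agent and scorer, used only to exercise the pipeline.
def toy_agent(prompt, env):
    if prompt == "boom":
        raise RuntimeError("simulated agent crash")
    return "4"


def exact_match(task, output, env):
    return 1.0 if output == task["expected"] else 0.0


tasks = [
    {"id": "t1", "prompt": "2+2", "expected": "4"},
    {"id": "t2", "prompt": "boom", "expected": "x"},
]
report = run_pipeline(tasks, toy_agent, exact_match)
```

Recording the error per task, rather than letting it propagate, is one way to keep a long run reliable; whether a crash should score zero or be excluded from the mean is a design decision this sketch does not settle.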

05

Judge Calibration and Validation

Judge calibration and validation is the practice of systematically verifying that automated evaluators produce scores aligned with human expert judgments, detecting and mitigating biases, and monitoring judge quality over time.
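
One common way to quantify alignment between a judge and human labels is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below computes it by hand on made-up labels; the data and variable names are illustrative assumptions, not results from any real judge.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label]
                   for label in counts_a.keys() | counts_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels for eight items (illustrative only).
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]

kappa = cohens_kappa(human, judge)
raw_agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
leniency_gap = judge.count("pass") / len(judge) - human.count("pass") / len(human)
```

In this toy data raw agreement is 0.75 but kappa is only 0.5, and the judge passes 25 percentage points more items than the humans do, a simple signal of a leniency bias worth investigating.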

06

Multi-Dimensional Debate Evaluation

Multiple LLM judge agents, each representing a different evaluative dimension, debate the quality of agent output to surface issues that single-judge evaluation misses.

07

Reference-Free Evaluation

Reference-free evaluation assesses agent output quality without gold-standard answers, using methods like self-consistency checks, constraint satisfaction verification, logical coherence analysis, and execution-based testing.
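
Of the methods listed, a self-consistency check is the simplest to sketch: sample several answers to the same prompt and measure how strongly they agree, with no gold answer involved. The function name and the normalization rule (trim whitespace, lowercase) are assumptions made for this example.

```python
from collections import Counter


def self_consistency(answers):
    """Return (most common answer, share of samples that agree with it)."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    normalized = [a.strip().lower() for a in answers]
    top_answer, count = Counter(normalized).most_common(1)[0]
    return top_answer, count / len(normalized)


# Hypothetical samples from repeated runs of the same agent on one prompt.
samples = ["42", " 42", "41", "42 ", "42"]
answer, agreement = self_consistency(samples)
```

Here four of five samples agree, so agreement is 0.8. High agreement does not prove correctness, since a model can be consistently wrong, so this signal is usually combined with the other checks.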

08

Rubric Engineering

Rubric engineering is the systematic design of evaluation criteria that automated judges can apply consistently, transforming subjective quality assessments into reproducible, operationalized scoring frameworks.
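
As a rough illustration of what "operationalized" can mean, the sketch below represents a rubric as weighted criteria with explicit point scales and combines per-criterion points into one normalized score. The Criterion fields, the example rubric, and the weights are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    description: str   # what the judge is told to look for
    weight: float      # relative importance
    max_points: int    # top of this criterion's scale


def weighted_score(rubric, points):
    """Combine per-criterion points into a score in [0, 1]."""
    total_weight = sum(c.weight for c in rubric)
    total = 0.0
    for c in rubric:
        awarded = points[c.name]
        if not 0 <= awarded <= c.max_points:
            raise ValueError(f"{c.name}: {awarded} is outside 0..{c.max_points}")
        total += c.weight * awarded / c.max_points
    return total / total_weight


# Hypothetical rubric for a written answer.
RUBRIC = [
    Criterion("correctness", "All factual claims are accurate.", 0.5, 4),
    Criterion("completeness", "Every part of the question is addressed.", 0.3, 4),
    Criterion("clarity", "The answer is easy to follow.", 0.2, 2),
]

# Points a judge might assign (made up for the example).
judge_points = {"correctness": 4, "completeness": 2, "clarity": 1}
score = weighted_score(RUBRIC, judge_points)
```

With these numbers the score works out to roughly 0.75. Keeping the scale and weight on each criterion, instead of asking a judge for a single overall number, is what makes the result reproducible and auditable.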