Evaluator-Optimizer Pattern

An iterative loop where one LLM generates content and another evaluates it with structured feedback, repeating until the output meets a defined quality threshold.

Prerequisites | structured-output.md edges-and-routing.md nodes.md

What Is the Evaluator-Optimizer Pattern?

Think of a writer and editor working together on a magazine article. The writer produces a draft, the editor reads it and returns a marked-up copy with specific notes: “The opening is weak, tighten the second paragraph, the conclusion needs a call to action.” The writer revises based on that feedback and resubmits. This back-and-forth continues until the editor approves the piece for publication.

In LangGraph, the evaluator-optimizer pattern formalizes this loop as a two-node cycle. A generator node produces content, and an evaluator node grades it using with_structured_output to produce typed feedback. A conditional edge inspects the grade: if it fails the threshold, control routes back to the generator with the evaluator’s feedback appended to state. If it passes, the loop exits. The structured output ensures that feedback is machine-readable, not just free text.

This pattern is powerful for any task where quality is subjective and iterative refinement is natural – writing, code generation, data extraction, translation, and summarization.

How It Works

Defining the State

from typing import TypedDict, Annotated
from langgraph.graph.message import add_messages

class JokeState(TypedDict):
    topic: str
    joke: str
    feedback: str
    grade: str

The Evaluation Schema

from pydantic import BaseModel, Field

class JokeEvaluation(BaseModel):
    grade: str = Field(description="Either 'funny' or 'not funny'")
    feedback: str = Field(description="Specific suggestions for improvement")

Building the Generator and Evaluator Nodes

from langchain_openai import ChatOpenAI

generator_llm = ChatOpenAI(model="gpt-4o")
evaluator_llm = ChatOpenAI(model="gpt-4o").with_structured_output(JokeEvaluation)

def generate_joke(state: JokeState) -> dict:
    prompt = f"Write a joke about {state['topic']}."
    if state.get("feedback"):
        prompt += f" Previous feedback: {state['feedback']}"
    response = generator_llm.invoke(prompt)
    return {"joke": response.content}

def evaluate_joke(state: JokeState) -> dict:
    evaluation = evaluator_llm.invoke(
        f"Evaluate this joke. Is it funny?\n\nJoke: {state['joke']}"
    )
    return {"grade": evaluation.grade, "feedback": evaluation.feedback}

Wiring the Iterative Loop

from langgraph.graph import StateGraph, START, END

def route_after_evaluation(state: JokeState) -> str:
    if state["grade"] == "funny":
        return "accept"
    return "retry"

builder = StateGraph(JokeState)
builder.add_node("generate", generate_joke)
builder.add_node("evaluate", evaluate_joke)

builder.add_edge(START, "generate")
builder.add_edge("generate", "evaluate")
builder.add_conditional_edges(
    "evaluate",
    route_after_evaluation,
    {"accept": END, "retry": "generate"},
)

graph = builder.compile()
result = graph.invoke({"topic": "programming", "joke": "", "feedback": "", "grade": ""})

Adding a Safety Limit

Prevent infinite loops by tracking iterations:

class SafeJokeState(TypedDict):
    topic: str
    joke: str
    feedback: str
    grade: str
    attempts: int

def route_with_limit(state: SafeJokeState) -> str:
    if state["grade"] == "funny" or state["attempts"] >= 3:
        return "accept"
    return "retry"

Why It Matters

Measurable quality improvement – each iteration demonstrably improves output based on specific, structured feedback rather than blind regeneration.
Separation of generation and judgment – the generator and evaluator can use different models, prompts, and even temperature settings optimized for their roles.
Composable within larger systems – the evaluator-optimizer loop can be a subgraph inside a supervisor-managed multi-agent pipeline.
Structured feedback prevents drift – typed evaluation schemas ensure the generator receives actionable, consistent feedback rather than vague commentary.

Key Technical Details

The evaluator must use with_structured_output to return a typed Pydantic model, ensuring grades and feedback are machine-parseable.
The conditional edge after the evaluator inspects the grade field to decide whether to loop or exit.
The generator should include previous feedback in its prompt for meaningful iteration, not just retry blindly.
Always add a maximum iteration limit to prevent runaway loops and excessive API costs.
Different models can be used for generation versus evaluation to balance cost and quality.
The pattern works with any evaluation schema: numeric scores, pass/fail grades, multi-dimensional rubrics.
State accumulates all feedback, so later iterations have the full history of improvement suggestions.

Common Misconceptions

“The evaluator and generator must be different LLMs.” They can be the same model with different system prompts. Using separate models is an optimization, not a requirement.
“The loop always produces better output.” Without specific, actionable feedback in the evaluation schema, the generator may oscillate or degrade. The quality of the evaluation prompt matters enormously.
“This pattern is only useful for creative content.” Evaluator-optimizer loops work for code generation, data extraction, SQL queries, translations, and any task where output quality can be programmatically assessed.
“Structured output is optional for the evaluator.” Without structured output, the conditional edge cannot reliably parse the grade. Free-text evaluation breaks the routing logic.

Connections to Other Concepts

Structured Output – the with_structured_output method that powers typed evaluations
Supervisor Pattern – evaluator-optimizer loops often run as sub-agents within a supervisor system
Edges And Routing – conditional edges drive the accept/retry routing decision
Subgraph Architecture – the evaluator-optimizer loop is a natural candidate for encapsulation as a subgraph
Agent Handoffs – the loop can also be implemented using Command-based handoffs between generator and evaluator