Harness Economics & Framework Showdown

Cost models, model routing, prompt caching, SWE-bench leaderboards, and side-by-side comparisons across the major harnesses and frameworks.

Choosing Your Harness Stack

The capstone decision: pick a harness (interactive surface), decide whether you need an orchestration platform on top (multi-agent / autopilot), pick an SDK if you’re building rather than using, and lean on MCP and configuration files to keep the choice reversible — most of the cost of getting it wrong is portability cost, which is partly mitigable.

Claude Code vs. Codex CLI vs. Cursor

Side-by-side comparison of the three dominant single-developer coding harnesses in 2026 — Claude Code (terminal-first, hooks-rich, sub-agent-capable), Codex CLI (terminal-first OpenAI counterpart, simpler primitives), Cursor (IDE-tight, agent-mode autopilot, IDE-shaped extensibility).

Harness Cost Models

A harness’s cost is dominated not by per-token model price but by how often it calls the model, how aggressively it caches the prefix, when it falls back to a cheaper model, and how many sub-agents it parallelizes — these are harness-level decisions, not model-level ones.

LangGraph vs. AutoGen vs. CrewAI

Side-by-side comparison of the three dominant agent frameworks in 2026 — LangGraph (graph-based, explicit state, production-leaning), AutoGen (conversational multi-agent, dialogue-centric), CrewAI (role-based, opinionated, approachable) — each shines for different problem shapes and team backgrounds.

Model Routing in Harnesses

Model routing is the harness-layer decision of which model handles which turn — a small fast model for routing/classification, a large smart model for hard reasoning, a code-tuned model for coding subtasks; routing is the second-largest cost lever after caching, and a major source of harness differentiation.

OpenAI Agents SDK, Mastra, and Google ADK

The 2025 “second-wave” of agent SDKs — OpenAI Agents SDK, Mastra (TypeScript-first), and Google ADK (Agent Development Kit) — converged on a similar shape: opinionated agent + handoff + guardrail primitives sitting between bare API calls and a full framework like LangGraph; a useful comparison if you’re picking an SDK.

Prompt and Context Caching

Prompt caching reuses computation for repeated prefixes — system prompts, long instructions, recently-seen documents — at 5–10× cost savings on cache-hit tokens; it is the single largest cost lever in any agent system, and harness-layer prompt structure determines whether you actually capture it.

SWE-bench and Harness Leaderboards

SWE-bench is the dominant agent benchmark for software engineering tasks, and harness leaderboards (top scores published by ruflo, Aider, Devin, Cursor, OpenHands) are how the harness-layer competition is now measured — a 2026 frontier harness scoring 80%+ on SWE-bench Verified is roughly a year-over-year doubling of capability.

The 75% Savings Claim

Ruflo’s headline claim of “75% API cost savings vs. Claude Code direct” is plausible but conditional on workload — the savings come from prompt caching discipline + multi-provider routing + parallel tool calls + cheaper-model fallback; this concept audits the claim and shows where it does and doesn’t hold.