Routes the work
Decomposes a goal into a task graph and assigns each node to the model best at it. Rankings flip by task: the top model on a 5-minute job isn't the top one on a 2-hour job — so betting on a single model leaves results on the table.
Every coding agent today is a harness wrapped around one model. CodeRight wraps the best model for each job — routing work across models, reviewing it with a jury that didn't write it, and verifying the goal is actually met. Because the research keeps finding the same thing: the harness moves outcomes more than the model.
Everyone is racing to the best model. The research keeps finding the race is in the harness.
Stanford & MIT, 2025: a small model in an optimized harness out-coded every hand-built harness tested — including ones running larger models. The harness is the part that compounds. The model is the part you replace.
Wrapping a model in a harness is the right idea — everyone does it, including Anthropic. Tools differ in how far the harness reaches: how many different models it can use, whether the work is checked by something other than its author, and whether the harness improves itself.
Work is a task graph, not a chat log — and breaking long work into verified steps is, by the measurements, the highest-leverage reliability move there is. Each step goes to the model best at it, gets reviewed by something other than itself, and only completes against evidence.
Decomposes a goal into a task graph and assigns each node to the model best at it. Rankings flip by task: the top model on a 5-minute job isn't the top one on a 2-hour job — so betting on a single model leaves results on the table.
Every important artifact is validated by a jury that didn't write it. Verdicts are hash-bound — change the artifact and the verdict is void. No agent passes its own homework.
Premature completion — agents declaring victory before the work holds up — is a primary failure mode of long-running agents. So "did we finish the goal?" is its own check, run against evidence, not left to the worker that wrote the code.
Models are interchangeable. A better model slots into a role without redesigning the system. Escalation ladders try a cheap model first and climb only when the work demands it.
Decisions are first-class, write-once objects — annotated, never deleted. The runtime remembers why it's configured the way it is, across months and years.
Model-agnostic over any OpenAI-compatible endpoint. Sessions and keys stay local and encrypted. Native .NET + Avalonia — no Electron, no cloud account, no telemetry.
A model upgrade is a one-time bump you don't control. A harness that watches its own outcomes, benchmarks candidates, and promotes the winners on evidence improves every week — and that automated harness optimization is exactly what recent work showed beating hand-built harnesses. Here it runs under governance: every change is proposed, validated, and approved. Never silent.
Self-improvement is dangerous if it touches production directly. CodeRight's core rule has no exceptions: every change is a proposal, validated and human-approved before it deploys — and some things can never be touched at all.
CodeRight is in active development. Leave your email for the Windows beta — nothing else.
Investing or partnering? hello@coderight.cc
The best coding agent won't have the best model.
It'll have the best harness. Be early to it.