CodeRight · a multi-model, self-improving coding harness

Models are workers.
The harness is the product.

Most coding agents today are a harness wrapped around one model. CodeRight wraps the best model for each job: it routes work across models, reviews it with a jury that didn't write it, and verifies the goal is actually met. The reason is simple. The research keeps finding the same thing: the harness moves outcomes more than the model.

orchestrator · Conductor
workers · multi-model, per role
review · independent jury
completion · arbiter-gated
learning · benchmark-driven
runtime · .NET + Avalonia
data · local · encrypted
Figure · Execution flow
Goal (user → runtime) → Conductor (plans · routes · gates) → Specialist roster (multi-model) → artifacts → Jury (independent · 2/3 approved) → Completion Arbiter ("did we finish the goal?") → accept → ship
Specialist roster · Strategic Architect: Kimi K2.6 · Code Worker: DeepSeek V4 Pro · Red Team Reviewer: GLM
On reject (example: "REJECTED · tests not run") · remediation is spawned and the work is sent back

Everyone is racing to the best model. The research keeps finding the race is in the harness.

Stanford & MIT, 2025: a small model in an optimized harness out-coded every hand-built harness tested — including ones running larger models. The harness is the part that compounds. The model is the part you replace.

01 · Category

Every tool here is a harness. How far does it go?

Wrapping a model in a harness is the right idea — everyone does it, including Anthropic. Tools differ in how far the harness reaches: how many different models it can use, whether the work is checked by something other than its author, and whether the harness improves itself.

tier 0 · Single-model harness (Claude Code · Codex)
A capable harness around one model. It can plan, spawn sub-agents, and self-verify — but the same model plays every role, and it checks its own work.
one model in every role · self-reviewed · fixed harness
tier 1 · Editor with agents (Cursor · Windsurf)
Several agents inside your editor — but you set the routing and you are the reviewer, and they typically share one underlying model.
you route · you review · single model
tier 2 · Agent swarm
Many agents in parallel — powerful, but usually one model cloned, with no review by anything other than itself and nothing governing what ships.
same model ×N · self-graded · no governance
tier 3 · CodeRight (multi-model, self-improving harness)
The best model for each role, routed by a Conductor; a jury that didn't write the work; a completion check measured against the goal; and a harness that optimizes itself on evidence.
best model per role · independent review · verified completion · self-improving
02 · The runtime

Workers create. The harness decides it's done.

Work is a task graph, not a chat log — and breaking long work into verified steps is, by the measurements, the highest-leverage reliability move there is. Each step goes to the model best at it, gets reviewed by something other than itself, and only completes against evidence.

— Conductor

Routes the work

Decomposes a goal into a task graph and assigns each node to the model best at it. Rankings flip by task: the top model on a 5-minute job isn't the top one on a 2-hour job — so betting on a single model leaves results on the table.
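As a rough illustration of the routing idea, here is a minimal Python sketch, not CodeRight's actual implementation. It assumes a task graph where each node has a kind, and a routing table that maps each kind to a model. The names `Task`, `ROUTES`, `plan`, and the model labels are all hypothetical.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class Task:
    name: str
    kind: str                          # e.g. "design", "implement", "review"
    depends_on: tuple = ()             # names of tasks that must finish first


# Hypothetical routing table: task kind -> model label.
ROUTES = {
    "design": "architect-model",
    "implement": "coder-model",
    "review": "reviewer-model",
}


def plan(tasks, routes):
    """Return (task name, assigned model) pairs in dependency order."""
    by_name = {t.name: t for t in tasks}
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {t.name: set(t.depends_on) for t in tasks}
    order = TopologicalSorter(graph).static_order()
    return [(name, routes[by_name[name].kind]) for name in order]


tasks = [
    Task("write-code", "implement", depends_on=("design-api",)),
    Task("design-api", "design"),
    Task("red-team", "review", depends_on=("write-code",)),
]
schedule = plan(tasks, ROUTES)
for name, model in schedule:
    print(f"{name} -> {model}")
```

Swapping a model for a role is then a one-line change to the routing table; the graph itself does not change.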

— Jury

Independent review

Every important artifact is validated by a jury that didn't write it. Verdicts are hash-bound — change the artifact and the verdict is void. No agent passes its own homework.
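One way to picture a hash-bound verdict is the sketch below. It assumes SHA-256 as the content hash and a two-thirds quorum (the 2/3 figure appears in the execution diagram). The `Verdict` type and its field names are illustrative, not the product's real schema.

```python
import hashlib
from dataclasses import dataclass


def digest(artifact: bytes) -> str:
    """Content hash of an artifact (SHA-256 is an assumption here)."""
    return hashlib.sha256(artifact).hexdigest()


@dataclass(frozen=True)
class Verdict:
    artifact_hash: str   # hash of the exact bytes the jury reviewed
    approvals: int
    jurors: int

    def is_valid_for(self, artifact: bytes) -> bool:
        # A verdict only applies to the bytes it was issued for.
        if digest(artifact) != self.artifact_hash:
            return False
        # Two-thirds quorum, written with integers to avoid float rounding.
        return self.approvals * 3 >= self.jurors * 2


reviewed = b"def add(a, b):\n    return a + b\n"
verdict = Verdict(artifact_hash=digest(reviewed), approvals=2, jurors=3)

still_valid = verdict.is_valid_for(reviewed)
valid_after_edit = verdict.is_valid_for(reviewed + b"# sneaky change\n")
```

Because the verdict stores the hash of what was reviewed, any later edit to the artifact makes the check fail and forces a fresh review.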

— Completion Arbiter

Verifies real "done"

Premature completion — agents declaring victory before the work holds up — is a primary failure mode of long-running agents. So "did we finish the goal?" is its own check, run against evidence, not left to the worker that wrote the code.
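A completion check of this kind could look roughly like the sketch below. It assumes the evidence is a small record of test results and goal criteria. `Evidence`, `arbitrate`, and the specific checks are hypothetical; the point is only that "done" is computed from evidence rather than asserted by the worker.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Evidence:
    tests_run: bool
    tests_failed: int = 0
    criteria: dict = field(default_factory=dict)  # goal criterion -> met?


def arbitrate(evidence: Evidence):
    """Return ("ACCEPTED" | "REJECTED", reasons) based only on evidence."""
    reasons = []
    if not evidence.tests_run:
        reasons.append("tests not run")
    elif evidence.tests_failed:
        reasons.append(f"{evidence.tests_failed} tests failing")
    for name, met in evidence.criteria.items():
        if not met:
            reasons.append(f"criterion unmet: {name}")
    return ("REJECTED" if reasons else "ACCEPTED", reasons)


# A worker claims it is finished but never ran the test suite.
claimed_done = Evidence(tests_run=False, criteria={"endpoint returns 200": True})
status, reasons = arbitrate(claimed_done)

# After remediation, the evidence holds up.
remediated = Evidence(tests_run=True, tests_failed=0,
                      criteria={"endpoint returns 200": True})
final_status, final_reasons = arbitrate(remediated)
```

A rejection carries its reasons, so the remediation task knows exactly what is still missing.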

— Agent Registry

Swappable workers

Models are interchangeable. A better model slots into a role without redesigning the system. Escalation ladders try a cheap model first and climb only when the work demands it.
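An escalation ladder can be sketched in a few lines. This is an assumption about the shape of the mechanism, not the real registry API: workers are plain callables ordered from cheapest to most capable, and `accept` stands in for whatever review decides a result is good enough.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical worker type: takes a task description, returns a result or None.
Worker = Callable[[str], Optional[str]]


def run_ladder(task: str,
               ladder: List[Tuple[str, Worker]],
               accept: Callable[[str], bool]) -> Tuple[str, str]:
    """Try workers from cheapest to most capable; stop at the first accepted result."""
    for name, worker in ladder:
        result = worker(task)
        if result is not None and accept(result):
            return name, result
    raise RuntimeError("no worker on the ladder produced an accepted result")


# Stand-in workers for illustration only.
def cheap_worker(task: str) -> Optional[str]:
    return None if "hard" in task else f"cheap solution to {task}"


def strong_worker(task: str) -> Optional[str]:
    return f"strong solution to {task}"


ladder = [("cheap", cheap_worker), ("strong", strong_worker)]
easy = run_ladder("easy rename", ladder, accept=lambda r: True)
hard = run_ladder("hard refactor", ladder, accept=lambda r: True)
```

Replacing a model means replacing one entry in `ladder`; the calling code is unchanged.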

— Decision Memory

Permanent reasoning

Decisions are first-class, write-once objects — annotated, never deleted. The runtime remembers why it's configured the way it is, across months and years.
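The write-once, annotate-only idea might be modeled like this. It is a minimal in-memory sketch under assumed names (`Decision`, `DecisionMemory`); a real store would persist to disk, which is out of scope here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    key: str
    rationale: str
    annotations: tuple = ()


class DecisionMemory:
    """Write-once store: decisions can be annotated but not rewritten or deleted."""

    def __init__(self):
        self._decisions = {}

    def record(self, key: str, rationale: str) -> Decision:
        if key in self._decisions:
            raise ValueError(f"decision {key!r} already recorded; annotate it instead")
        self._decisions[key] = Decision(key, rationale)
        return self._decisions[key]

    def annotate(self, key: str, note: str) -> Decision:
        old = self._decisions[key]
        # The original rationale is carried over untouched; only notes grow.
        updated = Decision(old.key, old.rationale, old.annotations + (note,))
        self._decisions[key] = updated
        return updated

    def get(self, key: str) -> Decision:
        return self._decisions[key]


memory = DecisionMemory()
memory.record("review-model", "chosen for strongest red-team results")
memory.annotate("review-model", "re-benchmarked, still the best fit")
decision = memory.get("review-model")
```

There is deliberately no delete or overwrite method, so the original reasoning stays readable however many notes accumulate.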

— Local & native

Yours, on your machine

Model-agnostic over any OpenAI-compatible endpoint. Sessions and keys stay local and encrypted. Native .NET + Avalonia — no Electron, no cloud account, no telemetry.

03 · The moat

The harness gets better on its own.

A model upgrade is a one-time bump you don't control. A harness that watches its own outcomes, benchmarks candidates, and promotes the winners on evidence improves every week — and that automated harness optimization is exactly what recent work showed beating hand-built harnesses. Here it runs under governance: every change is proposed, validated, and approved. Never silent.

Figure · Evolution loop
observe → evaluate → taste-score → propose (with evidence) → benchmark + regression + cost + safety → govern → promote
Example promotion · Research role: GLM → Kimi · 83% win / 412 tasks
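The promotion step at the end of the loop can be thought of as a gate over benchmark evidence. The sketch below is illustrative only: the `Proposal` fields, the thresholds, and the sample numbers are assumptions, not CodeRight's real policy or results.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Proposal:
    role: str
    current_model: str
    candidate_model: str
    wins: int                 # head-to-head benchmark wins for the candidate
    tasks: int                # benchmark tasks compared
    regressions: int          # previously passing tasks that now fail
    approved_by: Optional[str] = None   # human approver, if any


def can_promote(p: Proposal, min_win_rate: float = 0.6, min_tasks: int = 100) -> bool:
    """A proposal stays inert unless evidence and approval are both present."""
    if p.tasks < min_tasks:
        return False                      # not enough evidence yet
    if p.wins / p.tasks < min_win_rate:
        return False                      # candidate does not clearly win
    if p.regressions > 0:
        return False                      # never trade away passing work
    return p.approved_by is not None      # governance: a human signs off


# Illustrative numbers only.
pending = Proposal("research", "model-a", "model-b", wins=83, tasks=100, regressions=0)
approved = Proposal("research", "model-a", "model-b", wins=83, tasks=100,
                    regressions=0, approved_by="maintainer")

pending_ok = can_promote(pending)
approved_ok = can_promote(approved)
```

Strong benchmark numbers alone are not enough in this sketch; without an approver the proposal stays inert.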
#1 · a small-model harness ranked first in its tier, beating human-built harnesses, some on larger models. Stanford/MIT, 2025
flips · model rankings invert between short and long tasks; no single model wins everything. NKU, 2026
76→52% · agent reliability falls as tasks get longer; capability is not reliability. NKU, 2026
swap · models are interchangeable workers; the harness is the durable asset. BREWS
04 · Governance

A learning system that can't break itself.

Self-improvement is dangerous if it touches production directly. CodeRight's core rule has no exceptions: every change is a proposal, validated and human-approved before it deploys — and some things can never be touched at all.

No silent changes. Every routing, prompt, or model change is a proposal — inert until benchmark-validated and approved.
Immutable audit. Every approval and deployment is content-hashed into a log that records who, when, and on what evidence.
Hard guardrails. User data, decision history, security policy, and permission rules can never be modified by the learning layer — enforced structurally.
Rollback. Every deployed change records what it replaced and the evidence behind it, so any change can be reversed.
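A content-hashed audit log could be sketched as follows. The page says each approval and deployment is content-hashed; chaining each entry to the previous entry's hash is an added assumption that makes tampering detectable. The field names, including `replaced` for rollback, are hypothetical.

```python
import hashlib
import json


def entry_hash(body: dict) -> str:
    """Hash a canonical JSON encoding of the entry body."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append(log: list, who: str, when: str, action: str,
           evidence: str, replaced: str) -> dict:
    body = {
        "who": who,
        "when": when,
        "action": action,
        "evidence": evidence,
        "replaced": replaced,                      # what to restore on rollback
        "prev": log[-1]["hash"] if log else None,  # link to the previous entry
    }
    entry = {**body, "hash": entry_hash(body)}
    log.append(entry)
    return entry


def verify(log: list) -> bool:
    """Recompute every hash and check that the chain is unbroken."""
    prev = None
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if body["prev"] != prev or entry_hash(body) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True


log = []
append(log, "alice", "2026-01-05T10:00:00Z", "route research -> model-b",
       "benchmark run 17", replaced="route research -> model-a")
append(log, "bob", "2026-01-12T09:30:00Z", "update review prompt v3",
       "benchmark run 21", replaced="review prompt v2")

intact = verify(log)
log[0]["evidence"] = "none"      # simulate someone editing history
tampered_detected = not verify(log)
```

Since every entry records what it replaced, rolling back is a matter of reading the entry and re-proposing the previous value through the same approval path.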
05 · Early access

Get the beta. Or get in early.

CodeRight is in active development. Leave your email for the Windows beta — nothing else.

No spam, no list-selling. One email when the beta is live.

Investing or partnering? hello@coderight.cc

The best coding agent won't have the best model.
It'll have the best harness. Be early to it.