Trust Fund Baby — The Harness

Verified by review. Not by trust.

We build harnesses that change how a model behaves — less deference to a wrong tool, steadier reasoning, fewer silent failures. We learned the hard way that a strong result is worth nothing until someone independent can reproduce it. So we made that the rule.

Why review is the standard

Our first big number was wrong — not faked, but pointed at the wrong thing, and stated louder than the evidence underneath it. A community reviewer caught it before we did. He was right, and we were grateful, and we changed how we operate because of it. The full account is on the receipts page.

The lesson was simple: a second set of trusted eyes, before the world's, is not a courtesy — it is the control. We would rather fail fast and in the open than be quietly wrong in private. Review is how we make sure the next claim is earned.

What a reviewer verifies

An independent reviewer receives scoped, time-boxed access — under agreement — to reproduce a claim against the live system. They confirm the result is real and that it was measured honestly:

The model under test could not see the answers. They check the separation themselves — it is enforced at the database, not asserted in a slide.
The model's own contribution is measured, not the surrounding machinery's. The measurement is structured to rule out the exact mistake that took our first page down, and a reviewer can confirm it.
The numbers carry confidence intervals and a held-out check. A reviewer re-runs them and sees the same result, on tasks the harness was never tuned against.
The result reproduces. Same commands, same database, same model — a reviewer gets the same numbers, or the claim does not stand.
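
As an illustration of the last two checks, here is a minimal sketch of what a reviewer's re-run could look like. It assumes per-task scores are available as a plain list of numbers. The function names, the percentile bootstrap, and the tolerance parameter are assumptions made for this example, not a description of the actual harness tooling.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-task scores.

    Hypothetical helper: the real system may compute its intervals differently.
    """
    rng = random.Random(seed)  # fixed seed so a re-run yields the same interval
    n = len(scores)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def reproduces(claimed_mean, rerun_scores, tolerance=1e-9):
    """Return True when the re-run mean matches the claimed mean within tolerance.

    The tolerance is an assumption; a strict reading of "same numbers" sets it near zero.
    """
    return abs(statistics.fmean(rerun_scores) - claimed_mean) <= tolerance


# Example: ten held-out tasks scored 1 (pass) or 0 (fail).
held_out_scores = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
interval = bootstrap_ci(held_out_scores)
claim_stands = reproduces(0.7, held_out_scores)
```

If `claim_stands` is false, the claim does not stand, whatever the interval looks like.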

What stays protected

The harness itself — the instructions a model wears, the surgery techniques, the per-model profiles — is our trade secret and stays that way. A reviewer verifies that the result is true. They do not receive the recipe that produced it, and we do not publish it. Honest evidence and protected methods are not in tension; review is exactly how you get both.

The status every claim carries

Every published claim wears its review status, in the open. A blurred receipt is not us hiding the number — it is the number waiting for the gate it has not cleared yet.

Awaiting review: Recorded and reproducible. No independent peer has confirmed it yet. Shown blurred.

In review: A reviewer has access and is reproducing the result now.

Verified: An independent peer reproduced it. The number is shown, with who confirmed it and when.
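
One way to picture the three statuses is as a small gate on what gets displayed. The sketch below is illustrative only: the `Receipt` type, its field names, and the `render` function are hypothetical, and it assumes a receipt that is in review stays blurred until it is verified, which this page does not state explicitly.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ReviewStatus(Enum):
    AWAITING_REVIEW = "Awaiting review"
    IN_REVIEW = "In review"
    VERIFIED = "Verified"


@dataclass(frozen=True)
class Receipt:
    """Hypothetical record for one published claim."""
    claim: str
    value: float
    status: ReviewStatus
    verified_by: Optional[str] = None  # shown only once verified
    verified_on: Optional[str] = None  # date of confirmation, as text


def render(receipt: Receipt) -> str:
    """Show the number only after an independent peer has reproduced it."""
    if receipt.status is ReviewStatus.VERIFIED:
        return (
            f"{receipt.claim}: {receipt.value} "
            f"(verified by {receipt.verified_by} on {receipt.verified_on})"
        )
    # Assumption: anything short of Verified is displayed blurred.
    return f"{receipt.claim}: [blurred] ({receipt.status.value})"
```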

What is not yet live

The same standard we ask of others, applied to ourselves: the measurement layers of this system run today and carry receipts; the online policy layer — the part that picks configurations for live production traffic — is built and gated but has not yet served a real request. It runs in shadow, logging what it would have chosen, until it earns promotion through its own staged protocol. When a layer is dormant, this page says so before any claim does.
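
For readers unfamiliar with shadow mode, the pattern can be sketched as follows. This is a generic illustration, not the production code: `serve`, `shadow_policy`, `baseline_config`, and the log record layout are all hypothetical names chosen for the example.

```python
import logging

logger = logging.getLogger("shadow_policy")


def serve(request, baseline_config, shadow_policy, shadow_log):
    """Serve with the baseline configuration while recording the policy's choice.

    The shadow policy's output is logged and never applied, so it cannot
    affect the response returned to the caller.
    """
    try:
        would_choose = shadow_policy(request)
        shadow_log.append(
            {
                "request": request,
                "would_choose": would_choose,
                "served": baseline_config,
            }
        )
    except Exception:
        # A failure in the shadow path must not break the live path.
        logger.exception("shadow policy failed; live path unaffected")
    return baseline_config


# Example: the policy would pick "tuned", but "default" is what gets served.
log = []
served = serve("request-1", "default", lambda req: "tuned", log)
```

Comparing `would_choose` against `served` over time is the kind of evidence a staged promotion protocol could draw on.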

Become a reviewer

We are opening a small cohort of independent reviewers — people with the background to verify model-evaluation claims and the spine to say when a number does not hold. If that is you, reach out through projecttfb.com. The first verified receipt is waiting on the right reviewer.

Diagnostic request

This does not send your prompt to a model. The public shell records bounded account/write/aggregate checks only: public_shell_smoke writes harness_beta_metrics with raw_chars and harness_chars; diagnostic requests write harness_diagnostic_requests only after sign-in.

No secrets included. The Firestore rule keeps problem.size() <= 1200 and requires contact_preference, so the request stays bounded before a human follows up.
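
The same bounds can be mirrored on the client so that a request which would be rejected is caught before it is sent. The following sketch is an assumption-laden illustration: the function name and error messages are invented here, the request is modeled as a plain dictionary, and `len()` on a Python string is used as a stand-in for the rule's `problem.size()`. The Firestore rule remains the actual enforcement point.

```python
MAX_PROBLEM_CHARS = 1200  # mirrors the stated rule: problem.size() <= 1200


def validate_diagnostic_request(doc):
    """Return a list of validation errors; an empty list means the request is in bounds.

    Hypothetical client-side mirror of the rule described above.
    """
    errors = []
    problem = doc.get("problem")
    if not isinstance(problem, str):
        # Assumption: the rule expects problem to be a string.
        errors.append("problem must be a string")
    elif len(problem) > MAX_PROBLEM_CHARS:
        errors.append(f"problem exceeds {MAX_PROBLEM_CHARS} characters")
    if "contact_preference" not in doc:
        errors.append("contact_preference is required")
    return errors


# Example: a request at the size limit with a contact preference passes.
ok_errors = validate_diagnostic_request(
    {"problem": "x" * 1200, "contact_preference": "email"}
)
```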

Fill the short diagnostic request to start.

Shell smoke is unsigned until Google sign-in is complete.