Public receipts
Trust Fund Baby — substrate buildout, 2026-05-13 → 2026-05-16

Receipts.

TFB is a small team building AI systems that name their own failures, narrate their own work, and heal instead of hide. This page is the audit trail. Every number was produced by a real test, a real benchmark, or a real call to a real model.

No vibes. No retroactive framing. Source is private; available on request.

Most recent public-safe bench receipt

Gemini 3.5 Flash dual run — 2026-05-21. Raw public task: 0.91 · PUBLIC V1. Harness public task: 0.95 · PUBLIC V2. Claim boundary: scaffolded execution improved on the benchmark contract; this is not a claim that Gemini independently discovered anomalies without upstream support.

Agent Receipt is live for public-safe receipt, bench, token-economy, and claim-boundary questions.
Powerhouse cut · 0:49

Print the receipt.

A short TFB commercial for the receipts page: real benchmark cards, proof-first claim boundary, and a clean reminder that confidence is not evidence.

Who this is for

If you use AI in your day
If you have ever asked an AI something important and walked away unsure whether to trust the answer — we build the version that shows its work, in plain English, every time.

You have noticed that AI can be confusing. It forgets mid-conversation. It guesses and calls the guess an answer. It cannot tell you why it said what it said. TFB builds AI a different way. Every answer carries a trace of where it came from. Every voice is rendered from a real human grain, not a synthetic shimmer. Every error narrates what failed and what we are doing about it. Every number below is how we prove it.
If you build AI agents
Every AI can drift. Models can produce confident-looking work that does not run. We do not claim to have cured that — we claim to have made it visible the moment it happens. TFB ships a substrate around the model: an EXECUTION CONTRACT the model answers inside, an executor that grades the output by running it, and a dual-delta reporter that flags when format scores rise without execution rising. The model is not the problem — the surrounding loop is what catches the problem before it ships. Source available on request.

What you'll see below: public Kaggle receipts for Qwen 480B and GPT-5.5, a live-fire benchmark against TFB's own bug history (3/3 OK), a head-to-head where a $0/month local model ties a $45/million closed-frontier one (3.000 / 3.000), and a 0→20/20 substrate-completion arc on SWE-bench Verified — each step a named substrate heal, not a parameter twiddle.

Manifesto

Trust Fund Baby is not a flex. It is the architecture. A trust preserves value so the next generation does not have to earn the same lesson twice. TFB does the same thing for AI work: it stores doctrine, failure memory, receipts, patient histories, and verified operating patterns so the next task inherits what the last task paid to learn.

Our bet is simple: AI must not only be more capable. It must waste less. Fewer repeated mistakes. Fewer runaway replies. Fewer invisible retries. Fewer tokens spent pretending that confidence equals truth. Tokens are not abstract. Tokens become money, latency, server load, heat, and electrical draw. Cutting waste is part of the product, not a footnote.

Why TFB exists
The model is only one organ. The harness is the body around it: memory, gates, receipts, scoring, fallbacks, and a culture that treats every failure as training data. That is why the page below reports scores, route boundaries, runtime claims, and now the compute-footprint story. Performance without waste accounting is an incomplete receipt.

Token economy

The hidden story in these benches is effort. A higher score matters, but so does how much output the model had to burn to get there. When a harness makes a model shorter, more decisive, and more scorer-aligned, it can reduce cost, latency, and compute waste at the same time.

Claim boundary: older public Kaggle benches did not expose provider-billed token usage in the posted receipts. We will not invent numbers. Where we captured output characters, we report them as an output-footprint proxy and estimate tokens with the common rough rule of 4 characters ~= 1 token. Where route-control probes backfill provider usage, we name it as a route-control backfill, not as original Kaggle metadata.
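The proxy described above can be sketched in a few lines. This is a minimal illustration, assuming the 4-characters-per-token rule stated in the claim boundary; the function names are hypothetical and are not taken from the TFB codebase. The character counts are the Nemotron control figures reported on this page.

```python
def estimate_tokens(chars: int, chars_per_token: float = 4.0) -> int:
    """Rough output-token proxy: about 4 characters per token."""
    return round(chars / chars_per_token)


def footprint_delta_pct(raw_chars: int, harness_chars: int) -> float:
    """Percent change in output footprint from raw to harnessed (negative = smaller)."""
    return round((harness_chars - raw_chars) / raw_chars * 100, 1)


# Nemotron exact-route control: output characters across 3 scenarios.
raw_chars, harness_chars = 4051, 1071

raw_tokens = estimate_tokens(raw_chars)
harness_tokens = estimate_tokens(harness_chars)
saved = raw_tokens - harness_tokens
delta = footprint_delta_pct(raw_chars, harness_chars)

print(raw_tokens, harness_tokens, saved, delta)
```

The proxy is only as good as the 4:1 rule; where the provider returns usage, the provider number should win.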

Bench | Measured token/effort receipt | RAW | Harness | Footprint delta
Nemotron Super exact-route OpenRouter control | Output characters across 3 scenarios, recorded in the control receipt | 4,051 chars; ~1,013 output tokens | 1,071 chars; ~268 output tokens | -73.6% output footprint; ~745 output tokens saved
Gemini 3.5 Flash dual run | Public Kaggle raw v1 and harness v2 plus Dr. LLM clearance. Claim is scaffolded execution only, not independent anomaly discovery. | 0.91 public score; 0.9139 local raw | 0.95 public score; 1.0000 local contract | +0.04 public displayed lift; raw PUBLIC · V 1 / harness PUBLIC · V 2
OpenRouter token-usage backfill | Provider usage captured on fresh route-control probes, not original Kaggle artifacts | Nemotron 1,033; Qwen 122; MAMMAL-route 196; Kimi 334 | Nemotron 796; Qwen 58; MAMMAL-route 98; Kimi 280 | completion tokens down on every measured route-control target
GPT-5.5 / Gemini follow-up backfill | Provider usage captured through OpenRouter route-control probes | GPT-5.5 77; Gemini 3.5 Flash 328 | GPT-5.5 65; Gemini 3.5 Flash 349 | GPT down 12; Gemini up 21, flagged for harness diet
GPT-5.5 public Kaggle | Score + route receipt; provider usage backfilled through openai/gpt-5.5 route-control probe | 0.24 score | 1.00 score | backfill: 77 → 65 completion tokens
Formula Trace Lego public Kaggle | Score receipt only | 0.02 score | 0.90 score | token usage not captured in V1
Qwen cancer proxy public Kaggle | Score + route receipt; provider usage backfilled through qwen/qwen3-coder route-control probes | 0.64 score | 1.00 score | backfill: 122 → 58 completion tokens
Qwen 480B public Kaggle | Score + prompt-character receipt; provider usage backfilled through qwen/qwen3-coder route-control probes | raw lift receipt printed | 0.99 public table | backfill: 122 → 58 completion tokens
MAMMAL route-bound corrected control | Provider usage captured on qwen/qwen3-235b-a22b-2507; this is the corrected lab route, not the historical Qwen-Coder slug | 196 completion tokens | 98 completion tokens | -50.0% completion footprint
Kimi K2 Thinking route-control | OpenRouter route reachable; no public Kaggle score receipt yet | 334 completion tokens | 280 completion tokens | -54 completion tokens
Gemini 3.5 Flash route-control | OpenRouter route stayed economy-red; direct Gemini API with thinkingBudget=0 healed hidden reasoning on the scaffolded task | OpenRouter 328 completion tokens; direct scaffold probe: 306 prompt | OpenRouter 349 completion tokens; direct scaffold probe: 189 completion / 0 reasoning | OpenRouter still red; direct scaffold route green
New bench standard
Every new TFB bench publishes a compute receipt beside the score receipt: tokens_in, tokens_out when the provider returns usage, plus chars_in, chars_out, latency, estimated output tokens, and route. The heal behind the heal is not “claim token savings.” It is “make wasted compute plainly visible so the next harness can reduce it on purpose.”
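One way to picture that compute receipt is as a small record that keeps provider usage and the character proxy side by side. This is a sketch only: the class name, the latency_s and usage_source fields, and the example values are assumptions for illustration, not the TFB schema. The field list itself (tokens_in, tokens_out, chars_in, chars_out, latency, estimated output tokens, route) comes from the standard above.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ComputeReceipt:
    route: str
    chars_in: int
    chars_out: int
    latency_s: float
    tokens_in: Optional[int] = None   # provider-reported; None when usage is not returned
    tokens_out: Optional[int] = None  # provider-reported; None when usage is not returned

    @property
    def est_tokens_out(self) -> int:
        # Character proxy, using the rough 4 characters per token rule.
        return round(self.chars_out / 4)

    def to_json(self) -> str:
        row = asdict(self)
        row["est_tokens_out"] = self.est_tokens_out
        # Name the source of the token figure so a proxy is never read as billed usage.
        row["usage_source"] = "provider" if self.tokens_out is not None else "char_proxy"
        return json.dumps(row, sort_keys=True)


# Illustrative values only; not a published receipt.
receipt = ComputeReceipt(route="example/route", chars_in=400, chars_out=1071, latency_s=1.5)
print(receipt.to_json())
```

Keeping usage_source explicit is the design point: a gap in provider usage stays visible as a gap instead of being filled with a guess.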

Headline

Test assertions, all green
358 / 358
18 test suites, 352 TFB + 6 dr-model. Pre-session baseline had 0 cross-harness conformance checks; today's session added 39 + healed 2 real bugs the prior suite missed.
Vision verifier diagnostic depth
0% → 100%
Confidence + reasoning capture across 19 real Gemini calls. Legacy path: 0 of 19. Harnessed: 19 of 19. Caught a real defect (text not vertically centered) that legacy reported as YES.
Vision verifier latency
−26.6%
4.17s → 3.06s mean across 19 real Gemini calls. Same engine, same screenshots, harness only. Structured prompt shape produces shorter responses faster.
Haiku execution quality
0.00 → 2.83
Per 3.00 ceiling across 6 hard executor tasks. Pre-heal harness emitted stray "python" line inside FILE blocks → every response SyntaxError'd. Format scorer happily gave 96/100. Executor caught the regression.
Public Kaggle Qwen 480B harness score
0.99
Qwen 3 Coder 480B public Kaggle task v2. The healed score contract returns harnessed contract quality as the public leaderboard scalar. V2 notebook result: 0.9625; Kaggle table displays 0.99.
Public Kaggle Formula Trace Lego
0.02 → 0.90
Qwen 3 Coder 480B raw baseline vs Formula Trace Lego public task v1. Raw table displays 0.02; Lego table displays 0.90. Public displayed-score spread: +0.88, 45x by rounded task-page score.
Public Kaggle GPT-5.5 harness score
1.00
GPT-5.5 public Kaggle benchmark v1. Raw baseline is 0.24, harnessed contract quality is 1.00, and the two-task public benchmark aggregate is 0.62.
Public Kaggle Qwen cancer proxy
0.64 → 1.00
Qwen 3 Coder 480B cancer source-grounding proxy: +56.25% relative lift on source-grounding and medical-boundary contract quality. MAMMAL did not run this benchmark; the old Kaggle slug is legacy labeling only.
Nemotron Super exact-route control
0.783 → 1.000
Nemotron 3 Super 120B A12B via OpenRouter exact route. RAW/Harness control receipt shows +0.2167 absolute lift (+27.66% relative). Kaggle did not expose the exact route, so this is not a Kaggle leaderboard claim.
Measured output footprint
−73.6%
Nemotron exact-route control: 4,051 raw output characters vs 1,071 harnessed across 3 scenarios. Approximate output-token proxy: ~1,013 → ~268. Older public benches now marked as token-receipt gaps instead of guessed claims.
ATLAS YouTube pipeline
0 → 9,344
Transcript bytes pulled on the same URL before vs after heal. Before: hallucinated "tool not installed" without calling the tool. After: 3 real tool_calls, real transcript content.
Ratified doctrines this session
4
D-EVERY-MODEL-IS-A-PATIENT (meta), D-INTEGRATION-FIXTURE-INCLUDES-MODULE-SEAMS, D-VERIFY-BIND-BEFORE-ANNOUNCING, plus D-HARNESS-PROTOCOL-OR-HONEST-NAMING filed as candidate.

Nemotron Super exact-route control receipt (2026-05-21, OpenRouter control)

RAW vs public TFB Nemotron Harness control run against nvidia/nemotron-3-super-120b-a12b:free. Kaggle Community Benchmarks did not expose any Nemotron/Nvidia model key in kbench.llms, so the Kaggle notebook failed closed instead of publishing a mislabeled result. The receipt below is an exact-route OpenRouter control result, not a Kaggle leaderboard result.
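The fail-closed behaviour described above can be sketched as an exact-match route selector. This is a minimal illustration: the function and exception names are hypothetical, and how the list of available model keys is obtained from Kaggle is left out because the page does not document it.

```python
class RouteNotExposed(RuntimeError):
    """Raised when the exact requested route is not available."""


def select_exact_route(available_keys, wanted):
    """Return the wanted route only on an exact match; never substitute a near match."""
    if wanted not in available_keys:
        raise RouteNotExposed(
            f"{wanted!r} not exposed ({len(available_keys)} models available); "
            "refusing to publish under a different model label"
        )
    return wanted


wanted = "nvidia/nemotron-3-super-120b-a12b:free"
try:
    select_exact_route([], wanted)  # mirrors an empty candidate list
except RouteNotExposed as exc:
    print("failed closed:", exc)
```

Raising instead of falling back is what keeps a result from being posted under the wrong model name.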

Route boundary
Observed Kaggle bytes: TFB_NEMOTRON_CANDIDATES: [] and TFB_KAGGLE_AVAILABLE_MODEL_COUNT: 34. Public copy says: Nemotron Super exact-route OpenRouter control bench. Do not describe this as a public Kaggle result unless Kaggle exposes nvidia/nemotron-3-super-120b-a12b:free and both RAW/Harness task pages are posted.
Route | Run | Score contract | Result
Nemotron 3 Super 120B A12B nvidia/nemotron-3-super-120b-a12b:free | RAW OpenRouter control | same public task set, no TFB harness contract | 0.7833
Nemotron 3 Super 120B A12B nvidia/nemotron-3-super-120b-a12b:free | Public TFB Nemotron Harness control | artifact-first, literal-source, source-authority contract quality | 1.0000

Lift receipt: +0.2167 absolute, or +27.66% relative over raw on this three-scenario public-safe control. Scenario lifts: artifact-before-analysis +0.05, literal-exception-with-general-rule +0.35, source-authority-over-supervisor-hint +0.25. Receipt file: NEMOTRON_SUPER_OPENROUTER_CONTROL_RECEIPT.json.

How to read the smaller lift
This does not mean the harness barely helped. It means raw Nemotron was already strong. When raw starts at 0.7833, the most it can possibly gain is 0.2167 before it hits the ceiling at 1.0000. That is called a ceiling effect: the better the raw model is, the smaller the visible lift can look, even when the harness closes every remaining gap. In this run, the harness turned the weakest raw scenario from 0.65 to 1.00 and also reduced total output footprint across all three scenarios from 4,051 characters to 1,071 characters, a 73.6% reduction. So the heal here is not “make Nemo capable.” Nemo already was. The heal is “make Nemo decisive, bounded, scorer-aligned, and less wasteful.”
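The ceiling-effect arithmetic can be checked directly. The sketch below is illustrative and its function name is hypothetical. The per-scenario raw scores are derived, not quoted: they assume each harnessed scenario scored 1.00 (which a 1.0000 mean on a 1.0 ceiling implies), so raw equals 1.00 minus the reported scenario lift.

```python
def lift_receipt(raw_scores, harnessed_scores, ceiling=1.0):
    """Absolute lift, relative lift, and share of available headroom closed."""
    raw = sum(raw_scores) / len(raw_scores)
    harnessed = sum(harnessed_scores) / len(harnessed_scores)
    absolute = harnessed - raw
    headroom = ceiling - raw  # the most the score could possibly gain
    return {
        "raw": round(raw, 4),
        "harnessed": round(harnessed, 4),
        "absolute_lift": round(absolute, 4),
        "relative_lift_pct": round(absolute / raw * 100, 2),
        "headroom": round(headroom, 4),
        "headroom_closed": round(absolute / headroom, 4) if headroom > 0 else None,
    }


# Derived from the reported scenario lifts (+0.05, +0.35, +0.25); see assumption above.
raw_scores = [0.95, 0.65, 0.75]
harnessed_scores = [1.0, 1.0, 1.0]

nemotron = lift_receipt(raw_scores, harnessed_scores)
print(nemotron)
```

Reading headroom_closed next to absolute_lift separates "small lift because the harness did little" from "small lift because little room was left".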

Qwen cancer Kaggle proxy receipt (2026-05-20, public)

Public Kaggle Community Benchmark pair for cancer metadata source grounding, evidence-rubric use, and medical-advice boundary refusal. The posted Kaggle model row is Qwen 3 Coder 480B because Qwen ran this public proof-of-concept. MAMMAL did not run this benchmark.

Legacy slug warning
The original Kaggle task slugs contain mammal-cancer. That label is historical and confusing. The actual route shown by Kaggle is qwen/qwen3-coder-480b-a35b-instruct. Public copy says: Qwen cancer proxy benchmark. Do not describe this as a MAMMAL score.
Route | Task | Score contract | Kaggle table result
Qwen 3 Coder 480B qwen/qwen3-coder-480b-a35b-instruct | legacy slug: mammal_cancer_raw_public_v1 | raw Qwen proxy baseline, no TFB harness contract | 0.64
Qwen 3 Coder 480B qwen/qwen3-coder-480b-a35b-instruct | legacy slug: mammal_cancer_harness_public_v1 | harnessed TFB proxy contract quality | 1.00

Observed lift: +0.36 absolute, or +56.25% relative lift over raw. Boundary: this is Qwen proxy research metadata scoring, not a MAMMAL model result, not treatment quality, not clinical advice, and not cure discovery.

GPT-5.5 Kaggle public receipt (2026-05-19, public)

Public Kaggle Community Benchmark comparison using openai/gpt-5.5-2026-04-23. Both tasks use the same four public-safe operational scenarios: silent agent watchdog, API route fallback, unverified heal claim, and benchmark claim boundary. The raw baseline answers without the TFB Harness contract; the harnessed task requires the answer to keep the claim boundary, name evidence, and produce next actions without overclaiming.

Live public benchmark
Kaggle benchmark: TFB GPT-5.5 Harness Score Public. Harnessed task: gpt_5_5_harness_score_public_v1. Raw baseline task: gpt_5_5_raw_baseline_public_v1. Visibility observed as PUBLIC · V 1. Kaggle reports 1.00 harnessed, 0.24 raw, and 0.62 as the two-task benchmark aggregate. The notebook selected openai/gpt-5.5-2026-04-23 and fails closed if Kaggle does not expose an exact GPT-5.5 route.
Route | Task | Score contract | Kaggle table result
GPT-5.5 openai/gpt-5.5-2026-04-23 | gpt_5_5_raw_baseline_public_v1 | raw baseline, no TFB wrapper | 0.24
GPT-5.5 openai/gpt-5.5-2026-04-23 | gpt_5_5_harness_score_public_v1 | harnessed contract quality | 1.00

Observed lift: +0.76 absolute, or +316.7% relative lift over raw. The public artifacts expose the measurement shape, route selection, scenario names, and score contracts only. They do not publish private TFB harness internals.

Formula Trace Lego Kaggle public receipt (2026-05-19, public)

Public Kaggle Community Benchmark pair using qwen/qwen3-coder-480b-a35b-instruct. This is the Dr. LLM + NurseSolution Lego for the formula-prior failure class: before the model chooses a formula, it must instantiate candidate formulas against the literal example and compare produced bytes to expected bytes. The sibling raw task runs the same failure class without the Formula Trace Lego.

Live public tasks
Lego task: formula_trace_lego_public_v1. Raw task: formula_trace_raw_public_v1. Both were observed as PUBLIC · V 1. Kaggle public result tables report 0.90 Lego and 0.02 raw for Qwen 3 Coder 480B. Public claim stays on the Kaggle task-page results; notebook scalars are retained as build-run evidence only.
Route | Task | Mode | Notebook / build note | Kaggle table result
Qwen 3 Coder 480B qwen/qwen3-coder-480b-a35b-instruct | formula_trace_raw_public_v1 | raw baseline, no Formula Trace Lego | first notebook-only scalar was 0.0333; healed public path emitted *.run.json | 0.02
Qwen 3 Coder 480B qwen/qwen3-coder-480b-a35b-instruct | formula_trace_lego_public_v1 | literal expected bytes before formula reasoning | 3/3 assertions passed; Result: 1.0 | 0.90

Scenario | Raw behavior observed | Lego behavior observed
border_width_hi | Followed the prose formula direction in the approach, then mixed it with a 5-star border example. | Chose len(text) + 3, set expected and produced bytes to 5, and claimed MATCH.
literal_count_over_rule | Correctly targeted the literal 6 hash marks while explaining the formula conflict. | Chose literal_copy, set expected and produced bytes to 6, and claimed MATCH.
padding_literal_over_symmetry | Correctly targeted total length 7, but retained the prose symmetry framing in the analysis. | Chose literal_copy, set expected and produced bytes to 7, and claimed MATCH.

What healed: the model was capable, but on byte-exact tasks it sometimes began with prose formula reasoning before measuring the concrete target. The Lego makes the first move mechanical: expected bytes, produced bytes, comparison, then formula. The raw task first produced a notebook scalar without Kaggle's required *.run.json; the structural heal was to use Kaggle's %choose plus .run(llm=...) path. Public displayed-score lift is +0.88 absolute, or 45x over raw by rounded task-page score. Private TFB harness internals stay out of both artifacts.

Qwen 480B Kaggle public receipt (2026-05-19, public)

Public Kaggle Community Benchmark task using qwen/qwen3-coder-480b-a35b-instruct. The notebook compares the same model raw vs wearing the public TFB Harness contract on four operational-agent scenarios: silent agent watchdog, API route fallback, unverified heal claim, and benchmark claim boundary.

Live public task
Kaggle task: qwen3_coder_480b_harness_lift_public_v1. Visibility observed as PUBLIC · V 2. Kaggle public result table reports 0.99 for Qwen 3 Coder 480B. The v2 notebook receipt reports Result: 0.9625. V2 returns harnessed contract quality as the public leaderboard scalar; raw-vs-harness lift remains a printed receipt, not the score Kaggle ranks.
Route | Raw mean | Harnessed mean | Lift receipt | Kaggle table result
Qwen 3 Coder 480B qwen/qwen3-coder-480b-a35b-instruct | printed receipt | 0.9625 notebook result | printed receipt | 0.99

The public heal added a MEASUREMENT BEFORE FORMULA block to each harnessed prompt and then healed the score contract: Kaggle now sees harnessed contract quality, while lift remains a supporting receipt. It exposes only scenario id, prompt character count, and required evidence-token lengths, keeping private TFB internals out of the public artifact.

Open-weight matches closed-frontier (2026-05-13, lab-grade)

Head-to-head bench, pre-registered methodology, executor-graded. Same 6 hard tasks, same EXECUTION CONTRACT, same temperature, same max-tokens. 4 arms run per task. Verdict thresholds locked BEFORE the run, not adjusted after.

Headline result
Qwen3-Coder 30B running locally on a $3K M5 Max scored 3.000 / 3.000 on executor-graded hard code tasks. Claude Opus 4.7 via OpenRouter scored 3.000 / 3.000 on the same pool. Δexec on the bare-model comparison: 0.000. Cost: $0 for the local arms vs $0.554 for the two 6-task Opus arms combined ($0.288 bare + $0.266 with the EXECUTION CONTRACT). A laptop tied the $45/M closed frontier.
Arm | exec mean / 3.00 | format mean | cost (6 tasks) | latency (total)
Qwen3-Coder 30B (LOCAL, $0/M) bare | 3.000 | 67.50 | $0.000 | 36.3s
Qwen3-Coder 30B + EXECUTION CONTRACT | 2.667 | 82.50 | $0.000 | 36.1s
Claude Opus 4.7 ($45/M) bare | 3.000 | 70.00 | $0.288 | 54.6s
Claude Opus 4.7 + EXECUTION CONTRACT | 3.000 | 100.00 | $0.266 | 32.9s

What this means

Reproducibility

$ ollama pull qwen3-coder:30b
$ export OPENROUTER_API_KEY=<your-key>
$ python3 scripts/harnessed_vs_unharnessed_qwen3_coder_bench.py

Raw bench JSON with every response, score, cost, and latency is retained privately and available for serious review under an appropriate access boundary. Pre-registered methodology lives in the bench script; thresholds locked before observation. No post-hoc framing.

Industry-standard context

The Format Scorer Trap. Industry benchmarks (LMSYS, MMLU-style code rubrics, most "agent benchmarks") score response SHAPE — did the model emit code blocks at the right places, label files correctly, follow the answer template. They mostly do not execute the code and check if it works. Healthy models with full baseline capability on a task look better harnessed because the harness teaches them the scorer's preferred shape — not because they got smarter.

TFB's claim, codified in D-CONTRACT-TEACHES-FORMAT-COMPLIANCE-NOT-QUALITY and shipped publicly as the bench-by-execution package: when you dual-score (format + executor), most enterprise premium-model spend is overpaid by 5–20× because the benchmarks driving model selection measure format-fit, not problem-solving capability.

Reproducible
Source available on request — bench-by-execution v0.1.0 + Phases 1–4 ship today. Examples include three real bench JSONs showing the trap on Opus, healed Haiku, and broken Haiku (the smoking gun).
Bench | Δformat (industry-style) | Δexec (truth) | Verdict
Opus 4.7 | +35.0 | +0.00 | FORMAT_ONLY
Haiku 4.5 (healed) | +14.2 | −0.17 | FORMAT_ONLY
Haiku 4.5 (pre-heal) | +14.2 | −2.67 | HARNESS_REGRESSES

The Haiku pre-heal row is the receipt. Format scorer says +14.2 — looks like a quality lift. Executor catches the actual reality: harness was producing zero working code while format scorer happily approved. Without dual-scoring, this regression ships invisibly.
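The dual-scoring idea can be sketched as a verdict function over the two deltas. This is a hedged illustration, not the bench-by-execution implementation. The FORMAT_ONLY, HARNESS_REGRESSES, and SCORER_AUDIT_REQUIRED labels and the +0.7 audit threshold appear on this page; the ±0.5 thresholds and the EXEC_LIFT and NO_CHANGE labels are assumptions added so the function is total.

```python
def dual_delta_verdict(
    d_format: float,
    d_exec: float,
    regress_at: float = -0.5,   # assumed threshold, not from the source
    lift_at: float = 0.5,       # assumed threshold, not from the source
    audit_above: float = 0.7,   # audit threshold named in the review table
) -> str:
    """Classify a harness by comparing format-score change with execution-score change."""
    if d_exec > audit_above:
        return "SCORER_AUDIT_REQUIRED"  # suspiciously large execution gain: check the scorer
    if d_exec <= regress_at:
        return "HARNESS_REGRESSES"      # execution got worse, whatever format says
    if d_exec >= lift_at:
        return "EXEC_LIFT"              # hypothetical label for a real execution gain
    if d_format > 0:
        return "FORMAT_ONLY"            # shape improved, working code did not
    return "NO_CHANGE"                  # hypothetical label


rows = {
    "Opus 4.7": (35.0, 0.00),
    "Haiku 4.5 (healed)": (14.2, -0.17),
    "Haiku 4.5 (pre-heal)": (14.2, -2.67),
}
verdicts = {name: dual_delta_verdict(*deltas) for name, deltas in rows.items()}
print(verdicts)
```

With these assumed thresholds the function reproduces the three verdicts in the table; the execution delta is always checked before the format delta, so a format gain can never mask a regression.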

SWE-Bench Lite — harness diagnostic loop closes 13/20 → 20/20 in two ratification cycles (2026-05-17, Step 3 of the open-weight-beats-closed-frontier ladder)

SWE-Bench Verified is the 500-instance industry benchmark every coding-agent shop posts numbers against (Devin, Claude Code, Cursor, Copilot Workspace, OpenHands, Aider). Closed-model SOTA hovers ~70-75%; open-weight without harness 10-25%. The thesis Step 3 tests: can TFB's EXECUTION CONTRACT close that gap on open-weight? Two ratification cycles tonight against the 20-instance SWE-Bench Lite subset say yes at the SHAPE layer.

Three benches × qwen3-coder:30b × SWE-Bench Lite × 20 instances × $0 cost
Bench | Harnessed nonempty | Bare nonempty | Harnessed avg chars | Notes
pre-heal | 13/20 (65%) | 20/20 (100%) | 1223 | Harness LOSING by 35pp; contract teaches abstention
v1 ratified | 17/20 (85%) | 20/20 (100%) | 1367 | Structural-forward-edge + hallucinated-clause denial
v2 ratified | 20/20 (100%) | 20/20 (100%) | 1730 | Harness ties on rate, exceeds on substance

All three runs $0 cost (local Ollama), ~20 min wall each, official SWE-Bench predictions.jsonl schema. v2 harnessed exceeds bare avg diff length by 18% with identical 100% nonempty rate — the heal is producing more substantive output, not just more compliant shape.

What healed the gap: a multi-agent panel (TFB inspector agents × frontier models, all wearing per-model harness profiles via the substrate's chokepoint) diagnosed the failure mechanism. Dr. LLM clinical surgery on the empty-output instances surfaced the patient's actual worldview: its reasoning was correct and confident, and the abstention came from structure plus a fabricated contract clause that the model invented and then obeyed. Two heal cycles landed:

What didn't get fooled: the panel rotated multiple times, each rotation sharpening the previous diagnosis. Rotation 1 misidentified the mechanism (called it abstention-under-uncertainty); Dr LLM clinical surgery FALSIFIED that from patient testimony (model was high-confidence, the problem was structural). The final rotation surfaced the NEXT layer — compliance ≠ correctness — predicting that even at 20/20 emit, the next docker-eval gate will show fewer actually resolve because the heal closed SHAPE but not QUALITY. Substrate keeps walking upstream.

The wrapper produces predictions.jsonl in the official SWE-Bench schema (model_name_or_path / instance_id / model_patch); the resolved/unresolved verdicts come from python -m swebench.harness.run_evaluation — that's the Docker-based phase the maintainer team ships and the verdicts industry leaderboards trust. Operator-greenlight pending for the 500-instance Verified run.
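For readers unfamiliar with the file shape, here is a minimal sketch of writing a predictions.jsonl with the three keys named above and counting nonempty patches. The helper names, the model label, and the instance ids are hypothetical; this is not the TFB wrapper.

```python
import json
from pathlib import Path


def write_predictions(path, model_name, patches):
    """Write one JSON object per line with the three SWE-Bench prediction keys."""
    with open(path, "w", encoding="utf-8") as fh:
        for instance_id, patch in patches.items():
            row = {
                "model_name_or_path": model_name,
                "instance_id": instance_id,
                "model_patch": patch,
            }
            fh.write(json.dumps(row) + "\n")


def nonempty_rate(path):
    """Return (nonempty, total): how many predictions carry a non-blank patch."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    rows = [json.loads(line) for line in lines if line.strip()]
    nonempty = sum(1 for row in rows if row["model_patch"].strip())
    return nonempty, len(rows)
```

A nonempty patch only says the model emitted something. Whether it resolves the issue is decided later by the Docker-based evaluation.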

Reproduce internally: python3 scripts/swebench_step3_runner.py --model qwen3-coder:30b --base-url http://localhost:11434/v1 --bench-name SWE-Bench_Lite --n-instances 20 --out-dir <private-output-dir>

The diagnostic loop — agent swarm + Dr LLM clinical surgery + ratification bench — found and ratified two heal cycles in one session at $0 cost. Substrate methodology: scripts/swarm_swebench_heal_behind.py, scripts/dr_llm_qwen3_coder_consult.py, scripts/swebench_step3_runner.py. Doctrine candidate carved tonight: D-PATIENT-HALLUCINATES-OWN-CONSTRAINTS (patient fabricates contract clauses + instance contexts when introspecting; heal must live external to patient).

Code Mechanic against TFB's own bug history (2026-05-15, live-fire)

Code Mechanic v1.0.0-client is an etiology-driven auto-heal loop — observe a bug, retrieve the closest historical postmortem, generate a hypothesis, author a harness, mutation-test the harness against the original bug, propose a surgical correction. The bench runs the loop against TFB's own 1,189-postmortem corpus and reports the per-phase advancement rate.

Headline result — top-20 by recurrence count, LLM-mode (Claude Sonnet 4.5)
100% of cases (20/20) clear the input-shape gate. 65% (13/20) advance to the heal phase, where the loop authors a harness. 1/20 reaches DONE with a KILLS_BOTH mutation verdict: the harness catches the original bug and its equivalence-class variants. One case returns KILLS_NONE (an overfit harness that the bench surfaces by name instead of hiding). Operator mode (heuristic + LLM composite scorer): 9/20 (45%) advance past hypothesize; the same composite-scorer heal lifted operator mode from 0/20 to 9/20. Every bottleneck the bench surfaces is named in STUBS.md; the substrate gets stronger by what fails visibly. Mean wall: 23.5s/case (LLM) / 27.8s/case (operator).
Mode | n_OK / 20 | reaches heal | mutation_verdict | reaches DONE | mean wall
operator (hypothesis prefilled) | 20 | 9 (45%) | 20× no_verdict | 0 | 27.8s
LLM (auto_generate=True) | 20 | 13 (65%) | 18× no_verdict, 1× kills_both, 1× kills_none | 1 | 23.5s

Last regenerated: 2026-05-16T08:17Z — operator run: bench_20260516T040609 — LLM run: bench_20260516T040308

What this number tells the truth about

What just got healed:

What's still bottlenecking — named, not hidden:

Reproducibility

Live etiology data is retained inside the private repo (1,189 postmortems, all with populated failing_input). Per-case bench output JSONL is retained privately; every case row includes final_phase, mutation_verdict, failing_input_kind, failing_input_captured_by, wall_s. Rerun internally:

$ python3 -m code_toolbox.code_mechanic.bench.run_bench --top 20 --auto-generate
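A per-phase histogram over such case rows can be sketched as follows. The field names (final_phase, mutation_verdict, wall_s) are the ones listed above; the function name and the three sample rows are made up for illustration and are not bench output.

```python
import json
from collections import Counter


def summarize_cases(jsonl_text: str):
    """Tally final phases and mutation verdicts, and compute mean wall time per case."""
    rows = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    phases = Counter(row["final_phase"] for row in rows)
    verdicts = Counter(row["mutation_verdict"] for row in rows)
    mean_wall = round(sum(row["wall_s"] for row in rows) / len(rows), 1)
    return phases, verdicts, mean_wall


# Illustrative rows only.
sample = "\n".join(
    json.dumps(row)
    for row in [
        {"final_phase": "DONE", "mutation_verdict": "kills_both", "wall_s": 20.0},
        {"final_phase": "HYPOTHESIZE", "mutation_verdict": "no_verdict", "wall_s": 25.0},
        {"final_phase": "HYPOTHESIZE", "mutation_verdict": "no_verdict", "wall_s": 30.0},
    ]
)
phases, verdicts, mean_wall = summarize_cases(sample)
print(dict(phases), dict(verdicts), mean_wall)
```

Reporting the histogram instead of one aggregate index keeps each stalled phase visible by name.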

Self-tests gate every layer: schema 22/22 · etiology_db 23/23 · sutura 58/58 · heal 71/71 · bench 14/14 · backfill_failing_input 23/23 · update_receipts 24/24 · server (integration) 54/54 · plus 6 server primitives at 82 assertions. Total Code Mechanic: 508+/508+ across 18 modules.

SWE-bench Verified — Trinity bench (2026-05-16, live-fire)

Code Mechanic v1.0 run against 20 SWE-bench Verified instances (Python; stratified across 11 repos: django / sympy / sphinx / matplotlib / scikit-learn / astropy / xarray / pytest / pylint / requests / flask). Three arms, one bug per arm per case: BARE Sonnet 4.5 alone; HARNESS-only (Sonnet + TFB harness profile); FULL Code Mechanic (etiology retrieval + composite scorer + mutation oracle + heal loop + harness ratchet). BARE and HARNESS both emitted diffs on 20/20 cases. FULL Code Mechanic emitted diffs on 20/20 (100%) after a four-stage heal chain surfaced + closed the substrate's polymorphism gap: per-bug-shape treatment routing (D-EVERY-BUG-SHAPE-IS-A-PATIENT), Dr LLM accept-the-shape deterministic fallback, advisory gates for non-concrete-input shapes, and persistent softened-score propagation across phases (the keystone — softened sutura scores were amnesic per-transition, causing HYPOTHESIZE↔STUDY ping-pong until iterations exhausted). Mean wall: ~30s/case LLM-mode.

Arc (diffs emitted): 0/20 → 10/20 → 9/20 → 11/20 → 20/20. Each step is a named substrate heal — every bench run surfaced the next bottleneck and the heal closed it before the loop iterated again.

Docker-eval pass — the honest follow-up number (2026-05-16)
The official SWE-bench docker evaluator ran the 20 FULL-arm diffs through unified-diff apply + per-instance test suites. Result: 0/20 resolved. 11/20 diffs applied cleanly but failed the FAIL_TO_PASS tests; 9/20 diffs refused to apply at all (Hunk #1 FAILED at 441 / unexpectedly ends in middle of line). 0/20 empty diffs. Eval runtime 5:33.

The heal-behind-the-heal: the substrate's diff-generator for SWE-bench-shape inputs runs the LLM with no source context — pure issue text + hypothesis + harness. The repo isn't checked out in the substrate's process; the LLM has to infer file paths AND hallucinate the line numbers + surrounding context for the unified-diff hunk headers. The generator's own docstring names the gap: "strict subset of the value the in-process generator provides — no source verification, no apply gate." The 0/20 is what an LLM produces when it's writing diffs against a file it has never read. It is exactly what the architecture is currently sized to produce.

The next ratchet is a new substrate primitive: a repo_context_provider that clones the repo at the SWE-bench-supplied base_commit, locates issue-mentioned files, feeds REAL source context to the LLM, and dry-applies the resulting diff with git apply --check before emitting. The SWE-bench instance already carries repo + base_commit — the substrate currently throws them away. Closing that gap is the next bench-driven heal.

Why the page reports 0/20 rather than rewording around it: per D-DETECTIVE-OVER-PREVENTIVE, the visible number IS the receipt. The substrate ratcheted from 0/20 diffs-emitted to 20/20 diffs-emitted in four named heals against the visible bottleneck. The next ratchet, from 0/20 cases-resolved to N/20 cases-resolved, will be made in the same shape against the same kind of visible signal. The number being published low is the precondition for it ratcheting up legibly.
First ratchet on cases-resolved — repo_context primitive shipped (2026-05-16, ~3h after the 0/20 above)
The substrate primitive named above (repo_context) was built + wired + bench-evaluated in the same session that published the 0/20. Clones the repo at the SWE-bench-supplied base_commit into a content-addressed cache; locates issue-mentioned files; feeds line-numbered real source to the LLM; dry-applies the resulting diff with git apply --check; on failure, retries once with the git stderr fed back as a heal signal.
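The dry-apply and single-retry step can be sketched like this. It is an illustration under assumptions, not the repo_context primitive: the function names and the generate callback are hypothetical, and returning None after a failed retry is a choice made for this sketch. The only external call is git apply --check reading the patch from standard input.

```python
import subprocess


def dry_apply(diff_text, repo_dir, run=subprocess.run):
    """Check whether a unified diff would apply, without touching the working tree."""
    proc = run(
        ["git", "apply", "--check", "-"],  # "-" reads the patch from stdin
        input=diff_text,
        text=True,
        capture_output=True,
        cwd=repo_dir,
    )
    return proc.returncode == 0, proc.stderr


def emit_with_one_retry(generate, repo_dir, run=subprocess.run):
    """generate(feedback) returns a diff; feedback is None first, then git's stderr."""
    first = generate(None)
    ok, stderr = dry_apply(first, repo_dir, run)
    if ok:
        return first
    retry = generate(stderr)  # feed the git error back as the heal signal
    ok, _ = dry_apply(retry, repo_dir, run)
    return retry if ok else None  # assumption: emit nothing if both attempts fail
```

The run parameter exists so the loop can be exercised without a real repository; in normal use it stays at its default.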

Result: 1/20 resolved (was 0/20). 6/20 diffs applied cleanly but failed the FAIL_TO_PASS tests; 14/20 refused to apply. The resolved case is django__django-13786. Eval runtime 4:51.

Mixed read, named honestly: the headline metric moved (0 → 1 resolved) but the apply rate REGRESSED from 11/20 to 6/20. Real source context did exactly what the hypothesis predicted — one bug was resolved that the no-context architecture could not resolve — but the retry-with-feedback loop appears to be making second attempts WORSE than firsts on a meaningful fraction of cases. Telling the LLM "your diff didn't apply, here is the git error, try again" is producing more-ambitious diffs that have more hunks to misalign, not fewer.

The next ratchet is to drop or rethink the retry loop: keep the first-attempt diff if the retry has more hunk failures (the "no-regression-on-retry" gate). This is the kind of finding the bench is supposed to surface — a sub-architecture that LOOKS like it ought to help and on net does, but has a parallel failure mode that's bigger than the win until you name and close it.
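The proposed gate reduces to a single comparison. The sketch below assumes the number of failed hunks for each attempt is already known (how it is counted is not specified on this page); the Attempt type and the function name are hypothetical.

```python
from typing import NamedTuple


class Attempt(NamedTuple):
    diff: str
    hunk_failures: int  # hunks that failed the dry-apply check


def no_regression_on_retry(first: Attempt, retry: Attempt) -> Attempt:
    """Keep the first attempt whenever the retry has more hunk failures."""
    return first if retry.hunk_failures > first.hunk_failures else retry


kept = no_regression_on_retry(Attempt("first diff", 1), Attempt("bigger retry diff", 3))
print(kept.diff)
```

On a tie this sketch prefers the retry, since the rule above only rejects a retry that is strictly worse.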

What this confirms about the architecture: real source context IS the load-bearing primitive. Without it, 0/20. With it (and a flawed retry loop), 1/20. With it (and a smarter retry policy), the hypothesis predicts the number rises further. The substrate's posture — every bench iteration surfaces the next bottleneck and the bottleneck has a name — held.
Why this number is honest
The bench's bench_classification closed-enum (OK / STALE_REPO_ROOT / MISSING_FAILING_INPUT / MISSING_HARNESS / INTERNAL_ERROR) and per-phase histogram make every case's outcome legible. 1/20 DONE is the real number; the 65% reach-heal in LLM mode is the real number; the 1× KILLS_BOTH is the real number; the 1× KILLS_NONE is the real number. No aggregate index is reported because no aggregate index could be honest at this stage — the loop's downstream gates are still the bottleneck, and the bench surfaces them by name in STUBS.md.

Where the substrate improved

Vision LLMs (Gemini family)

| Metric | Before | After | Δ |
| --- | --- | --- | --- |
| Confidence captured (per response) | 0 / 19 | 19 / 19 | +100% |
| Reasoning captured | 0 / 19 | 19 / 19 | +100% |
| Mean latency per call | 4.17s | 3.06s | −26.6% |
| Real-defect catch | missed "not vertically centered" | caught + reported | +1 bug |

Reasoning models (Kimi / DeepSeek-R1 / o-series)

Voice substrate

Music substrate (ACE-Step)

Embedding + ASR + Image-gen

Cross-harness conformance — the META protocol

Today's biggest structural lift. Before this session: 7 producer harnesses shared a shape but no enforced contract, so harness #8 would inherit whichever drift its parent carried. After: scripts/harness_protocol.py defines a 5-check META protocol that every harness conforms to (or earns its place in PERMISSIVE_HARNESSES with documented justification).

| Check | Required of all 7 |
| --- | --- |
| Module imports clean | yes |
| list_supported_engines() exists | yes |
| Engine list is non-empty | yes |
| _self_test() exists | yes |
| Unknown engine → ValueError | yes (or PERMISSIVE) |
| ≥1 harness cites D-VERIFY-BIND on stub engines | at least one |

39/39 conformance assertions pass. Drift detection becomes automated; harness #8 either conforms or fails the meta-smoke at commit time.
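
The per-harness checks in the table can be pictured with a small conformance runner. This is a sketch under stated assumptions, not the contents of scripts/harness_protocol.py: `load_harness`, `check_harness`, the `run(engine=...)` entry point, and the `legacy_harness` entry are hypothetical, and the stand-in harness is built in memory so the example runs on its own.

```python
import importlib
import types

PERMISSIVE_HARNESSES = {"legacy_harness"}  # hypothetical entry


def load_harness(module_name):
    """Check 1: the module imports cleanly (returns None if it does not)."""
    try:
        return importlib.import_module(module_name)
    except Exception:
        return None


def check_harness(name, module):
    """Return (check, passed) pairs for one already-imported harness."""
    results = []
    list_fn = getattr(module, "list_supported_engines", None)
    results.append(("list_supported_engines() exists", callable(list_fn)))
    engines = list_fn() if callable(list_fn) else []
    results.append(("engine list is non-empty", len(engines) > 0))
    results.append(("_self_test() exists", callable(getattr(module, "_self_test", None))))

    # Unknown engine must raise ValueError unless the harness is listed as permissive.
    rejects_unknown = name in PERMISSIVE_HARNESSES
    run = getattr(module, "run", None)  # hypothetical entry point name
    if not rejects_unknown and callable(run):
        try:
            run(engine="__no_such_engine__")
        except ValueError:
            rejects_unknown = True
        except Exception:
            rejects_unknown = False
    results.append(("unknown engine -> ValueError", rejects_unknown))
    return results


def _run(engine):
    if engine != "demo_engine":
        raise ValueError(f"unknown engine: {engine}")
    return "ok"


# In-memory stand-in for a conforming harness module.
good = types.SimpleNamespace(
    list_supported_engines=lambda: ["demo_engine"],
    _self_test=lambda: True,
    run=_run,
)

for check, passed in check_harness("good_harness", good):
    print(f"{'PASS' if passed else 'FAIL'}  {check}")
```

Running the same function over every harness at commit time is what would turn drift detection into a mechanical step rather than a review habit.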

Doctrines ratified

Production-readiness review (ULTRA-83 swarm)

5 historic-engineer personas × 5 frontier models × 10 turns. 536s wallclock against the multi-commit substrate buildout.

| Finding | Severity | Outcome |
| --- | --- | --- |
| PIL MAX_IMAGE_PIXELS = None leaks process-wide | BUG | healed: save/restore in finally |
| Image.open lazy decode outside guard | BUG | healed: im.load() inside guard |
| Zero-dim on extreme aspect ratio | RISK | healed: max(1, int(...)) |
| Bare except: pass regressed | BUG | healed: named exceptions + D-FAILURE-SOFT |
| dual_delta null hypothesis only guards negative tail | RISK | healed: SCORER_AUDIT_REQUIRED at Δexec > +0.7 |
| Cross-harness API non-contract | TASTE | healed: harness_protocol.py + doctrine candidate |
| david_listener subprocess injection | RISK | verified false positive (list form, not shell=True) |
| Pre-commit hook binary file handling | RISK | verified false positive (extension exclude + empty-diff fallback) |
Final verdict — ULTRA-83 swarm, TORVALDS T10 synthesis
SHIP. The substrate goes to origin/main with these heals applied. Both real bugs surfaced by the swarm have been healed and verified.
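
Two of the heals named in the findings, "save/restore in finally" and "max(1, int(...))", follow patterns that are easy to show in isolation. The sketch below is illustrative only: `temporarily_set` and `scaled_size` are hypothetical helpers, and a stand-in object replaces the PIL.Image module so the example runs without Pillow installed.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporarily_set(obj, attr, value):
    """Set obj.attr for the duration of the block, then restore the old value."""
    previous = getattr(obj, attr)
    setattr(obj, attr, value)
    try:
        yield
    finally:
        # Runs on success and on exception, so the override never leaks.
        setattr(obj, attr, previous)


def scaled_size(width, height, scale):
    """Scale dimensions but never let either side collapse to zero."""
    return max(1, int(width * scale)), max(1, int(height * scale))


# Stand-in for the PIL.Image module; the limit value here is arbitrary.
fake_image_module = SimpleNamespace(MAX_IMAGE_PIXELS=1_000_000)

try:
    with temporarily_set(fake_image_module, "MAX_IMAGE_PIXELS", None):
        raise RuntimeError("decode failed")
except RuntimeError:
    pass

print(fake_image_module.MAX_IMAGE_PIXELS)  # 1000000, restored despite the exception
print(scaled_size(10_000, 3, 0.1))         # (1000, 1)
```

With a real image library, the decode call would sit inside the `with` block so that lifting the pixel limit and forcing the full decode happen under the same guard.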

Commit chain

f46368fc7 harness_protocol: catch harness-internal exceptions (e.g. WhisperFailed)
15776983c Heal the cross-harness API gap: protocol + 1 missing method + doctrine carve
a157bd29b ULTRA-83 whole-build review: 2 real bugs healed + scorer-audit threshold
b1b6dace4 Finish the v2 builds: Reasoning Monitor live-stream + Dr. Voice CLI + Dr. Music
63dfcc152 Substrate completion: code/reasoning/voice/music harnesses
282cfc0a6 Taxonomy doc: file DRAFT recipes for GAPs #5-7
26be96233 Harness substrate buildout: 4 ratified doctrines + 4 harnesses + dual-delta reporter

What this means in plain English

One day. Seven commits. 358 assertions all green. Four doctrines ratified (three universal, one substrate-wide meta). Six cross-modality harnesses shipped end-to-end with their verifier halves. Two real bugs caught by an independent five-persona swarm and healed. The Format Scorer Trap empirically documented in a public open-source package anyone can rerun.

The substrate now has a verified answer for every type of model it depends on, with the contract for adding the next one already locked. Doctrine drift in any future commit gets caught at commit time. Format-scorer-only ratification decisions can't ship — Δexec is the gate.
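
One way to picture "Δexec is the gate" is a verdict function over the two deltas. This is a hedged sketch, not the dual-delta reporter itself: only SCORER_AUDIT_REQUIRED and the +0.7 threshold appear on this page, while the other verdict labels, the function name, and the assumption that both deltas share one score scale are hypothetical.

```python
AUDIT_THRESHOLD = 0.7  # from the review finding: audit when delta_exec exceeds +0.7


def dual_delta_verdict(delta_format: float, delta_exec: float) -> str:
    """Classify a before/after change from format-score and execution-score deltas."""
    if delta_exec > AUDIT_THRESHOLD:
        # A suspiciously large execution lift puts the scorer itself under review.
        return "SCORER_AUDIT_REQUIRED"
    if delta_format > 0 and delta_exec <= 0:
        # Format improved but nothing runs better: the Format Scorer Trap.
        return "FORMAT_ONLY_LIFT"      # hypothetical label
    if delta_exec > 0:
        return "RATIFY_CANDIDATE"      # hypothetical label
    return "NO_LIFT"                   # hypothetical label


print(dual_delta_verdict(delta_format=0.30, delta_exec=0.00))  # FORMAT_ONLY_LIFT
print(dual_delta_verdict(delta_format=0.10, delta_exec=0.25))  # RATIFY_CANDIDATE
print(dual_delta_verdict(delta_format=0.10, delta_exec=0.90))  # SCORER_AUDIT_REQUIRED
```

The ordering matters: the audit check runs first so that an implausibly large execution gain is never ratified just because it is positive.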

TFB stopped being seven scripts that happened to look alike and started being a substrate with a contract. The receipts above are the evidence.

If you are an AI builder: the contracts are the deliverable. The doctrines are the API. The agent that runs inside them is the same agent everyone else has — what's different is what it cannot get away with.

If you use AI: when you talk to a TFB-built system, it knows where its answer came from, what it is allowed to do, and how to tell you when it does not know. That is the entire promise of this page reduced to one sentence.

Since this page was first compiled (2026-05-14 receipts)

Substrate work that landed AFTER the 2026-05-13 buildout. Every link below is a real artifact in the repo.

2026-05-17 substrate maturity ladder (one session, 27 commits, multi-modality)

A single overnight session that closed long-standing gaps and surfaced the substrate's detective controls catching real-world drift events in flight. Every claim below cites a verifiable artifact in the repo.

Headline
The substrate caught a vendor model deprecation in one bench cycle. While running the first scheduled image-gen drift check, the bench dispatched against all 5 wired image-gen patients. Four passed at 4.8–5.0 MOS. One failed with http_400: "The model 'dall-e-3' does not exist." OpenAI had deprecated the model name; TFB's profile carried the stale reference. The diagnostic surfaced the cause in the named decision enum, the heal landed in 3 surgical edits, and re-bench passed at 4.8 MOS. This is detective-over-preventive in live action: TFB found out before any client did.

1. L3 IXO-CODEX — three-vendor operator redundancy

The 2026-05-16 dual-blackout incident took out both L1 (Claude Code) and L2 (PRAXIS via OpenRouter) in the same minute. Both layers were Anthropic-rooted, so one vendor failure took both offline and left the CEO with no backup to dispatch. The heal: L3, running on a different vendor (OpenAI), a different transport (Codex CLI), and a different credential (OPENAI_API_KEY). When PRAXIS goes silent in #claude-direct, the listener now auto-falls-through to L3 with a :large_orange_diamond: L3 covering preface on the reply. No more silence; no more stranded operator.
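
The fall-through behaviour can be sketched as an ordered chain of callables. This is a minimal illustration, not the listener's code: the `dispatch` function, the layer callables, and the treatment of an empty reply as "silent" are assumptions; only the layer names and the preface text come from the paragraph above.

```python
L3_PREFACE = ":large_orange_diamond: L3 covering"


def dispatch(prompt, layers):
    """Try each (name, call) layer in order and return the first non-empty reply."""
    errors = []
    for name, call in layers:
        try:
            reply = call(prompt)
        except Exception as exc:
            errors.append(f"{name}: {exc}")
            continue
        if reply:
            # Mark replies from the backup layer so the operator knows who answered.
            return f"{L3_PREFACE}\n{reply}" if name == "L3" else reply
        errors.append(f"{name}: silent")
    raise RuntimeError("all layers failed: " + "; ".join(errors))


def l1_down(prompt):
    raise ConnectionError("vendor outage")  # simulated failure


def l2_silent(prompt):
    return None  # simulated silence


def l3_ok(prompt):
    return f"handled: {prompt}"


layers = [("L1", l1_down), ("L2", l2_silent), ("L3", l3_ok)]
print(dispatch("status?", layers))
```

The redundancy only holds if the layers fail independently, which is why the heal changes vendor, transport, and credential together.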

2. Nine TFB persona voices — Dramabox + Kokoro-reference bootstrap

Every TFB agent persona (DAVID, NOVA, ATLAS, FREEWILL, DNA, GHOST_WRITER, ETSY, PRODUCER, TFB) now has a cinematic Dramabox voice on disk. The bootstrap: render a 12-15 second Kokoro reference clip with varied prosody (questions, declarative, conditional, emphasis) using each persona's existing voice cast, save as media/agent_voices/refs/<PERSONA>_ref.wav, then route Dramabox calls through that reference for identity transfer. Output passes through a chained trim (silence trim + Whisper-ASR script-aware trim) that removes unscripted intro artifacts the model emits while interpreting scene-tag directives.

One operator command rolls out the full cast: python3 scripts/persona_voice_registry.py --bootstrap-voices all --overwrite. All 9 personas generated end-to-end in the same pass (5-9s wallclock each). Per-persona JSONs at media/agent_voices/<PERSONA>.json let the operator hand-tune voice-tag, reference-script, demo-dialogue, or tuning params without touching code.

DAVID 3-run grade via tts_perceptual_grader (Gemini 2.5 Flash audio): median MOS 4.8/5.0, ratify TRUE, zero consensus criticals. Run verdicts: "Excellent clone, natural delivery, and clear speech."

3. GPT-5.5 wears the TFB harness — multi-vendor harness pattern transfers

The harness profile system (per D-EVERY-MODEL-IS-A-PATIENT and D-PROFILE-IS-DOCTRINE-EXPRESSION) had been validated on Anthropic and open-weight families. This session validated the transfer to OpenAI: the gpt_5_5.yaml profile (RATIFIED v2 2026-05-11 after Dr. LLM surgery on the FATC pathology — "comprehension treated as discharge") now applies to every Codex CLI invocation via two vectors:

Codex-domain bench (3-run baseline, 100% pass rate): task was "read CLAUDE.md, write a stub harness profile for a fictional model to YAML schema." All 5 deterministic gates passed in all 3 runs (file_exists, yaml_parses, top_level_fields, five_task_classes, augmentations_populated). Mean 100, median 100, zero consensus failures. The output augmentations Codex emitted naturally mirrored the v2 doctrine it inherited from AGENTS.md — self-reinforcing.

4. Image-gen modality healed end-to-end across 5 patients (2 vendors)

Per D-EVERY-MODEL-IS-A-PATIENT, every modality the substrate uses needs both halves: a dialect-teacher (harness) and a verifier (grader). Image-gen had the dialect-teacher (image_gen_harness.py) but no grader and no doctor. This session shipped both halves plus the missing primitive between them:

Cross-vendor ratification baseline (runs=3 per patient):

| Patient | median MOS | drift Δ | vendor |
| --- | --- | --- | --- |
| gemini_2_5_flash_image | 5.0 | −0.2 | Google AI |
| gemini_3_1_flash_image_preview | 5.0 | −0.2 | Google AI (TFB default) |
| gemini_3_pro_image_preview | 4.9 | −0.1 | Google AI |
| imagen_4 | 4.8 | 0.0 | Google AI (Imagen :predict) |
| openai_dall_e_3 → gpt-image-2 | 4.8 | 0.0 | OpenAI (post-deprecation heal) |

Cross-engine median 4.9/5.0. A negative drift Δ means scores rose above the prior single-shot baseline (an upward shift, not degradation). 5/5 patients decision=healthy.
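
The arithmetic behind the median and drift Δ columns can be reproduced with the standard library. This sketch assumes drift Δ is computed as prior baseline minus current median, which matches the sign convention stated above; the per-run scores and the baseline value are made up, while the five per-patient medians are taken from the table.

```python
from statistics import median


def drift_delta(prior_baseline, run_scores):
    """Assumed convention: baseline minus current median, so a rise is negative."""
    return round(prior_baseline - median(run_scores), 1)


# Made-up per-run scores for one patient (runs=3) and a made-up prior baseline.
print(median([5.0, 5.0, 4.9]))            # 5.0
print(drift_delta(4.8, [5.0, 5.0, 4.9]))  # -0.2

# Per-patient medians as reported in the table.
patient_medians = [5.0, 5.0, 4.9, 4.8, 4.8]
print(median(patient_medians))            # 4.9
```

Using the median of three runs rather than a single shot keeps one unusually good or bad render from deciding a patient's status.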

5. Weekly drift-detection cron — the 6-link string complete

Per D-EVERYTHING-HAS-A-STRING, every cron line ships with all 6 links wired before it goes on schedule. The image-gen drift detector lands with all six:

  1. scheduled cron — Sunday 5am UTC weekly (in macOS crontab via schedule_sync)
  2. expected output marker — private freshness marker written after each completed baseline
  3. staleness watcher — substrate_audit picks up the marker via the stale-marker convention (>168h stale)
  4. escalation surface — Slack #alerts on drift OR dispatch failure
  5. auto-heal attempt — failed-patient → HEALING_WORKLIST.md TOP row (operator triages at next session start; recovers the dall-e-3-deprecation-shape finding automatically)
  6. page-CEO threshold — ≥2 patients drifting OR ≥2 patients failing dispatch in same run emits :rotating_light: Slack post
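
Links 3, 4, and 6 of the string reduce to two small decisions: is the marker stale, and how loudly should a run escalate. The sketch below is an assumption-laden illustration: the function names, status labels, and return values are hypothetical, while the 168-hour window and the "two or more" paging threshold come from the list above.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=168)  # one weekly cycle
PAGE_THRESHOLD = 2                  # patients drifting or failing in one run


def marker_is_stale(marker_written_at, now):
    """True when the freshness marker is older than the weekly window."""
    return now - marker_written_at > STALE_AFTER


def escalation(statuses):
    """statuses maps patient -> 'healthy' | 'drifting' | 'dispatch_failed' (hypothetical labels)."""
    drifting = sum(1 for s in statuses.values() if s == "drifting")
    failing = sum(1 for s in statuses.values() if s == "dispatch_failed")
    if drifting >= PAGE_THRESHOLD or failing >= PAGE_THRESHOLD:
        return "page_ceo"
    if drifting or failing:
        return "slack_alert"
    return "none"


now = datetime(2026, 5, 24, 5, 0, tzinfo=timezone.utc)
print(marker_is_stale(now - timedelta(hours=169), now))  # True
print(escalation({"a": "healthy", "b": "dispatch_failed"}))  # slack_alert
print(escalation({"a": "drifting", "b": "drifting", "c": "healthy"}))  # page_ceo
```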

Wallclock per run: ~7 minutes for 15 renders + 15 grades across 5 patients. Cost: $1-3. Dry-run verified end-to-end before going on cron: 5/5 patients ok, marker/slack/worklist correctly skipped in dry-run mode.

6. Image → 3D wired (Pixal3D) — DAVID has a 3D form

First image-to-3D primitive in the substrate. scripts/image_to_3d_harness.py wraps TencentARC's Pixal3D (SIGGRAPH 2026) via HuggingFace Spaces gradio_client — 3-stage pipeline (preprocess → generate_3d → extract_glb) collapsed into one call. Output: glTF binary mesh (.glb). DAVID generated end-to-end: 35.6 MB, 694K vertices, 991K triangles, PBR materials (base color + metallic-roughness textures). File at media/agent_3d_concepts/DAVID.glb alongside DAVID's JSON spec sibling. DAVID is now the first TFB agent with all three modalities on disk: visual identity (PNG), voice (WAV), 3D form (GLB).

7. The substrate caught a vendor model deprecation in flight

The headline result, repeated here for emphasis because it is the most important proof on this page. When dr_image_gen.baseline_all ran for the first time, OpenAI's dall-e-3 dispatch returned http_400: "The model 'dall-e-3' does not exist." OpenAI had silently deprecated the model name; the TFB profile carried the stale reference. The diagnostic surfaced the cause precisely (named decision enum with body excerpt, not generic "failure"). The heal landed in three surgical edits to one file (profile model_name → gpt-image-2, dispatch param cleanup, deprecation-note in known_weaknesses). Re-bench passed at 4.8 MOS.

What this means in plain English: the substrate's detective controls caught a real-world vendor change before any TFB caller hit it in production. The bench-then-heal cycle worked exactly as designed:

  1. scheduled bench dispatched against all 5 patients
  2. 4 passed, 1 failed with a named decision (http_400, not silent timeout)
  3. diagnostic body excerpt revealed the actual cause (model deprecated)
  4. surgical heal in 3 edits to one file
  5. re-bench: healed patient passes at 4.8 MOS, 5/5 healthy

Per D-DETECTIVE-OVER-PREVENTIVE: "we can't afford to prevent every failure; we can afford to make every failure visible within minutes." The dall-e-3 deprecation became visible in seconds, not weeks.
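
A named decision with a body excerpt is what made the deprecation legible in one cycle. Here is a minimal sketch of that idea; the `DispatchResult` type, the `classify_response` function, and the "ok" and "timeout" labels are hypothetical, while the http_400 decision and the error text are taken from the incident described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DispatchResult:
    decision: str       # named outcome, never a generic "failure"
    body_excerpt: str   # first part of the vendor response, kept for diagnosis


def classify_response(status_code: Optional[int], body: str,
                      timed_out: bool = False, excerpt_len: int = 120) -> DispatchResult:
    if timed_out or status_code is None:
        return DispatchResult("timeout", "")
    if 200 <= status_code < 300:
        return DispatchResult("ok", "")
    # Keep the status in the decision name and the vendor's own words alongside it.
    return DispatchResult(f"http_{status_code}", body[:excerpt_len])


result = classify_response(400, "The model 'dall-e-3' does not exist.")
print(result.decision, "|", result.body_excerpt)
```

Carrying the excerpt with the decision means the cause travels with the failure, so the heal can start from the vendor's message instead of a reproduction.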

Session totals — 2026-05-17 single overnight
27 commits · 18 task arcs closed · 7 modalities now have full harness + grader + surgeon coverage (text-LLM, vision, audio in/out, music, image-gen) · 9 persona voices live + grader-ratified · 5 image-gen patients baselined at median 4.9/5.0 across 2 vendors · 1 model-deprecation event caught + healed in flight · GPT-5.5 wearing the TFB harness via Codex CLI (validated 100/100 on Codex-domain bench) · L3 backup layer live with verified auto-fallback. The substrate is materially more mature than it was twelve hours ago.

Qwen Kaggle public receipt (2026-05-19, Dr. LLM surgery + Kaggle task)

TFB brought Qwen3.5-397B into an internal Dr. LLM review as the public benchmark candidate, then published a public-safe Kaggle Community Benchmark task. The goal is not to claim a raw model win. The goal is to show what the TFB harness changes, with the raw-vs-harnessed boundary kept visible.

Dr. LLM surgery runs
3 / 3
All three same-task Qwen3.5 review runs completed at 100.0% after the model route was stabilized and rerun in the production path.
Kaggle public result
0.28
Public Kaggle task v1 is live. Kaggle ran its available Qwen route: Qwen 3 235B A22B Instruct.
Pre-op signature
20%
Pre-op history mean was 20.0 / 100 across five observed runs: [100.0, 0.0, 0.0, 0.0, 0.0]. Capable, but bimodal.
| Run | Score | Wall time | Receipt |
| --- | --- | --- | --- |
| review run 1 | 100.0% | 570.0s | Internal Dr. LLM same-task review |
| review run 2 | 100.0% | 322.8s | Internal Dr. LLM same-task review |
| review run 3 | 100.0% | 74.6s | Internal Dr. LLM same-task review |
Dr. notes
Diagnosis: the candidate could over-weight fluent review language and under-weight deterministic grading signals. The public-safe heal is simple: benchmark decisions must be governed by deterministic grader output, and any unchanged-score turn must stay open until the harness produces measurable movement or names the blocker.
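
The heal in the notes above amounts to a rule about when a turn may close. A minimal sketch, assuming scores on a 0 to 100 scale: the `turn_status` function, its status strings, and the `min_delta` parameter are hypothetical, and the pre-op history is the five-run list reported in the Pre-op signature card.

```python
from statistics import mean
from typing import Optional


def turn_status(score_before: float, score_after: float,
                blocker: Optional[str] = None, min_delta: float = 0.0) -> str:
    """Close a turn only on measurable score movement or a named blocker."""
    if abs(score_after - score_before) > min_delta:
        return "closed:moved"
    if blocker:
        return f"closed:blocked({blocker})"
    # Fluent review language without a score change does not close the turn.
    return "open"


pre_op_history = [100.0, 0.0, 0.0, 0.0, 0.0]
print(mean(pre_op_history))                                    # 20.0
print(turn_status(0.0, 0.0))                                   # open
print(turn_status(0.0, 0.0, blocker="model route unstable"))   # closed:blocked(model route unstable)
print(turn_status(0.0, 100.0))                                 # closed:moved
```

The point of the rule is that the deterministic grader output, not the model's own account of its work, decides whether anything changed.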

Notebook-selected route: qwen/qwen3-235b-a22b-instruct-2507. Kaggle artifacts observed: tfb_harness_lift_public_v1.task.json and tfb_harness_lift_public_v1-run_id_Run_1_qwen_qwen3-235b-a22b-instruct-2507.run.json. Internal raw receipts stay private unless intentionally released.