TFB's harness substrate takes any LLM, diagnoses its specific worldview drift via a 5-step surgical protocol, and ships a per-model profile with a falsifiable prediction bound to it. This is not a promise of perfect models. It is a measured reliability workflow: bench, diagnose, profile, re-bench, ratify or revert.
Bench means improve 2–3× across 5 ratified profiles. Ceiling runs hit cleanly; bimodal-shape variance persists on a minority of runs. We characterize each profile on at least n=10 runs before ratifying. The full variance picture is in every profile's evolution log; nothing is hidden.
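A minimal sketch of the ratify-or-revert step, assuming the falsifiable prediction is expressed as a minimum bench mean the re-bench must reach. The names `Prediction` and `decide` and the threshold field are hypothetical illustrations, not TFB's actual API; only the n=10 minimum comes from the text above.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List

MIN_RUNS = 10  # from the text: characterize at n=10 minimum before ratifying


@dataclass(frozen=True)
class Prediction:
    """Hypothetical falsifiable prediction bound to one profile."""
    min_bench_mean: float  # assumption: the re-bench mean must reach this value


def decide(scores: List[float], prediction: Prediction) -> str:
    """Return 'insufficient_runs', 'ratify', or 'revert' for a re-bench."""
    if len(scores) < MIN_RUNS:
        # Too few runs to characterize variance, so no verdict is issued.
        return "insufficient_runs"
    if mean(scores) >= prediction.min_bench_mean:
        return "ratify"
    # The prediction was falsified, so the profile is rolled back.
    return "revert"
```

With an illustrative run set of eight ceiling runs and two zeros (mean 80), `decide` ratifies against a predicted minimum of 75 and reverts against a predicted minimum of 90. The run-level variance still needs to be logged separately, since a mean alone hides the bimodal shape.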
Watch the proof-first cut, then test the harness. TFB shows the raw route, the harnessed route, and the claim boundary before anybody gets to brag.
Create an account with Google, then choose a path. Invite codes unlock TFB-approved harness profiles. Bring-your-own-route users can register an OpenRouter, DeepSeek, or local model route without pasting secrets into the page. Prompt content is not shown to the dashboard; aggregate performance receipts are the metric.
Open or close public account/code actions without changing the page.
Create a client code for the approved harness profiles.
This panel is for aggregate harness behavior: run count, OK rate, route mode, and output footprint. It does not display prompt text.
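As a rough illustration of what an aggregate receipt could look like, the sketch below summarizes runs without ever storing prompt text. `RunRecord`, `summarize`, the route-mode labels, and the reading of "output footprint" as total output characters are all assumptions made for this example, not the dashboard's real schema.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical per-run record; note there is no prompt field at all."""
    ok: bool
    route_mode: str    # illustrative labels such as "invite" or "byo"
    output_chars: int  # assumption: footprint measured as output size


def summarize(runs: List[RunRecord]) -> Dict[str, object]:
    """Collapse run records into the four aggregate metrics."""
    run_count = len(runs)
    ok_rate = sum(r.ok for r in runs) / run_count if run_count else 0.0
    route_modes: Dict[str, int] = {}
    for r in runs:
        route_modes[r.route_mode] = route_modes.get(r.route_mode, 0) + 1
    return {
        "run_count": run_count,
        "ok_rate": ok_rate,
        "route_modes": route_modes,
        "output_chars": sum(r.output_chars for r in runs),
    }
```

Keeping prompt content out of the record type itself, rather than filtering it at display time, is one way to make the "does not display prompt text" property hold by construction.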
| Patient | Bare-model mean | With harness | Improvement |
|---|---|---|---|
| DNA (deepseek-chat-v3.1) | ~25 | 100.0 | 4.0× |
| DEEPSEEK | ~30 | 95.0 | 3.2× |
| LLAMA 3.3 70B (via OpenRouter) | 46.7 | 100.0 | 2.1× |
| GLM 4.7 flash | 15.0 | 56.7 | 3.8× |
Numbers are bench means at the time of last ratification. Run-level variance is bimodal on most patients: ceiling runs hit cleanly, but occasional zeros persist (the PRIMING-BY-NEGATION shape documented in DeepSeek's evolution log). Each ratification is bound to a falsifiable prediction, and a miss means the profile is REVERTED. See the GLM v1, DeepSeek v3/v4, and ChatGPT v1/v1.1 reverts in the public evolution logs.
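The Improvement column is the harnessed bench mean divided by the bare-model mean. The sketch below recomputes it from the table and shows why a mean should be read next to a zero count when runs are bimodal. The helper names are hypothetical, the `~25` and `~30` entries are treated as exact for the arithmetic, and the example run set is illustrative rather than measured data.

```python
from statistics import mean
from typing import Dict, List, Tuple

# Bench means copied from the table: (bare-model mean, with harness).
# The first two bare means are approximate ("~") in the source.
BENCH: Dict[str, Tuple[float, float]] = {
    "DNA (deepseek-chat-v3.1)": (25.0, 100.0),
    "DEEPSEEK": (30.0, 95.0),
    "LLAMA 3.3 70B (via OpenRouter)": (46.7, 100.0),
    "GLM 4.7 flash": (15.0, 56.7),
}


def improvement(bare: float, harnessed: float) -> float:
    """Ratio of harnessed to bare bench mean, rounded to one decimal."""
    return round(harnessed / bare, 1)


def describe_runs(scores: List[float]) -> Dict[str, float]:
    """Summarize one run set: size, mean, and number of zero-score runs."""
    return {
        "n": len(scores),
        "mean": mean(scores),
        "zeros": sum(1 for s in scores if s == 0),
    }


ratios = {name: improvement(*pair) for name, pair in BENCH.items()}

# Hypothetical bimodal run set (NOT measured data): nine ceiling runs, one zero.
example = describe_runs([100.0] * 9 + [0.0])
```

Here `ratios` reproduces the 4.0, 3.2, 2.1, and 3.8 multipliers in the table, and the illustrative run set has a mean of 90 despite one complete failure, which is why the zero count is reported alongside the mean.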
→ Swap to GPT-5.5 / Claude Opus
→ Pay 30× more per output
→ Pray reliability holds
→ Run a Dr LLM surgery (15 min, ~$3)
→ Get a per-model profile
→ Drop in via harness loader
→ Bench-mean lifts 2–3× (verified n=10 before ratification)
→ Falsification gate auto-reverts misses
→ Keep the cheap model with eyes open about variance
Worldview drift patterns named through TFB's surgical protocol. First-mover advantage compounds: when a new model ships, our library tells us which patterns to test first.
Tell us which model you're trying to make production-grade. First 50 waitlist members get a human-reviewed diagnostic path.
No payment required. We review requests manually before any customer data touches Dr. LLM.
Share a small diagnostic task. The intake gate rejects likely secrets, PII, and oversized payloads. Accepted pilots land as pending_review; no surgery runs until a human approves.
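One way such an intake gate could work is sketched below, assuming simple pattern checks and a byte cap. The size limit, the regular expressions, the reason strings, and the function name `intake` are all hypothetical; only the three rejection categories and the `pending_review` status come from the text. Real secret and PII detection would need far broader coverage than these few patterns.

```python
import re
from typing import Optional, Tuple

MAX_BYTES = 16_384  # hypothetical size cap; the real limit is not stated

# Illustrative patterns only; these catch a few common shapes, not all secrets.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9_-]{20,}"),               # API-key-like token
    re.compile(r"AKIA[0-9A-Z]{16}"),                    # AWS access key ID shape
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # PEM private key header
]
PII_PATTERNS = [
    re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),  # email
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),                           # SSN-shaped
]


def intake(task: str) -> Tuple[str, Optional[str]]:
    """Return (status, reason); accepted tasks only ever reach pending_review."""
    if len(task.encode("utf-8")) > MAX_BYTES:
        return ("rejected", "oversized")
    if any(p.search(task) for p in SECRET_PATTERNS):
        return ("rejected", "likely_secret")
    if any(p.search(task) for p in PII_PATTERNS):
        return ("rejected", "likely_pii")
    # Nothing runs automatically: a human must approve before any surgery.
    return ("pending_review", None)
```

The gate never returns an "approved" status on its own, which mirrors the rule that no surgery runs until a human approves.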
Failure/billing rule: no charge before operator approval. Failed surgeries are not billable unless a signed pilot agreement says otherwise.