A correctness system for making AI-generated code reliable

Catch AI-generated code that passes tests but would fail in production.

Stop using code-verification approaches that waste your time and still let errors get through undetected.

Free. No new tools. Works inside Claude Code. Just correctness. Enforced through a harness.

Ch. 01  ·  The diagnosis

Test-passing errors are not anomalies.

They are a structural property of AI-driven code development.

Scene 01 · Title

Example 01

How test-passing errors arise

Tests pass.

The code is still wrong.

Scene 02 · Requirement
Ticket · AUTH-142 Open
Opened by @jess · 3 days ago
Reject expired tokens
Scene 03 · Interpretation
Ticket · AUTH-142 Open
Opened by @jess · 3 days ago
Reject expired tokens
AI code generator
expires_at < now()
Scene 04 · The behaviour
Faced with ambiguity, the model picked instead of asking.
expires_at < now()
Scene 05 · As code
That pick becomes the function.
if (expires_at < now()) reject()
Scene 06 · As test
That pick becomes the test.
expect(expires_at < now()).toReject()
Scene 07 · The pattern
Everything agrees. Still wrong.
Requirement (ambiguous English)
Reject expired tokens
Specification
expires_at < now()
Implementation
if (expires_at < now()) reject()
Tests
expect(expires_at < now()).toReject()
Scene 08 · Validation
$ running checks
Tests passing
Lint clean
Types valid
Scene 09 · Failure
$ 14 hours later
Production issue detected
Scene 10 · Reflection
An engineer at her desk, puzzled

"What did I miss?"

— every engineer, eventually

Even with better models, ambiguity in the specification and shared reasoning remain.

So putting more effort into the way you verify code today won't change your outcome.
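
The pattern in Example 01 can be sketched in TypeScript. This is an illustration under stated assumptions, not output from any real tool: the `Token` shape, the two candidate readings of the ticket, and `buildFromSharedPick` are all hypothetical names. It shows why a green test run carries no information when the implementation and the test come from the same pick.

```typescript
// Hypothetical token shape; the field names are assumptions for illustration.
interface Token {
  issuedAt: number; // milliseconds since epoch
  expiresAt: number; // milliseconds since epoch
}

// One interpretation of "Reject expired tokens": true means reject.
type Interpretation = (token: Token, now: number) => boolean;

// Two plausible readings of the ticket. The ticket confirms neither.
const expiryPassed: Interpretation = (t, now) => t.expiresAt < now;
const olderThan15Min: Interpretation = (t, now) =>
  now - t.issuedAt > 15 * 60 * 1000;

// A single reasoning path: the implementation and the test oracle
// are both built from the same pick.
function buildFromSharedPick(pick: Interpretation) {
  const implementation: Interpretation = (t, now) => pick(t, now);
  const testsPass = (samples: Token[], now: number): boolean =>
    samples.every((t) => implementation(t, now) === pick(t, now));
  return { implementation, testsPass };
}

const now = Date.UTC(2024, 0, 1);
const samples: Token[] = [
  { issuedAt: now - 20 * 60 * 1000, expiresAt: now + 60 * 60 * 1000 }, // old, not yet expired
  { issuedAt: now - 60 * 1000, expiresAt: now - 1 }, // fresh, already expired
];

// Whichever reading the model picks, its own tests go green.
const greenForExpiry = buildFromSharedPick(expiryPassed).testsPass(samples, now);
const greenForAge = buildFromSharedPick(olderThan15Min).testsPass(samples, now);
console.log(greenForExpiry, greenForAge); // true true
```

Both readings produce passing tests, so the test run cannot tell you which reading the ticket's author meant.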

Ch. 02  ·  The cost

Verification increases token cost.

This is already happening in production workflows.

As one engineer on the Claude Code team observed, based on user calls and real-world usage:

Thariq @trq212

done about 10 of these [user conference] calls so far+ looked at more transcripts many learnings but one of the biggest is that it's very easy to spend a lot of tokens on open ended verification that doesn't make your output better...

When verification cannot establish correctness, teams compensate by doing more of it:

Token usage increases.

Costs become unpredictable.

The burden shifts back onto you.

Reviewing outputs. Tracing logic. Trying to establish correctness by hand.

The outcome does not improve.

Ch. 03  ·  The contradiction

Given this:

  • ambiguity in the specification cannot be eliminated
  • the same reasoning is reused across the system
  • validation within a shared session cannot correct it

Then:

Correctness cannot be established within a shared reasoning path.

It must be enforced through:

  • independent derivation of artefacts
  • adversarial challenge between reasoning systems
  • verification that does not reuse the same reasoning path
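
As a minimal sketch of the enforcement list above, suppose two sessions each derived a rule from the same ticket without seeing each other's work. The two rules, the `Token` shape, and `findDisagreements` are hypothetical stand-ins, not the product's API. Probing both rules with the same inputs is one simple way a difference in interpretation could surface as a concrete disagreement.

```typescript
// Hypothetical token shape; field names are assumptions.
interface Token {
  issuedAt: number; // milliseconds since epoch
  expiresAt: number; // milliseconds since epoch
}

// A rule returns true when the token should be rejected.
type Rule = (token: Token, now: number) => boolean;

// Stand-ins for artefacts derived in separate sessions from the same ticket.
const implementationRule: Rule = (t, now) => t.expiresAt < now;
const testOracle: Rule = (t, now) => now - t.issuedAt > 15 * 60 * 1000;

interface Disagreement {
  token: Token;
  implementationRejects: boolean;
  oracleRejects: boolean;
}

// Evaluate both rules on the same probes and collect every input
// on which they give different answers.
function findDisagreements(
  impl: Rule,
  oracle: Rule,
  probes: Token[],
  now: number,
): Disagreement[] {
  const found: Disagreement[] = [];
  for (const token of probes) {
    const implementationRejects = impl(token, now);
    const oracleRejects = oracle(token, now);
    if (implementationRejects !== oracleRejects) {
      found.push({ token, implementationRejects, oracleRejects });
    }
  }
  return found;
}

const checkedAt = Date.UTC(2024, 0, 1);
const minute = 60 * 1000;
const probes: Token[] = [
  { issuedAt: checkedAt - 5 * minute, expiresAt: checkedAt + 55 * minute }, // both accept
  { issuedAt: checkedAt - 20 * minute, expiresAt: checkedAt + 40 * minute }, // rules disagree
  { issuedAt: checkedAt - 90 * minute, expiresAt: checkedAt - 30 * minute }, // both reject
];

const disagreements = findDisagreements(implementationRule, testOracle, probes, checkedAt);
console.log(disagreements.length); // 1
```

The single disagreement is the 20-minute-old token that has not yet reached its expiry time, which is exactly the case the ticket leaves open.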

Ch. 04  ·  The principle

Correctness cannot be established within a single reasoning system.

Correctness requires independent reasoning.

Independent reasoning requires separation.

The reasoning used to generate the code must not be the same reasoning used to validate it.

The Claude Code team is seeing this in its own workflows. As Boris Cherny, the creator of Claude Code, put it:

Boris Cherny @bcherny

...what helps [code quality problems] is also having the model code review its code using a fresh context window...

Ch. 05  ·  The architecture

Adversarial TDD enforces this separation.

It structures the interaction so that reasoning paths are independent and can challenge each other.

This is enforced in code: a harness orchestrates the workflow rather than relying on prompts alone.

As a result, when reasoning paths disagree, the system does not silently resolve the conflict; it exposes it.

How it works in practice

You stay in Claude Code. Adversarial TDD runs in the background.

If no ambiguity is encountered

You don't need to do anything.

You receive the completed code and a session summary in Claude Code.

If ambiguity cannot be resolved

The unresolved question is returned to you for clarification. You stay in Claude Code, and Adversarial TDD stays in the background.

Instead of test-passing errors silently propagating through the pipeline, failure becomes a signal.
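
The flow described above could be orchestrated roughly as follows. This is a sketch, not the Adversarial TDD harness itself: the `Sessions` interface, `runHarness`, and the retry limit are assumptions, and real sessions would be asynchronous model calls rather than the synchronous stubs used here for brevity.

```typescript
// The two outcomes described above: finished code, or an escalated question.
type Outcome =
  | { kind: "completed"; summary: string }
  | { kind: "escalated"; question: string };

// Hypothetical session interface. Each derive* call is assumed to run
// in its own isolated context and to see only the requirement text.
interface Sessions {
  deriveImplementation(requirement: string): string;
  deriveTests(requirement: string): string;
  runTests(implementation: string, tests: string): boolean;
  // Returns revised artefacts, or null when the sessions cannot agree.
  tryResolve(
    requirement: string,
    implementation: string,
    tests: string,
  ): { implementation: string; tests: string } | null;
}

function runHarness(
  requirement: string,
  sessions: Sessions,
  maxResolveAttempts = 2, // assumed limit, not a documented setting
): Outcome {
  let implementation = sessions.deriveImplementation(requirement);
  let tests = sessions.deriveTests(requirement);

  for (let attempt = 0; attempt <= maxResolveAttempts; attempt++) {
    if (sessions.runTests(implementation, tests)) {
      return {
        kind: "completed",
        summary: `Tests passing after ${attempt} resolve attempt(s)`,
      };
    }
    if (attempt === maxResolveAttempts) break;
    const resolved = sessions.tryResolve(requirement, implementation, tests);
    if (resolved === null) break; // disagreement stands: escalate
    ({ implementation, tests } = resolved);
  }

  return {
    kind: "escalated",
    question: `Requirement "${requirement}" is underspecified; please clarify.`,
  };
}

// Stub sessions that reach different interpretations and cannot agree.
const stubSessions: Sessions = {
  deriveImplementation: () => "expires_at < now()",
  deriveTests: () => "token older than 15 minutes",
  runTests: (implementation, tests) => implementation === tests,
  tryResolve: () => null,
};

const outcome = runHarness("Reject expired tokens", stubSessions);
console.log(outcome.kind); // escalated
```

The design choice worth noting is that a failed test run never ends in a silent patch: the harness either reaches agreement within a bounded number of attempts or hands the question back to a person.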

Scene 01 · Title

Example 02

How we prevent test-passing errors

Same ambiguity.

No test-passing error reaches production.

Scene 02 · Isolated implementation
Isolated implementation session
Ticket · AUTH-142 Open
Opened by @jess · 3 days ago
Reject expired tokens
AI · isolated implementation
if (token.expires_at < now()) reject()
Scene 03 · Isolated test derivation
Isolated test-derivation session
Ticket · AUTH-142 Open
Opened by @jess · 3 days ago
Reject expired tokens
AI · isolated test derivation
expect(tokenOlderThan15Minutes()).toBeRejected()
Scene 04 · Two interpretations
Two isolated sessions. Two interpretations.
Implementation
if (token.expires_at < now()) reject()
Tests
expect(tokenOlderThan15Minutes()).toBeRejected()
Scene 05 · Detection & escalation
$ running checks
Tests failed
Try to resolve
Ambiguity detected
Expiry rule is underspecified
Scene 06 · We saw a signal, not a bug
We saw a signal, not a bug.
Independent sessions reached different interpretations.
Tried, couldn't agree.
Ambiguity exposed as disagreement.
Escalated to human to resolve.
Scene 07 · Requirement clarified
Ticket · AUTH-142 Needs clarification
Returned by adversarial-tdd · just now
Scene 08 · Re-run
$ running checks
Interpretations aligned
Tests passing
Scene 09 · Outcome
$ 14 hours later
No production issues
Scene 10 · Reflection
The same engineer at her desk, calm and satisfied

"Nothing broke."

— every engineer, eventually

Adversarial TDD mirrors the independent derivation principle used in safety-critical software engineering: correctness is not guaranteed, but multiple independent reasoning paths increase the chance that incorrect assumptions surface early and trigger escalation.

Ch. 06  ·  The invitation

Getting access to Adversarial TDD.

Adversarial TDD is currently in development.

We're speaking with engineers working on systems where correctness matters.

If this matches your workflow, you can request early access.

You stay in Claude Code. Adversarial TDD runs in the background.

Join the early access list

Where do you most often see AI-generated code fail after passing tests?
How do you currently try to catch or reduce these issues?

We'll reach out with updates and early access details.