A correctness system for making AI-generated code reliable

Validate AI-generated code that passes tests but fails in production.

Stop using code-verification approaches that waste your time and still let errors get through undetected.

Free. No new tools. Works inside Claude Code. Just correctness. Enforced through a harness.

Ch. 01 · The diagnosis

Test-passing errors are not anomalies.

They are a structural property of AI-driven code development.

specifications are written in ambiguous natural language
existing code may already contain an incorrect assumption that the model inherits
early misinterpretations occur and are not corrected
the entire process runs in a shared session that agrees with those misinterpretations
specification, interface, implementation, and tests share the same assumptions
incorrect reasoning propagates through the system

Andrej Karpathy @karpathy

...the models make wrong assumptions on your behalf and just run along with them without checking...

Scene 01 · Title

Example 01

How test-passing errors arise

Tests pass.

The code is still wrong.

Scene 02 · Requirement

Ticket · AUTH-142 Open

Opened by @jess · 3 days ago

Reject expired tokens

Scene 03 · Interpretation

Ticket · AUTH-142 Open

Opened by @jess · 3 days ago

Reject expired tokens

✦ AI code generator

expires_at < now()

Scene 04 · The behaviour

Faced with ambiguity, the model picked instead of asking.

expires_at < now()

Scene 05 · As code

That pick becomes the function.

if (expires_at < now()) reject()

Scene 06 · As test

That pick becomes the test.

expect(expires_at < now()).toReject()

Scene 07 · The pattern

Everything agrees. Still wrong.

Requirement (ambiguous English)

Reject expired tokens

Specification

expires_at < now()

Implementation

if (expires_at < now()) reject()

Tests

expect(expires_at < now()).toReject()

Scene 08 · Validation

$ running checks
✓ Tests passing
✓ Lint clean
✓ Types valid

Scene 09 · Failure

$ 14 hours later
✗ Production issue detected

Scene 10 · Reflection

"What did I miss?"

— every engineer, eventually

Even with better models, ambiguity in the specification and shared reasoning remain.

So putting more effort into the way you verify code today won't change the outcome.
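The pattern in Example 01 can be reproduced in a few lines. The TypeScript below is a minimal sketch, not taken from any real codebase: the `AuthToken` shape, the `isExpired` function, and the 15-minute maximum age used as the "intended" rule are all assumptions made for illustration. It shows a test derived from the same reading as the implementation passing, while a token the assumed intent would reject is still accepted.

```typescript
// Hypothetical token shape; field names are illustrative only.
interface AuthToken {
  issuedAt: number;  // epoch milliseconds
  expiresAt: number; // epoch milliseconds
}

// The model's reading of "Reject expired tokens": expires_at < now().
function isExpired(token: AuthToken, now: number): boolean {
  return token.expiresAt < now;
}

// A test derived from the same reading. It restates the implementation,
// so it can only ever agree with it.
function sharedAssumptionTestPasses(now: number): boolean {
  const stale: AuthToken = { issuedAt: now - 2000, expiresAt: now - 1000 };
  const fresh: AuthToken = { issuedAt: now - 1000, expiresAt: now + 1000 };
  return isExpired(stale, now) === true && isExpired(fresh, now) === false;
}

// Assumed intent (NOT stated in the ticket): tokens older than 15 minutes
// must also be rejected, whatever their expires_at says.
const MAX_AGE_MS = 15 * 60 * 1000;
function intendedToReject(token: AuthToken, now: number): boolean {
  return token.expiresAt < now || now - token.issuedAt > MAX_AGE_MS;
}

const currentTime = 1700000000000;
// Issued an hour ago, with an expiry a day in the future.
const longLived: AuthToken = {
  issuedAt: currentTime - 60 * 60 * 1000,
  expiresAt: currentTime + 24 * 60 * 60 * 1000,
};

const testsPass = sharedAssumptionTestPasses(currentTime);
const accepted = !isExpired(longLived, currentTime);
const shouldReject = intendedToReject(longLived, currentTime);
console.log({ testsPass, accepted, shouldReject });
```

Under these assumptions the test suite is green and the long-lived token is accepted, even though the assumed intent would reject it. Nothing inside the shared reading can reveal the gap.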

Ch. 02 · The cost

Verification increases token cost.

This is already happening in production workflows.

Drawing on user calls and real-world usage, one engineer on the Claude Code team observed:

Thariq @trq212

done about 10 of these [user conference] calls so far + looked at more transcripts

many learnings but one of the biggest is that it's very easy to spend a lot of tokens on open ended verification that doesn't make your output better...

When verification cannot establish correctness, teams compensate by doing more of it:

more prompts
more skills
more tests
more model runs

Token usage increases.

Costs become unpredictable.

The burden shifts back onto you.

Reviewing outputs. Tracing logic. Trying to establish correctness by hand.

The outcome does not improve.

Ch. 03 · The contradiction

Given this:

ambiguity in the specification cannot be eliminated
the same reasoning is reused across the system
validation within a shared session cannot correct it

Then:

Correctness cannot be established within a shared reasoning path.

It must be enforced through:

independent derivation of artefacts
adversarial challenge between reasoning systems
verification that does not reuse the same reasoning path

Ch. 04 · The principle

Correctness cannot be established within a single reasoning system.

Correctness requires independent reasoning.

Independent reasoning requires separation.

The reasoning used to generate the code must not be the same reasoning used to validate it.

This matches what the Claude Code team is seeing in its own workflows. As Boris Cherny, the creator of Claude Code, put it:

Boris Cherny @bcherny

...what helps [code quality problems] is also having the model code review its code using a fresh context window...

Ch. 05 · The architecture

Adversarial TDD enforces this separation.

It structures the interaction so that reasoning paths are independent and can challenge each other.

This separation is enforced in code, by a harness that orchestrates the workflow, rather than by prompts alone:

Each SDLC stage is derived independently.
Specification, interface, implementation, and tests are derived from independent reasoning that does not share context or assumptions.
The specification is interpreted separately from the implementation.
The implementation is produced without access to the validation reasoning.
Validation is performed without reusing the assumptions that generated the code.
A single run involves at least 73 reasoning attempts and 69 adversarial attacks to surface hidden assumptions.

As a result, when reasoning paths disagree, the system does not silently resolve the conflict; it exposes it:

Correctness is no longer inferred from agreement.
It emerges from independent reasoning that withstands challenge.
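One way to picture the separation described above is a harness that hands each stage a fresh context containing only the requirement text, never another stage's output. The sketch below is an illustrative assumption, not the Adversarial TDD harness itself: `StageName`, `FreshContext`, `deriveInIsolation`, and the stand-in stage functions are hypothetical names, and real stages would call a model session rather than return fixed strings.

```typescript
// Hypothetical stage names, mirroring the four artefacts named in the text.
type StageName = "specification" | "interface" | "implementation" | "tests";

// A stage sees only a fresh, read-only context built from the requirement.
interface FreshContext {
  readonly requirement: string;
}

type Stage = (ctx: FreshContext) => string;

// Run every stage with its own frozen context. No stage receives another
// stage's artefact, so an assumption made in one cannot leak into the others.
function deriveInIsolation(
  requirement: string,
  stages: Record<StageName, Stage>,
): Record<StageName, string> {
  const run = (stage: Stage): string => stage(Object.freeze({ requirement }));
  return {
    specification: run(stages.specification),
    interface: run(stages.interface),
    implementation: run(stages.implementation),
    tests: run(stages.tests),
  };
}

// Stand-in stages. In a real harness each would be a separate model session.
const standInStages: Record<StageName, Stage> = {
  specification: (ctx) => `spec for: ${ctx.requirement}`,
  interface: (ctx) => `interface for: ${ctx.requirement}`,
  implementation: (ctx) => `implementation for: ${ctx.requirement}`,
  tests: (ctx) => `tests for: ${ctx.requirement}`,
};

const artefacts = deriveInIsolation("Reject expired tokens", standInStages);
console.log(artefacts);
```

The design choice worth noting is that isolation lives in the function signature: a stage has no parameter through which another stage's reasoning could reach it.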

How it works in practice

You start in Claude Code.
You run the /atdd-start slash command.
Claude Code asks structured questions about the code you want to build.
You explain it by chatting with Claude Code.
Claude Code hands the work off to Adversarial TDD.
From that point, the Adversarial TDD harness orchestrates the workflow in the background.
You see progress updates in Claude Code.

You stay in Claude Code. Adversarial TDD runs in the background.

If no ambiguity is encountered

You don't need to do anything.

You receive the completed code and a session summary in Claude Code.

If ambiguity cannot be resolved

Claude Code brings you back into the loop.
You clarify the intent by chatting with Claude Code.
Claude Code passes your decision back to Adversarial TDD.
The workflow continues.

You are always in Claude Code. Adversarial TDD stays in the background.

Instead of test-passing errors silently propagating through the pipeline, failure becomes a signal.
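The two branches above amount to a small control loop. The following is a hedged sketch of that loop under assumed names: `CheckResult`, `runWorkflow`, `askHuman`, and the retry limit are all hypothetical, and the source does not describe how the harness implements this.

```typescript
// Hypothetical result of one cross-check between independently derived artefacts.
type CheckResult =
  | { kind: "aligned" }
  | { kind: "disagreement"; question: string };

type Outcome = { status: "completed"; clarifications: string[] };

// Assumed limit on automatic resolution attempts before a human is asked.
const MAX_AUTO_ATTEMPTS = 2;

function runWorkflow(
  requirement: string,
  check: (requirement: string, clarifications: string[]) => CheckResult,
  askHuman: (question: string) => string,
): Outcome {
  const clarifications: string[] = [];
  for (;;) {
    let result = check(requirement, clarifications);
    // Try to resolve automatically by re-checking a bounded number of times.
    for (let attempt = 0; result.kind === "disagreement" && attempt < MAX_AUTO_ATTEMPTS; attempt++) {
      result = check(requirement, clarifications);
    }
    if (result.kind === "aligned") {
      return { status: "completed", clarifications };
    }
    // Still ambiguous: bring the human back into the loop, then continue.
    clarifications.push(askHuman(result.question));
  }
}

// Stand-in check: ambiguous until one clarification has been supplied.
const outcome = runWorkflow(
  "Reject expired tokens",
  (_req, given) =>
    given.length === 0
      ? { kind: "disagreement", question: "Which expiry rule applies?" }
      : { kind: "aligned" },
  () => "Reject when expires_at is in the past",
);
console.log(outcome);
```

In this sketch a disagreement never produces a result on its own; the only ways out of the loop are alignment or a human decision.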

Scene 01 · Title

Example 02

How we prevent test-passing errors

Same ambiguity.

No test-passing error reaches production.

Scene 02 · Isolated implementation

Isolated implementation session

Ticket · AUTH-142 Open

Opened by @jess · 3 days ago

Reject expired tokens

✦ AI · isolated implementation

if (token.expires_at < now()) reject()

Scene 03 · Isolated test derivation

Isolated test-derivation session

Ticket · AUTH-142 Open

Opened by @jess · 3 days ago

Reject expired tokens

✦ AI · isolated test derivation

expect(tokenOlderThan15Minutes()).toBeRejected()

Scene 04 · Two interpretations

Two isolated sessions. Two interpretations.

Implementation

if (token.expires_at < now()) reject()

≠

Tests

expect(tokenOlderThan15Minutes()).toBeRejected()

Scene 05 · Detection & escalation

$ running checks
✗ Tests failed

↻ Try to resolve

Ambiguity detected

Expiry rule is underspecified

Scene 06 · We saw a signal, not a bug

We saw a signal, not a bug.

Independent sessions reached different interpretations.

Tried, couldn't agree.

Ambiguity exposed as disagreement.

Escalated to human to resolve.

Scene 07 · Requirement clarified

Ticket · AUTH-142 Needs clarification

Returned by adversarial-tdd · just now

Scene 08 · Re-run

$ running checks
✓ Interpretations aligned
✓ Tests passing

Scene 09 · Outcome

$ 14 hours later
✓ No production issues

Scene 10 · Reflection

"Nothing broke."

— every engineer, eventually

Adversarial TDD mirrors the independent derivation principle used in safety-critical software engineering: correctness is not guaranteed, but multiple independent reasoning paths increase the chance that incorrect assumptions surface early and trigger escalation.
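To make Example 02 concrete, the sketch below compares two independently derived readings of "Reject expired tokens" over a handful of sample tokens and reports any token on which they disagree. Everything here is an assumption for illustration: the `SampleToken` shape, the 15-minute rule attributed to the test-derivation session, and the `findDisagreements` helper are hypothetical, and a real harness would run derived tests against derived code rather than compare predicates directly.

```typescript
// Hypothetical token shape used only for this comparison.
interface SampleToken {
  label: string;
  issuedAt: number;  // epoch milliseconds
  expiresAt: number; // epoch milliseconds
}

type Reading = (token: SampleToken, now: number) => boolean;

// Reading from the isolated implementation session.
const implementationRejects: Reading = (t, now) => t.expiresAt < now;

// Reading from the isolated test-derivation session (assumed 15-minute age limit).
const testsExpectReject: Reading = (t, now) => now - t.issuedAt > 15 * 60 * 1000;

// Return the labels of tokens where the two readings give different answers.
function findDisagreements(tokens: SampleToken[], now: number, a: Reading, b: Reading): string[] {
  return tokens.filter((t) => a(t, now) !== b(t, now)).map((t) => t.label);
}

const checkTime = 1700000000000;
const minute = 60 * 1000;
const samples: SampleToken[] = [
  { label: "fresh", issuedAt: checkTime - 1 * minute, expiresAt: checkTime + 10 * minute },
  { label: "old-but-unexpired", issuedAt: checkTime - 30 * minute, expiresAt: checkTime + 10 * minute },
  { label: "old-and-expired", issuedAt: checkTime - 30 * minute, expiresAt: checkTime - 1 * minute },
];

const disagreements = findDisagreements(samples, checkTime, implementationRejects, testsExpectReject);
console.log(disagreements); // ["old-but-unexpired"]
```

The single disagreeing token is the escalation signal: it marks the exact case the ticket leaves underspecified, which is the question a human needs to answer.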

Getting access to Adversarial TDD.

Join the early access list