When your AI tests pass but the data is fake

2026.07.05// field-notes6 min read

I shipped a feature last month that passed every test I had, and it was quietly wrong the whole time. Nothing threw. There was no red in the console, no failed check, no stack trace anywhere. The app loaded, the numbers filled in, the form said saved. It was just showing values that were never real, and the test suite that existed to catch exactly that had signed off on all of it.

The thing that finally caught it was not a smarter test. It was an independent audit pass, a second agent that had never seen the code, handed one job: ignore whether it runs and tell me whether the answers are right. It found two bugs in an afternoon that my green suite had been stepping over for days. That is the whole lesson in one line. A passing test proves the code ran without throwing. It does not prove the output is correct, and on a vibe coded build those are very different claims that a green checkmark quietly blurs together.

why do the tests pass when the app is still wrong?

A test suite measures a surface signal. It runs the code, watches for an exception, checks a value against what the code was expected to produce, and turns green when nothing blows up. That is useful work, and it earns its keep. It is also a much smaller claim than we let it stand in for. Green means the code executed and matched the assertions someone wrote down. Whether those assertions describe the right answer is a separate question the suite never asks.

Here is where vibe coding sharpens the problem. When the same agent writes the code and the tests in the same breath, the tests inherit the code's assumptions instead of checking them. If the code decided that a failed parse should quietly become fifty percent, the test it writes will assert that a failed parse becomes fifty percent, and both will agree, forever, in perfect green. The suite is not testing the behavior against reality. It is testing the code against a copy of its own beliefs. I think of it as surface signals propose, semantic measurement disposes. The surface, did it run, did it match, proposes that things are fine. Only a semantic check, is this actual number the actual right number, can dispose of the question, and that check has to come from outside the code's own assumptions to mean anything. This is the tax every vibe coder pays for letting the agent write both the code and the proof that the code works.

the two bugs my tests were built to miss

So I stopped adding tests and ran an audit instead. Not more of my own tests, which would just be more copies of my own assumptions. I set up a separate pass whose only instruction was to look at the real outputs and decide whether they were correct, with no access to the reasoning that produced them and no stake in defending it. A second reader that did not already believe the code was right.

It came back with two. The first was a value that defaulted on failure. Somewhere in a parsing step, when the input did not match what the code expected, it fell back to a plausible-looking default and carried on as if nothing had happened. Downstream, that default rendered like real data. No error, because a default is not an error to the code, it is the code working as written. The second was staler and quieter: a path that served a previously cached value when a fresh fetch came back empty, so the screen showed last week's answer wearing this week's timestamp. Both were non-throwing. Both passed every test. Both were exactly the kind of failure a green suite is structurally unable to see, because nothing about them looks like breakage from the inside.

how do you tell 'it ran' from 'it's right'?

The move that changes the odds is cheap, and it is mostly a change of habit. Treat "it ran without throwing" as unverified, not as done. The absence of an error is the absence of information, not a passing grade. Until something has checked the actual output against the actual expected value, all you know is that the code did not crash, which is the lowest bar it could clear.

Assert on real values, not on "no error." A test that only confirms a function returned something, or returned without throwing, is testing that the lights are on. Make it check that the number is the number, that the saved record holds what was submitted, that the account showing on the dashboard is the right account. Those assertions are more annoying to write, which is exactly why an agent optimizing for a green run tends not to write them on its own.

And when the stakes are real, get a reader from outside the code's own head. An independent audit pass, a second agent that did not write the original and is told only to verify the outputs, catches the class of bug that same-author tests are blind to by construction. It is the software version of not proofreading your own writing. You read what you meant to write. A stranger reads what is actually on the page. Real AI orchestration is less about trusting the green run than about owning the definition of proof.

where this bites hardest

The pattern is worst exactly where it is easiest to trust the happy path. Parse failures that default to a plausible number instead of stopping. Empty results that render as zeros the reader takes for real measurements. Access checks that pass in every test because the test always logs in as the same user, so the one case where an account can see another account's data never runs. A vibe coded app can look completely healthy from the outside, tests green, demo clean, and be silently handing back fabricated or leaked values the whole time. The louder failure, the crash, at least announces itself. The silent one gets trusted, and it gets trusted longest by the person who built it, because they watched the tests pass.

This is the same class of problem I ran into when the audit log trusted the proxy and the trail was silently wrong, and it rhymes with the way fixing one bug quietly breaks another when a rule lives in more than one place. The common thread is that surface signals, a green test, a clean log, a passing build, are proposals about correctness, not proof of it.

If you are shipping a vibe coded app and there is output you are not fully sure you can trust, /work-with-us. Send me the build and the numbers or records you keep half-checking by hand, and I can run an independent audit pass over the real outputs, the kind that assumes nothing and asserts on values instead of on "it didn't throw," and set up the checks that would have caught it the first time. Work with VibeKoded.

A green test suite is not lying to you. It is answering a smaller question than you are hearing. It tells you the code ran. Whether the answer is right is a different check, and on a vibe coded build it is the one worth making yourself, because the app that only looks like it works will keep looking that way right up until someone believes it.

// part of the custom apps topic

why do the tests pass when the app is still wrong?

the two bugs my tests were built to miss

how do you tell 'it ran' from 'it's right'?

where this bites hardest

// see also

Why fixing one AI bug breaks another

The audit log that trusted the proxy

A real delegation handoff, spec tight enough to execute cold