The feature flag that silently disabled prompt caching

I found it in the response metadata, not the error log. The error log had nothing to report, because nothing had gone wrong in the way an error log understands. The automation still ran. It still produced the right output. It was just paying full price for every call, quietly, for longer than I want to admit.

The setup was a standing automation I vibe coded a while back: a small agent loop that runs on a schedule, reads a large pile of mostly static context, and acts on it. That shape is exactly what prompt caching is built for. Cache the stable prefix once, then pay a fraction to read it back on every later call. With caching working, the per-call cost sat right where I expected. I had measured it early, written the number down, and stopped thinking about it. That last part was the mistake.

the change that looked harmless

A few weeks in, I turned on the model's extended reasoning mode for one step in the loop. The output quality went up, which is what I was after, so I shipped it and moved on. What I did not know was that flipping that one toggle changed the request enough that the cached prefix no longer matched. Every call started missing the cache and reprocessing the full context from scratch. Caching did not error out. It did not warn. It just silently stopped happening, and the cost of reprocessing that large static context moved from a rounding error to the dominant line on the bill.

Nothing in the system was wired to notice. The run succeeded. The output passed its checks. The only place the truth showed up was a single field in the provider's response metadata, the one that reports how many tokens were read from cache. It had been a healthy number. Now it was zero, on every call, and it had been zero since the day I turned on the reasoning step.
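A minimal sketch of reading that counter on every call, rather than only when something feels off. The field name `cache_read_input_tokens` and the shape of the `usage` dictionary are assumptions for illustration; providers label this counter differently, so check the response schema you actually receive. The token counts are made-up values, not measurements.

```python
from typing import Any, Mapping

# Hypothetical field name: substitute whatever your provider's metadata calls it.
CACHE_READ_FIELD = "cache_read_input_tokens"


def cache_read_tokens(usage: Mapping[str, Any], field: str = CACHE_READ_FIELD) -> int:
    """Return the cache-read token count, treating a missing field as zero."""
    value = usage.get(field)
    return int(value) if value is not None else 0


def check_cache(usage: Mapping[str, Any], expect_cache: bool = True) -> str:
    """Classify one call as 'hit', 'miss', or 'not_expected'."""
    if not expect_cache:
        return "not_expected"
    return "hit" if cache_read_tokens(usage) > 0 else "miss"


# Illustrative payloads only, not real provider output.
healthy = {"input_tokens": 120, "cache_read_input_tokens": 48000}
regressed = {"input_tokens": 48120, "cache_read_input_tokens": 0}

print(check_cache(healthy))
print(check_cache(regressed))
```

Treating a missing field as zero is deliberate: if the counter disappears from the metadata entirely, that should read as a miss, not as a pass.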

the failures that do not throw

I caught it for one reason. I had instrumented per-call token cost early, back when I first measured the automation, and I still glanced at it out of habit. The cache-read counter lived in that same metadata, so the zero was sitting right next to a number I already watched. If I had trusted the happy path the way the error log did, the automation would have run for months exactly like that, correct and expensive, and the first real signal would have been the invoice.

This is the failure mode I have learned to build for first, ahead of the obvious ones. The bugs that throw an exception announce themselves, and your error handling is already pointed at them. The bugs that cost you the most are the ones that keep working. A disabled cache. A retry loop that silently doubled. A parser that quietly returns a default of fifty percent when it fails, so the run stays green and the numbers go wrong underneath it. None of those raise anything. They sit inside a clean run and bleed, which is the whole shape of a silent automation failure: the run looks healthy right up until you read the one number that was wrong the entire time. If the only surface you build is the one that catches thrown errors, the silent regressions live in that blind spot by construction.
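The parser case is the easiest of the three to show in code. The sketch below puts a silent-default parser next to one that refuses to guess. Both functions, the regular expression, and the sample strings are hypothetical, written only to illustrate the pattern.

```python
import re

# Matches a number followed by a percent sign, e.g. "72%" or "12.5 %".
_PERCENT = re.compile(r"(\d+(?:\.\d+)?)\s*%")


def parse_percent_silent(text: str) -> float:
    """Anti-pattern: a failed parse is indistinguishable from a real 50%."""
    match = _PERCENT.search(text)
    return float(match.group(1)) if match else 50.0


def parse_percent_loud(text: str) -> float:
    """Raise on a failed parse so the run cannot stay green by accident."""
    match = _PERCENT.search(text)
    if match is None:
        raise ValueError(f"no percentage found in {text!r}")
    return float(match.group(1))


good = "confidence: 72%"
bad = "confidence: n/a"

print(parse_percent_silent(good), parse_percent_silent(bad))

try:
    parse_percent_loud(bad)
except ValueError as exc:
    print("loud failure:", exc)
```

The silent version returns a plausible number for the bad input, which is exactly why nothing downstream notices. The loud version turns the same input into something an error log can see.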

Failure-surface-first is the habit I pulled out of it. Before I trust an automation to run unattended, I ask what it would look like if this thing failed without telling me, and I build the readout that would show that failure before I need it. For a cost regression, that readout is per-call spend with the cache-read counter sitting next to it. For a data bug, it is the distribution of the output, not just the absence of an exception. The surface comes first, because the failure it catches will not arrive politely. I have made this argument before in a different unit, back when I wrote about why a vibe-coded pipeline needs observability before you let it run on its own; a silent cost regression is that same argument with a dollar sign on it. My guess is that most vibe coders running automation never look at the cache-read counter, because nothing in a green run tells them to.
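For the data-bug side, one rough way to look at the distribution of the output is to measure how much of a batch collapses onto a single value. This is a sketch under assumptions: the function names and the 0.8 threshold are hypothetical, and the right check depends on what a healthy output looks like for a given automation.

```python
from collections import Counter
from typing import Sequence


def dominant_share(values: Sequence[float]) -> float:
    """Fraction of values equal to the single most common value."""
    if not values:
        return 0.0
    _, count = Counter(values).most_common(1)[0]
    return count / len(values)


def looks_degenerate(values: Sequence[float], max_share: float = 0.8) -> bool:
    """Flag a batch where one value dominates (threshold is an assumption)."""
    return dominant_share(values) > max_share


# Illustrative batches, not real data.
varied = [12.0, 47.5, 50.0, 88.0, 31.0]
stuck = [50.0] * 9 + [72.0]

print(dominant_share(varied), looks_degenerate(varied))
print(dominant_share(stuck), looks_degenerate(stuck))
```

A batch that is ninety percent one value is not proof of a bug, but it is the kind of shape an exception handler will never show.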

what this means if you run automation

If you vibe code automations that run on their own, the move is to assume the worst failures will be silent and to instrument for them up front. Pick the two or three numbers that would change if the automation quietly went wrong: cost per call, cache-hit rate, the shape of the output. Put them somewhere you actually look. Not a dashboard you build once and forget. A number that sits next to one you already watch, so a bad value is loud by proximity.
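One way to make a bad value loud by proximity is a single line that prints cost per call, the recorded baseline, and the cache-hit rate together. The sketch below assumes a baseline cost was written down when the automation was first measured. `CallStats`, `readout`, the 50 percent tolerance, and the dollar figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallStats:
    cost_usd: float          # spend for one call
    cache_read_tokens: int   # tokens the provider reported as read from cache


def readout(calls: list[CallStats], baseline_cost_usd: float,
            tolerance: float = 0.5) -> str:
    """One line: average cost per call, the baseline, and the cache-hit rate."""
    if not calls:
        return "no calls recorded"
    avg_cost = sum(c.cost_usd for c in calls) / len(calls)
    hit_rate = sum(1 for c in calls if c.cache_read_tokens > 0) / len(calls)
    # Flag when average cost exceeds the baseline by more than the tolerance.
    over = avg_cost > baseline_cost_usd * (1 + tolerance)
    flag = "  <-- ABOVE BASELINE" if over else ""
    return (f"cost/call ${avg_cost:.4f} (baseline ${baseline_cost_usd:.4f})"
            f" | cache-hit {hit_rate:.0%}{flag}")


# Illustrative numbers only.
healthy_calls = [CallStats(0.01, 48000), CallStats(0.01, 48000)]
regressed_calls = [CallStats(0.10, 0), CallStats(0.10, 0)]

print(readout(healthy_calls, baseline_cost_usd=0.01))
print(readout(regressed_calls, baseline_cost_usd=0.01))
```

Keeping the baseline in the same line as the live number is the point of the design: nobody has to remember what the cost used to be to see that it moved.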

The reason this bites harder on vibe-coded work than on hand-written code is that the orchestration layer behind vibe coding hides the cost of its own mistakes well. A generated solution that works on the happy path looks finished. It compiles, it runs, it returns the right answer on the case you tested. Whether it is also caching, retrying sanely, or failing loudly when it should is a separate question, and the generator will not raise its hand about it. You have to go ask, and the asking is a surface you build. That is the unglamorous half of AI orchestration: not the part that generates, the part that watches what got generated.

If you're moving from chat-built scripts to automation that has to hold in production, and you want a second set of eyes on where it might be failing quietly, the place to start is /work-with-us. Send me the automation and where you suspect the waste is, and I can scope an instrumentation pass that puts cost and reliability where you can see them, or a spec-first rebuild that bakes the failure surfaces in from the start. Work with VibeKoded.

The caching came back the moment I knew to look. One config change, and the per-call cost dropped back to where I had measured it weeks earlier. The fix was nothing. The lesson was the gap between the day it broke and the day I noticed, and the only thing that closed that gap was a number I had decided, early, was worth watching. Build the surface for the failure that stays quiet. It is the one that bills you.