// lesson 06

When it breaks, triage before you rebuild

updated 2026.07.05// field-manual

I lost an evening to a rendering bug where some icons just would not show up. They were in the code, they were positioned right, they simply rendered invisible. Every instinct I had said rebuild the renderer, it's clearly broken at a level I can't see. I didn't. I diagnosed instead. The actual cause was one line: the icon was being drawn in white, and then the renderer multiplied it by the cluster's color, so a white icon on its own color channel came out as nothing. It camouflaged itself. One-line fix. The whole-renderer rebuild I almost did would have taken a day and reintroduced three bugs I'd already fixed.

That's the reflex this lesson is about killing. When an AI-built thing breaks or starts drifting, the pull toward "just regenerate it from scratch" is enormous, and it's almost always wrong. Regeneration feels like progress because you're typing and things are happening. What you're actually doing is throwing away every bug you'd already quietly solved and betting the new version won't have its own. Triage first. Rebuild is the last resort, not the first move.

Why is rebuilding so tempting and so wrong?

Because a broken system you don't understand feels infinite, and a blank prompt feels clean. But the broken system is mostly working, that's the part you can't see when you're frustrated. Ninety percent of it is fine. The break is usually small and local, one line, one assumption, one edge case. When you regenerate, you discard the ninety percent that worked to avoid diagnosing the ten percent that didn't, and the new ten percent that breaks will be different and just as annoying.

The frontier teams design for exactly this recovery instead of blind restart. Anthropic and OpenAI both build agents that halt on failure, isolate the broken step, and hand off for a controlled fix rather than letting a confused system flail or wiping the slate. (Anthropic, "Building Effective Agents" and OpenAI, "A practical guide to building agents") The instinct they encode is diagnose and contain, not tear down and repour.

The failure mode: surface fix versus semantic fix

The subtle trap in triage is fixing the symptom instead of the cause. The icons were invisible, so the surface fix is "make them brighter" or "add an outline," and it might even paper over the problem. But the cause was a color-multiply doing exactly what it was told. Patch the symptom and the same root cause resurfaces somewhere else next week. Triage means chasing the bug down to the actual line that's wrong, not the place where it happens to show up. Surface and cause are often nowhere near each other.

The takeaway: the rewrite is a tax disguised as a shortcut. Find the one line first. You'll be right more often than your panic wants you to be, and the fix will hold.

Questions that keep coming up

When is rebuilding actually the right call? When you've diagnosed and the architecture itself is the bug, not a line in it. Rebuild on a diagnosis, never on a mood. If you can't say specifically why the current thing can't be fixed, you haven't earned the rewrite yet.

How do I stop myself from rage-regenerating? Make yourself name the failing line out loud before you're allowed to touch anything. The demand for a specific diagnosis is usually enough to break the reflex, because the moment you find the line, the rebuild stops looking necessary.

Next: Lesson 7, externalize your memory. Own the state, not the thread.

If you inherited an AI-built codebase that's drifting and want someone to think through the stabilization with you, /work-with-us.