// Make the bad state impossible · lesson 06

Safety by structure, not by prompt

The most common safety mistake I see, and one I've made, is writing the guardrail into the prompt. "Never delete production data." "Always confirm before sending." "Don't touch the money." It reads like a safeguard. It's really just a hope with good grammar, and it fails for a reason that's baked into how these models work.

The prompt is the one layer the model is free to reinterpret. From the foundations track: instructions have a priority order, and a rule in the conversation is a decaying influence, weaker the further back it sits, competing with everything since. So your careful "never delete production data," written a hundred messages ago, is now a faint voice under a mountain of more recent context, and the model can misread it, deprioritize it, or reason its way around it in a moment where some other goal feels more pressing. If the only thing stopping the catastrophe is words the model is allowed to reinterpret, then the catastrophe is one bad reinterpretation away, and reinterpretation is the model's native mode.

Why can't you just word the instruction strongly enough?

Because the problem isn't the wording, it's the layer. No phrasing, however emphatic, changes the fact that a prompt is a suggestion the model weighs against everything else, not a constraint that binds it. You can write "NEVER, under ANY circumstances" in all caps and it's still living in the layer the model gets to interpret, still one long context or one clever framing away from being outweighed. Strength of language is not a substitute for structural impossibility. A locked door doesn't ask nicely, and no amount of asking nicely becomes a lock.

Where the constraint actually belongs

In the layer the model cannot touch. A permission it wasn't granted, so the dangerous action is simply unavailable. A deterministic function it has to route through, so the value can't come out wrong. A validation gate that rejects the bad state mechanically, no matter what the model intended. Put the real constraint there, in structure, and let the prompt do what prompts are actually good for: shaping helpful behavior on the vast majority of cases where being wrong is merely unfortunate. Prompts for guidance, structure for guarantees. Never confuse the two.
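The three structural layers named above (an ungranted permission, a deterministic function, a validation gate), sketched as a tool-call dispatcher. This is an added example, not code from the original lesson; the tool names, the order data and the refund cap are all hypothetical.

```python
# Added example; every name here is hypothetical, not from the lesson.
from decimal import Decimal

# Constructed sample data standing in for a system of record.
ORDERS = {"A100": {"paid": Decimal("42.50"), "refunded": Decimal("0")}}
REFUND_CAP = Decimal("100.00")

class Rejected(Exception):
    """Raised when a tool call fails a structural check."""

def known_order(args):
    order_id = args.get("order_id")
    if not isinstance(order_id, str) or order_id not in ORDERS:
        raise Rejected("unknown order")
    return order_id

def refund_amount(order_id):
    # Deterministic function: the amount is derived from the order record,
    # never from a number the model wrote.
    order = ORDERS[order_id]
    return order["paid"] - order["refunded"]

def issue_refund(args):
    order_id = known_order(args)
    amount = refund_amount(order_id)  # any "amount" key in args is ignored
    # Validation gate: reject the bad state mechanically, whatever was intended.
    if amount <= 0 or amount > REFUND_CAP:
        raise Rejected("refund outside allowed range")
    ORDERS[order_id]["refunded"] += amount
    return {"order_id": order_id, "refunded": str(amount)}

def lookup_order(args):
    order = ORDERS[known_order(args)]
    return {k: str(v) for k, v in order.items()}

# Permission layer: a tool that is not listed here does not exist for this
# agent. There is deliberately no delete tool to call.
GRANTED_TOOLS = {"lookup_order": lookup_order, "issue_refund": issue_refund}

def dispatch(call):
    """Run one model-proposed tool call; the prompt plays no part here."""
    name = call.get("tool")
    handler = GRANTED_TOOLS.get(name) if isinstance(name, str) else None
    if handler is None:
        raise Rejected(f"tool not granted: {name!r}")
    args = call.get("args")
    if not isinstance(args, dict):
        raise Rejected("malformed arguments")
    return handler(args)
```

None of these checks read the conversation, so a long context or a clever framing cannot change what they allow.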

The takeaway: a guardrail in the prompt is a suggestion the model can reinterpret, so put the constraints that matter in the layers the model can't touch, and keep the prompt for guidance, not guarantees.