How to audit AI-generated code before you ship

// pre-launch // methodology // 8 min read

The AI delivered the code. The instinct is either to ship it (because the AI said it was done) or to be blocked from shipping (because reviewing code feels like engineering work you can't do). Neither is the right path.

The right path is a structured audit that a non-engineer operator can actually run. The audit doesn't require you to read every line and understand it the way an engineer would. It requires you to verify specific dimensions against specific criteria. The criteria are accessible; the discipline is in actually running them rather than skipping the audit because it feels technical.

I want to walk through the five dimensions of the audit, the three-leg promotion gate that integrates audit into the workflow, and the specific operator-level checks for each dimension. The framework is what makes AI-generated code genuinely shippable rather than hoped-to-be-shippable.

Dimension one: correctness

Does the code do what it was supposed to do? This is the most basic check and the most often skipped.

Operator-level checks:

Open the app. Run the flow the code is supposed to affect. Verify the visible behavior matches what was requested. This is the surface check from the four-leg acceptance pattern documented in "Why passing tests does not mean the app is fixed."

For changes that affect state (data, settings, integrations), verify the underlying state actually changed correctly. Check the database. Inspect the integration. Confirm the operation produced the result the surface suggested.
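
One way to check underlying state without reading application code is to query the database directly. The sketch below is only an illustration: it assumes a SQLite database and a hypothetical `users` table with `email` and `plan` columns, so substitute your own database, table, and field names.

```python
import sqlite3
from typing import Optional


def plan_for(conn: sqlite3.Connection, email: str) -> Optional[str]:
    """Return the stored plan for a user, or None if the user is missing."""
    row = conn.execute(
        "SELECT plan FROM users WHERE email = ?", (email,)
    ).fetchone()
    return row[0] if row else None


# An in-memory database stands in for the real one (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT PRIMARY KEY, plan TEXT)")
conn.execute("INSERT INTO users VALUES ('a@example.com', 'free')")

# Simulate the flow the change was supposed to implement: an upgrade.
conn.execute("UPDATE users SET plan = 'pro' WHERE email = 'a@example.com'")

# The surface said "upgraded"; confirm the stored value agrees.
stored_plan = plan_for(conn, "a@example.com")
```

If `stored_plan` disagrees with what the screen showed, the change passed the surface check and failed the state check.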

For changes that affect multiple flows, run the adjacent flows that share components or state with the changed code. Verify they still work the way they should.

The correctness dimension is the easiest for operators to evaluate because the test is "does the app do the right thing." You don't need to read code to test that.

Dimension two: security

Did the change introduce security exposure that wasn't there before? This dimension is the most consequential and often the one operators feel least equipped to evaluate.

Operator-level checks:

Did the change add any new credentials, API keys, or secrets to the code? Search the diff for patterns that look like keys (long strings of letters and numbers, environment variable references that don't exist elsewhere, hardcoded URLs to external services). Any of these should be configured through environment variables or secret management, never hardcoded.
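
The search for key-like strings can be partly automated. This is a minimal sketch, assuming the change is available as unified diff text (for example the output of `git diff`); the function name and the three patterns are illustrative, and a dedicated secret scanner uses far larger rule sets.

```python
import re
from typing import List

# Illustrative patterns only (assumptions, not a complete rule set).
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # shape of an AWS access key ID
    re.compile(
        r"(?i)(api[_-]?key|secret|token|password)\s*[:=]\s*['\"][^'\"]{8,}['\"]"
    ),  # a secret-sounding name assigned a quoted literal
    re.compile(r"[A-Za-z0-9_\-]{32,}"),  # long opaque strings
]


def suspicious_added_lines(diff_text: str) -> List[str]:
    """Return added lines in a unified diff that match a secret-like pattern."""
    hits = []
    for line in diff_text.splitlines():
        # Added lines start with "+"; "+++" is the file header, not content.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        if any(pattern.search(line) for pattern in SECRET_PATTERNS):
            hits.append(line[1:].strip())
    return hits
```

Expect false positives (hashes, long identifiers). Every hit is a prompt to look, not a verdict.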

Did the change add any new external network access? Search for new fetch calls, new vendor SDK imports, new webhook endpoints. Each one is a new trust boundary. Verify each is intentional and the external party is one you actually want to integrate with.
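
A rough way to surface new trust boundaries is to list the hostnames that appear in added lines and compare them with the ones you already expect. The sketch below assumes unified diff text as input; the allow-list is something you maintain yourself, and the function names are hypothetical. It only sees literal URLs, so hosts assembled at runtime or hidden inside an SDK will not show up.

```python
import re
from typing import Iterable, Set
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s'\"<>)]+")


def hosts_in_added_lines(diff_text: str) -> Set[str]:
    """Collect hostnames from literal URLs on added lines of a unified diff."""
    hosts = set()
    for line in diff_text.splitlines():
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for url in URL_RE.findall(line):
            host = urlparse(url).hostname
            if host:
                hosts.add(host)
    return hosts


def unexpected_hosts(diff_text: str, allowed: Iterable[str]) -> Set[str]:
    """Hosts introduced by the change that are not on the operator's allow-list."""
    return hosts_in_added_lines(diff_text) - set(allowed)
```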

Did the change introduce any user-supplied data into code that executes (rather than just displays)? Look for places where input data is passed to functions like eval, dynamic imports, or system commands. AI agents sometimes write code that's vulnerable to injection without realizing it.
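
For Python code, the standard `ast` module can find these calls without running anything. This is a sketch under stated assumptions: the list of risky names is illustrative and deliberately short, it only covers Python source, and it flags every such call whether or not user input actually reaches it. A hit means "read this line carefully," not "this is exploitable."

```python
import ast
from typing import List, Tuple

RISKY_NAMES = {"eval", "exec", "__import__"}
RISKY_ATTRS = {("os", "system"), ("os", "popen")}


def risky_calls(source: str) -> List[Tuple[int, str]]:
    """Return (line number, description) for calls worth a closer look."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Name) and func.id in RISKY_NAMES:
            findings.append((node.lineno, func.id))
        elif isinstance(func, ast.Attribute) and isinstance(func.value, ast.Name):
            pair = (func.value.id, func.attr)
            if pair in RISKY_ATTRS:
                findings.append((node.lineno, ".".join(pair)))
            # subprocess.* with shell=True hands a command string to the shell.
            elif func.value.id == "subprocess" and any(
                kw.arg == "shell"
                and isinstance(kw.value, ast.Constant)
                and kw.value.value is True
                for kw in node.keywords
            ):
                findings.append(
                    (node.lineno, f"subprocess.{func.attr}(shell=True)")
                )
    return sorted(findings)
```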

Did the change modify authentication, authorization, or access control? These are the highest-stakes security surfaces. Changes here should be reviewed carefully or, ideally, generated against explicit specifications rather than ad-hoc prompts.

External research has consistently shown that AI-generated code introduces security issues at meaningful rates, with operators sometimes bypassing established protocols when they assume AI output is safe. The audit dimension exists to counter that assumption.

Dimension three: dependencies

What dependencies did the change introduce? Each new dependency is a trust commitment, a maintenance obligation, and a potential security surface.

Operator-level checks:

Look at the changes to package.json, requirements.txt, Cargo.toml, or equivalent. New entries are new dependencies.
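
If reading the manifest diff by eye is error-prone, the comparison can be scripted. The sketch below handles the `requirements.txt` case only and assumes you have the file contents from before and after the change; the function names are hypothetical, and the same idea applies to other manifests with a different parser.

```python
import re
from typing import Set


def package_names(requirements_text: str) -> Set[str]:
    """Extract normalized package names from requirements.txt content."""
    names = set()
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            # PyPI ignores case and treats runs of ".", "_", "-" as equivalent.
            names.add(re.sub(r"[-_.]+", "-", match.group(0)).lower())
    return names


def new_dependencies(before: str, after: str) -> Set[str]:
    """Names present after the change that were absent before it."""
    return package_names(after) - package_names(before)
```

A version bump on an existing package does not appear here, since only the names are compared.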

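For each new dependency, verify it's a real package from a reputable source. AI agents have been known to invent package names that don't exist. Usually that fails at install time, but an invented name can also resolve to a package someone else has registered under it, so a successful install is not proof of legitimacy. Agents also recommend packages with low download counts and uncertain maintenance, which work but introduce risk.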

For each new dependency, check whether it's actually needed. AI sometimes adds dependencies for trivial functionality that could be done without them. Each dependency is a maintenance burden, and trivial functionality rarely justifies one.

For dependency upgrades (not new additions), check the version delta. A small patch upgrade is usually safe; a major version upgrade often has breaking changes. AI agents sometimes recommend major version upgrades without flagging the breaking-change risk.
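
The version delta can be classified mechanically. This is a minimal sketch that assumes the package follows `MAJOR.MINOR.PATCH` semantic versioning; the helper names are hypothetical. Two caveats: not every package follows that scheme, and under semantic versioning a `0.x` release may break on a minor bump.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse 'MAJOR.MINOR.PATCH' (optional 'v' prefix); missing parts become 0."""
    # Drop pre-release ("-beta.1") and build ("+abc") suffixes before parsing.
    core = version.lstrip("v").split("-", 1)[0].split("+", 1)[0]
    parts = [int(p) for p in core.split(".")[:3]]
    while len(parts) < 3:
        parts.append(0)
    return parts[0], parts[1], parts[2]


def upgrade_kind(old: str, new: str) -> str:
    """Classify a version change as none, downgrade, major, minor, or patch."""
    o, n = parse_version(old), parse_version(new)
    if n == o:
        return "none"
    if n < o:
        return "downgrade"
    if n[0] != o[0]:
        return "major"
    if n[1] != o[1]:
        return "minor"
    return "patch"
```

Anything classified as `major` is a cue to read the package's changelog before accepting the upgrade.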

Dimension four: performance

Did the change introduce performance regressions? This dimension matters less for small changes but more for changes to high-traffic paths or critical operations.

Operator-level checks:

For changes to operations that handle data, check whether the work grows with the size of the data. A change that loads all records into memory works fine in development with 50 records and fails in production with 50,000.
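
The pattern to look for is "fetch everything, then process" versus "process in bounded batches." The sketch below shows the second shape against SQLite; the `orders` table, its columns, and the function names are hypothetical. The point is that memory use depends on `batch_size` rather than on the number of rows.

```python
import sqlite3
from typing import Iterator, List, Tuple


def iter_batches(
    conn: sqlite3.Connection, batch_size: int = 1000
) -> Iterator[List[Tuple]]:
    """Yield rows in fixed-size batches instead of loading them all at once."""
    cursor = conn.execute("SELECT id, total FROM orders ORDER BY id")
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            return
        yield batch


def sum_totals(conn: sqlite3.Connection, batch_size: int = 1000) -> float:
    """Aggregate over every row while holding at most one batch in memory."""
    running = 0.0
    for batch in iter_batches(conn, batch_size):
        running += sum(total for _id, total in batch)
    return running
```

In a diff, a call like `fetchall()` or an unbounded "select all" on a table that grows in production is the thing to question.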

For changes to operations that make network calls, check whether the change is calling APIs efficiently. Loops that make sequential API calls when a batch call exists are slower than necessary.
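
To make the difference concrete, the sketch below uses a fake client that counts round trips instead of making them. `FakeUserApi`, `get_user`, and `get_users` are invented for illustration; whether a batch endpoint exists, and what its size limit is, depends on the vendor's actual API.

```python
from typing import Dict, Iterable, List


class FakeUserApi:
    """Stand-in for a hypothetical vendor client; counts calls only."""

    def __init__(self) -> None:
        self.calls = 0

    def get_user(self, user_id: int) -> Dict:
        self.calls += 1
        return {"id": user_id}

    def get_users(self, user_ids: Iterable[int]) -> List[Dict]:
        self.calls += 1
        return [{"id": uid} for uid in user_ids]


def fetch_sequential(api: FakeUserApi, ids: Iterable[int]) -> List[Dict]:
    """One round trip per record: the shape to question in a diff."""
    return [api.get_user(uid) for uid in ids]


def fetch_batched(
    api: FakeUserApi, ids: Iterable[int], chunk_size: int = 100
) -> List[Dict]:
    """One round trip per chunk of records."""
    id_list = list(ids)
    users: List[Dict] = []
    for start in range(0, len(id_list), chunk_size):
        users.extend(api.get_users(id_list[start:start + chunk_size]))
    return users
```

Both functions return the same data; only the number of round trips differs, and with real network latency that count is what dominates the runtime.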

For changes that affect rendering or page load, verify the change doesn't introduce noticeable visual lag. Open the page, observe the load time, compare to before.

The performance dimension is often skipped because the test takes longer than the change took to make. The skip is fine for prototype-grade work and expensive for production.

Dimension five: accessibility

Did the change introduce accessibility issues? Operators often skip this dimension because they don't know what to look for. The checks are actually straightforward.

Operator-level checks:

Can the changed flow be operated by keyboard only? Try it. Tab through the flow. Verify focus is visible at every step.

Do new interactive elements have appropriate labels for screen readers? Inspect the HTML for ARIA attributes, label elements, alt text on images.
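
Part of this inspection can be scripted with Python's built-in HTML parser. The sketch below is a rough first pass with hypothetical names, not a substitute for a real accessibility checker: it flags images with no `alt` attribute and form fields that have neither an ARIA label nor a `<label for="...">` pointing at them. It does not recognize a field wrapped inside its `<label>`, so expect some false positives.

```python
from html.parser import HTMLParser
from typing import List


class LabelAudit(HTMLParser):
    """Collect images without alt text and form fields for a label check."""

    def __init__(self) -> None:
        super().__init__()
        self.problems: List[str] = []
        self.labelled_ids = set()
        self.fields = []  # (tag, attributes) for each visible form field

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "img" and "alt" not in a:
            self.problems.append(f"img without alt: {a.get('src', '?')}")
        elif tag == "label" and a.get("for"):
            self.labelled_ids.add(a["for"])
        elif tag in ("input", "select", "textarea") and a.get("type") != "hidden":
            self.fields.append((tag, a))


def audit_html(html: str) -> List[str]:
    """Return human-readable findings for a fragment of HTML."""
    parser = LabelAudit()
    parser.feed(html)
    parser.close()
    problems = list(parser.problems)
    for tag, a in parser.fields:
        named = (
            a.get("aria-label")
            or a.get("aria-labelledby")
            or a.get("id") in parser.labelled_ids
        )
        if not named:
            ident = a.get("id") or a.get("name") or "?"
            problems.append(f"{tag} without label: {ident}")
    return problems
```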

Does the change maintain color contrast for text? Browser developer tools include accessibility inspectors that report the contrast ratio between text and its background. Select the text element and read the ratio.
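
Contrast can also be computed directly from two hex colors. The sketch below follows the WCAG 2.x relative-luminance and contrast-ratio formulas; the function names are my own, and it handles six-digit hex colors only. WCAG level AA asks for at least 4.5:1 for normal text and 3:1 for large text.

```python
def _channel(value: int) -> float:
    """Linearize one 0-255 sRGB channel as defined in WCAG 2.x."""
    c = value / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a '#rrggbb' color (0.0 is black, 1.0 is white)."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _channel(r) + 0.7152 * _channel(g) + 0.0722 * _channel(b)


def contrast_ratio(foreground: str, background: str) -> float:
    """WCAG contrast ratio, from 1.0 (identical) up to 21.0 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa(foreground: str, background: str, large_text: bool = False) -> bool:
    """True if the pair meets the WCAG AA threshold for the given text size."""
    return contrast_ratio(foreground, background) >= (3.0 if large_text else 4.5)
```

Small differences matter near the threshold: on a white background, `#767676` passes AA for normal text and the barely lighter `#777777` does not.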

Does the change work on mobile devices? Resize the browser window or actually test on a phone. Many AI-generated layouts default to desktop assumptions and break on small screens.

Accessibility issues compound: each unaddressed issue makes the app less usable for some users. The audit catches them before shipping rather than after complaints arrive.

The three-leg promotion gate

The five dimensions become operationally useful when they're integrated into a structured promotion gate. The gate is what runs before any change ships to production. It has three legs:

Leg one: per-change audit. Every change goes through a focused audit covering the dimensions relevant to that change. A backend change might emphasize correctness, security, and dependencies; a frontend change might emphasize correctness, performance, and accessibility. The audit is scoped to what the change actually affects.

Leg two: batch audit. When multiple changes are accumulating toward a release, a batch audit checks for interaction effects. Two changes that each passed their per-change audit might interact badly. The batch audit looks for the interactions.

Leg three: pre-deploy audit. Before the changes actually ship, a final audit verifies the complete state. Build artifacts are correct. Deployment configuration matches expectations. Rollback paths are verified. The pre-deploy audit is the last gate before production effects.
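
The gate logic itself is small enough to write down. This is one possible encoding, not a prescribed implementation: the leg names, the `GateResult` type, and the "list of open findings per leg" input are assumptions made for illustration. The rule it encodes is that a change is promoted only when all three legs have run and none has an open finding.

```python
from dataclasses import dataclass, field
from typing import Dict, List

LEGS = ("per_change", "batch", "pre_deploy")


@dataclass
class GateResult:
    passed: bool
    blockers: List[str] = field(default_factory=list)


def promotion_gate(findings_by_leg: Dict[str, List[str]]) -> GateResult:
    """Evaluate the gate; a leg missing from the input counts as not run."""
    blockers: List[str] = []
    for leg in LEGS:
        if leg not in findings_by_leg:
            blockers.append(f"{leg}: not run")
        else:
            blockers.extend(f"{leg}: {item}" for item in findings_by_leg[leg])
    return GateResult(passed=not blockers, blockers=blockers)
```

Treating a skipped leg as a blocker is the design choice that matters here: the gate fails closed, so an audit that never ran cannot be mistaken for an audit that passed.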

This is the same pattern documented in "How to vibe code a production landing page" and integrated into the four-layer enforcement framework. The three-leg gate is what makes the framework operationally repeatable.

The operator's audit checklist

Consolidated as a checklist a non-engineer can actually run:

Correctness. Open the app. Test the flow. Verify visible behavior and underlying state both match what was requested.

Security. Scan the diff for new credentials, new external network access, user-input executable surfaces, and changes to auth/access control.

Dependencies. Inspect new dependencies for legitimacy and necessity. Check version deltas for breaking-change risk.

Performance. Test changes to data-handling, API-calling, or rendering operations under realistic conditions.

Accessibility. Keyboard test, screen reader inspection, color contrast check, mobile verification.

The checklist runs in 15-30 minutes for most changes. The investment is bounded; the prevention is meaningful. Most "we shipped something and discovered a problem in production" cases would have been caught by this checklist run before shipping.

When to engage external review

The audit framework is what an operator can do. There are cases where external review adds value:

The change is in a security-critical surface (authentication, payments, user data) and the stakes justify expert eyes.

The change affects compliance-relevant surfaces (data privacy, regulated industries, audit trails) where the consequences of getting it wrong include legal exposure.

The change is in a domain you don't have expertise in (a non-engineer operator shipping a database migration, for example) and you want someone who's seen this kind of work before.

The change is part of a larger system where you're not confident about the integration effects.

External review costs are bounded and known upfront. The cost of skipping external review on high-stakes changes is unbounded: most of the time nothing goes wrong, and occasionally the failure is severe.

What this means for current AI-assisted work

If you're shipping AI-generated code without a structured audit, the audit is the highest-leverage change you can make to your workflow. The checklist runs in under an hour. It catches the most consequential classes of failure. It doesn't require engineering expertise; it requires operator discipline.

If you're already running some form of audit but skipping dimensions, identify which ones get skipped and why. Often the skip is because the operator doesn't know what to check; the operator-level checks above give you something specific to run.

The shift is from "trust the AI's output" to "verify the AI's output along these specific dimensions." The verification is bounded. The discipline is what makes AI-generated code genuinely shippable to production.

If you're shipping AI-generated code without a structured audit and want help installing the gate, send the workflow you're using, the kinds of changes you're shipping, and the production surface. VibeKoded can scope a rescue diagnostic, stabilization sprint, or rebuild plan. → Work with VibeKoded


