Why your Lighthouse score keeps changing on mobile

I ran the same build through Lighthouse twice in a row and got two different scores. Sixty-eight, then fifty-eight, on the same machine, with nothing changed between the runs except that I had clicked the button again. Ten points is not rounding error. Ten points is the difference between "ship it" and "something is wrong," and I had just watched a single build be on both sides of that line in under a minute.

the number is not the verdict

The instinct, when a score comes back low, is to treat it as a grade and start fixing whatever the report points at. On a vibe-coded site the report tends to point at the most visible thing on the page. The hero animation. The big background image. The element that looks expensive, because it looks expensive. So you optimize that. The score moves a little, or it does not move at all, and now you genuinely cannot tell whether your change did anything or whether you just caught a friendlier run. You are tuning against an instrument that disagrees with itself, and every decision you make on top of it inherits the noise.

the scary thing is usually not the bottleneck

On one site the obvious suspect was a heavy animated background, a fragment shader doing real work every frame. It looked like the bottleneck precisely because it was the most complicated thing on the page, and complicated reads as slow. The first time I actually measured instead of guessed, the trace put the cost somewhere else entirely. Roughly 1,715 milliseconds of main-thread JavaScript was blocking the page before it could respond to a tap, while the shader, the scary one, was nowhere near the top of the cost list. If I had spent that afternoon optimizing the shader I would have bought nothing, because I would have been sharpening a thing that was not dull. The visible suspect and the actual cause were not the same element, and only the measurement could tell them apart.

why the score will not sit still

Mobile lab scores are bimodal, and once you know that, the swing stops being mysterious. A phone-class CPU under throttling lands in one of two rough clusters depending on thermal state, background contention, and how the run happened to schedule its work. A single Lighthouse run samples one cluster. Run it again and you can land in the other one. That is the 68 then 58: not two different builds, one build sampled twice from a distribution. The composite score makes it worse, because it compresses several metrics into one number, and then that one number rides whichever cluster you caught. So the question "which Lighthouse score do I trust" has a real answer, and the answer is none of them, taken alone.

measure the thing that actually moves

Two changes fix the instrument. First, stop reading the single composite number and gate on Total Blocking Time. TBT adds up the time by which each main-thread task runs past 50 milliseconds between First Contentful Paint and Time to Interactive. That makes it a direct reading of main-thread work, which is the layer that was actually holding the page down, so it points at the real problem instead of averaging it into a grade. Second, take the median of several runs. Five is enough. One run is a sample, not a measurement, and treating a sample like a measurement is the whole reason the number felt random. Once the gate is TBT and the value is a median of five, the instrument steadies, and a change that helps shows up as a real move instead of a coin flip you talk yourself into believing.
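Here is a minimal sketch of that gate, assuming you have saved five Lighthouse JSON reports for the same build into one folder. The function names, the folder layout, and the 300 millisecond budget are all hypothetical, picked for illustration. The one thing taken from Lighthouse itself is where the report keeps the value: audits, then total-blocking-time, then numericValue, in milliseconds.

```python
import json
import statistics
from pathlib import Path

# Hypothetical budget for illustration only. Pick your own.
TBT_BUDGET_MS = 300.0


def tbt_ms(report: dict) -> float:
    """Read Total Blocking Time (milliseconds) out of one Lighthouse JSON report."""
    return float(report["audits"]["total-blocking-time"]["numericValue"])


def load_reports(directory: str) -> list[dict]:
    """Load every *.json report in a folder (assumed: one file per run)."""
    paths = sorted(Path(directory).glob("*.json"))
    return [json.loads(path.read_text()) for path in paths]


def median_tbt(reports: list[dict]) -> float:
    """Median TBT across runs, so one unlucky run cannot decide the result."""
    return statistics.median([tbt_ms(report) for report in reports])


def passes_gate(reports: list[dict], budget_ms: float = TBT_BUDGET_MS,
                min_runs: int = 5) -> bool:
    """Gate on the median of at least min_runs runs, never on a single run."""
    if len(reports) < min_runs:
        raise ValueError(f"need at least {min_runs} runs, got {len(reports)}")
    return median_tbt(reports) <= budget_ms
```

One way to produce the input is to run the Lighthouse CLI five times with --output=json and a different --output-path each time, then call passes_gate(load_reports("runs")), where "runs" is whatever folder you wrote to. The median is the design choice that matters here: with five runs, two bad-cluster samples still cannot drag the result.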

broken on mobile and the swinging score are the same bug

These feel like two separate complaints. They have one root. Both come from reading noise as signal and optimizing a layer that is not the bottleneck. The generator builds and previews against the desktop viewport sitting in the IDE, where the main thread is fast, the network is local, and the CPU never throttles. "Looks great here" gets measured on the one device that will never be the problem, while your visitors are on phones. That is the same mechanism behind AI-generated websites that load slowly, and it explains most of what it takes to fix one that is not mobile-friendly: the build was authored and judged on hardware that flatters it. The fix for both is one move. Diagnose before you fix. Measure which layer costs, on the device that actually matters, with an instrument that is not itself noise, before you touch a single line.

the failure mode to watch

The expensive mistake is chasing one run. You see a single bad number, you panic, and you defer the hero animation until after Largest Contentful Paint to buy a "mobile win." You ship it. The next run was going to be fine anyway, because the bad one was the unlucky cluster. Now you have a dead scroll position or a flash of motion that fires late, traded for a number that was never real to begin with. Deferring real work to game a metric you measured exactly once is how you make the site worse while the dashboard says better. The thing about vibe coding and performance is that the feedback loop lies to you by default, and a confident wrong number is more dangerous than no number, because you act on it.
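One way to keep yourself from chasing a single run is to refuse to call a change a win until it clears the noise. The sketch below is a deliberately conservative heuristic, not a statistical test, and every name in it is hypothetical: it takes TBT readings in milliseconds from several runs before and after a change, and only reports an improvement when the typical run after the change beats the best run from before it.

```python
import statistics


def summarize(runs_ms: list[float]) -> dict:
    """Median plus the best and worst run, so the spread stays visible."""
    ordered = sorted(runs_ms)
    return {
        "median": statistics.median(ordered),
        "best": ordered[0],    # lowest TBT seen
        "worst": ordered[-1],  # highest TBT seen
    }


def looks_like_real_improvement(before_ms: list[float], after_ms: list[float],
                                min_runs: int = 5) -> bool:
    """Lower TBT is better. Conservative rule of thumb: the median after
    the change must beat the best single run before it. If it does not,
    the difference could just be which cluster the runs landed in."""
    if len(before_ms) < min_runs or len(after_ms) < min_runs:
        raise ValueError(f"need at least {min_runs} runs on each side")
    return summarize(after_ms)["median"] < summarize(before_ms)["best"]
```

The rule is stricter than it needs to be on purpose. It will sometimes miss a small real gain, but it will not let one friendly run talk you into shipping a change that did nothing.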

what to actually do

The moves are small, and they all point in one direction: measure first, optimize second.

Take the median of five runs, never one. A single run is a sample from a noisy distribution, not a reading you can build on.

Gate on Total Blocking Time for mobile, not the composite score. The composite hides which layer is the problem; TBT names it.

Open the trace and find the long tasks before you optimize anything. The scariest-looking element on the page is rarely where the time goes, and auditing the build before you act on it is cheaper than optimizing the wrong thing twice.

Measure on a throttled phone profile, not the desktop the generator previewed against. The machine the build was authored on is the one machine that is never slow.

Only after you know the bottleneck, optimize it. A fix aimed at the wrong layer is worse than no fix, because it costs you something real and buys you nothing.
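To make "find the long tasks" concrete, here is a small sketch of the arithmetic behind blocking time. It assumes you have already copied task durations out of a trace by hand; the Task type and the labels in the usage note are hypothetical. The 50 millisecond threshold is the standard definition of a long task. Lighthouse only counts tasks between First Contentful Paint and Time to Interactive, and this sketch ignores that window, so treat its total as a rough ranking aid rather than a reproduction of the reported TBT.

```python
from dataclasses import dataclass

# A main-thread task longer than 50 ms counts as a long task.
LONG_TASK_THRESHOLD_MS = 50.0


@dataclass(frozen=True)
class Task:
    label: str          # hypothetical: whatever you call it in your notes
    duration_ms: float  # duration read off the trace


def blocking_ms(task: Task) -> float:
    """Only the part of a task past the 50 ms threshold counts as blocking."""
    return max(0.0, task.duration_ms - LONG_TASK_THRESHOLD_MS)


def rank_long_tasks(tasks: list[Task]) -> list[Task]:
    """Long tasks only, worst offender first."""
    long_tasks = [t for t in tasks if t.duration_ms > LONG_TASK_THRESHOLD_MS]
    return sorted(long_tasks, key=blocking_ms, reverse=True)


def total_blocking_ms(tasks: list[Task]) -> float:
    """Sum of the blocking portions across all tasks."""
    return sum(blocking_ms(t) for t in tasks)
```

Feed it something like Task("bundle parse", 900), Task("hydration", 450), and Task("shader frame", 12), and the shader drops out of the ranking entirely, because a 12 millisecond task blocks nothing. That is the same shape as the trace described earlier: the scary element was cheap per task, and the cost sat in the JavaScript.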

If you're a vibe coder scaling personal site work into client builds, and the mobile numbers will not hold still long enough to tell you what to fix, and you want a sparring partner on the production jump, /work-with-us.

The score was never the point. It is a proxy for whether a person holding a phone gets something usable before they give up and close the tab. Measure that, on the device they are actually holding, with an instrument steady enough to trust, and the score stops being something you chase and starts being something that follows.