Averages hide what your users feel
I’ve worked on product teams at startups and large companies, and one pattern keeps showing up. Users complain that the product is slow. Engineering pulls up a dashboard, shows the average response time, and the number looks fine. 200 milliseconds. Someone says “must be a one-off” or “probably their network” and the meeting moves on. Then the same complaints come back the next week, and the week after that, and nobody can figure out why because the metrics say the system is healthy.
The metrics aren’t lying exactly, but they’re hiding something. Average latency tells you how the system performs in aggregate, and that sounds useful until you think about what it actually smooths over. If 95% of your requests come back in under 100 milliseconds and 5% take 3 seconds, the average still looks great. But nobody uses your product in aggregate. Every user gets one experience, and for a meaningful percentage of them, that experience is terrible. They don’t know they’re an outlier. They just know your product is slow.
Statistics is one of those things that’s easy to get wrong, even for really smart people. It was actually one of the more useful things I got out of my education, and it shows up in engineering conversations more than you’d expect. The average is the most misleading statistic of all when it comes to performance, because it hides the distribution behind a single comfortable number. And even the median, which a lot of teams use instead, has the same problem. It tells you what the typical user experiences, but it tells you nothing about the tail. The real picture lives in the percentiles. P50 is the median. P95 is the threshold the slowest 1 in 20 requests exceed. P99 is the line the slowest 1 in 100 cross.
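To make the gap concrete, here’s a minimal Python sketch using the same 95%/5% split from the example above. The numbers and the nearest-rank percentile helper are purely illustrative, not taken from any particular monitoring stack:

```python
import statistics

# Hypothetical sample matching the example above: 95% of requests
# finish in ~100 ms, 5% take ~3000 ms.
latencies_ms = [100] * 95 + [3000] * 5

def percentile(values, p):
    """Nearest-rank percentile: the smallest value that at least
    p% of the sample falls at or below."""
    ordered = sorted(values)
    k = max(0, -(-len(ordered) * p // 100) - 1)  # ceil(n * p / 100) - 1
    return ordered[int(k)]

print(statistics.mean(latencies_ms))   # 245.0 — the "healthy" dashboard number
print(percentile(latencies_ms, 50))    # 100 — the median looks healthy too
print(percentile(latencies_ms, 95))    # 100
print(percentile(latencies_ms, 99))    # 3000 — the tail the average hides
```

The mean and the median both say the system is fast; only P99 shows the 5% of users waiting three seconds.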
Think about it like an airport security line. Most people get through in five minutes. But every now and then someone triggers a bag search or sets off the scanner and they’re stuck for twenty minutes. The average wait time might be seven minutes, which sounds fine in a report. But if you’re the person behind the bag search, your experience is twenty minutes. And airports care deeply about this, because if 5% of passengers wait twenty minutes, thousands of people miss flights. You don’t optimize an airport for the average passenger. You optimize it so that nobody misses their flight.
Software is the same way. When we actually started looking at percentiles on a system I was working on, the picture was completely different from what the averages suggested. Average: 200ms. P95: 1.2 seconds. P99: 8 seconds. Pmax: 25 seconds. The dashboard had been telling us everything was fine while some users were waiting nearly half a minute for a page to load. And when you do the math at scale, even “1 in 100” is not rare. If you’re handling 100,000 requests per minute, that’s 1,000 users every minute having a genuinely bad experience. Sixty thousand an hour. Those users don’t file support tickets. They just leave.
And P95 is not enough. At real scale, even P99 starts to matter in absolute numbers. If you have a million requests a day and your P99 is 8 seconds, that’s 10,000 users a day staring at a screen wondering if your product is broken. And somewhere in there, someone is waiting 25 seconds. If you’re building something that millions of people use, you have to look at the full distribution, because the average and even the median are traps that make you feel safe while a meaningful chunk of your users are having a terrible time.
This gets even worse in modern architectures where a single user request touches multiple services. Your API hits a database, checks a cache, calls an auth service, maybe talks to a third-party provider. Each of those has its own latency distribution, its own tail. If Service A has a P95 of 500 milliseconds and Service B also has a P95 of 500 milliseconds, the user’s experience isn’t 500 milliseconds. It’s worse, because the request needs both to finish, and the probability of hitting the tail of at least one goes up with every service in the chain. The math compounds against you.
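That compounding is easy to quantify under a simplifying assumption: treat each service as an independent coin flip with a 5% chance of landing in its tail. Real services aren’t perfectly independent (slowdowns often correlate), so this sketch is a lower-bound intuition, not a model of any particular system:

```python
# Probability that a request hits the P95 tail of at least one service,
# assuming n independent services each with a 5% chance of a slow call.
def tail_hit_probability(n_services, tail_fraction=0.05):
    return 1 - (1 - tail_fraction) ** n_services

for n in (1, 2, 5, 10):
    print(f"{n} services: {tail_hit_probability(n):.1%} of requests hit a tail")
# 1 service:  5.0%
# 2 services: 9.8%
# 5 services: 22.6%
# 10 services: 40.1%
```

With ten services in the request path, two in five requests touch at least one service’s worst 5%. “P95” stops being a 1-in-20 event the moment you chain services together.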
None of this is earth-shattering. Engineers who work on performance know about percentiles. But I’ve found that the connection between percentile metrics and user experience often gets lost in the day-to-day. The dashboard shows a number, the number looks fine, and people move on. As a PM, part of your job is to champion the user experience, and one of the most practical ways to do that when working with engineering on performance problems is to push the conversation past the average and into the long tail. Ask what the P95 looks like. Ask about P99. Ask what happens to the users at the edges. Because averages describe the system, but percentiles describe what your users actually live through. And the gap between those numbers is where your product is quietly failing.
See it for yourself
Plug in your numbers. Percentages become people.
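A tiny sketch of that math, with hypothetical traffic numbers — swap in your own request volume and the percentiles your dashboard reports:

```python
# "Percentages become people": convert tail fractions into daily head counts.
# The traffic number is hypothetical — plug in your own.
requests_per_day = 1_000_000

for label, fraction in [("P95", 0.05), ("P99", 0.01), ("P99.9", 0.001)]:
    affected = int(requests_per_day * fraction)
    print(f"beyond {label}: {affected:,} requests a day")
# beyond P95: 50,000 requests a day
# beyond P99: 10,000 requests a day
# beyond P99.9: 1,000 requests a day
```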