Every latency budget I’ve ever written on a design doc has quietly rotted in a corner of the wiki. The one that stuck wasn’t clever — it was a single p99 number, pinned to a single endpoint, owned by a single team, and enforced by a single dashboard nobody could silently delete. This post is the story of why that one stuck when the others didn’t.
The ones that didn’t stick
Every team I’ve joined has had a latency budget somewhere. Usually in a design doc from 18 months ago, three owners ago, titled something like “Performance SLOs, v2”. It typically had:
- A table of 40 endpoints, each with a p50/p95/p99/p99.9 target.
- A “service-level objective” row per service.
- A rollup “product-level experience” metric.
- A note at the bottom that said “these are targets, not guarantees.”
Every row was reasonable. Nobody could argue with any individual number. And yet every time I ran a query against the actual numbers in production, we were breaching half of them quietly — no alerts, no follow-up, no conversation. The budget was decorative.
The failure mode is not that the numbers were wrong. The failure mode is that nobody owned any of them loudly enough. A number that’s everybody’s problem is nobody’s problem.
The one that stuck
On one particular platform — one of those event-driven modernization efforts I did with a multi-squad team — we kept trying to write The Proper SLO Document. It never survived a sprint. Eventually we gave up and did something that felt too small to work.
We picked one endpoint. The one everybody had an opinion about — the join-a-match flow in a real-time game backend. We wrote one number on the whiteboard: p99 < 250ms. We put one person’s name next to it. We created one dashboard URL. We pinned the dashboard URL in the team’s Slack channel and added it to the on-call handoff template.
That was the whole budget. One number. One endpoint. One owner. One dashboard.
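If you like artifacts, the whole thing fits on one screen. Here’s a minimal sketch of what the budget looks like written down as code; the names and URL are stand-ins for whatever yours would be:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LatencyBudget:
    endpoint: str         # the one endpoint
    p99_budget_ms: float  # the one number
    owner: str            # the one name that gets pinged
    dashboard_url: str    # the one URL pinned in the channel

# Hypothetical values mirroring the story above.
BUDGET = LatencyBudget(
    endpoint="join-a-match",
    p99_budget_ms=250.0,
    owner="julia",
    dashboard_url="https://grafana.example.com/d/join-a-match-p99",
)
```

The point is not the code. The point is that there is exactly one of everything, and it all fits in one place.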
It stuck for three years. We missed it, we recovered, we breached it again, we fixed it again. The number moved (first down to 180ms, then back up to 220ms when we added a feature that was worth the cost). But it was always on the agenda — because there was only one of it, and everybody knew who to ping.
Why the simpler version won
Four reasons, in order of how often I underestimated them:
1. Cognitive load. Forty numbers are trivial for a machine to track and more than any human can sustain attention on. One number is on a sticker on someone’s laptop. Forty numbers are in a table in a document in a folder in a wiki.
2. Ownership clarity. “The platform team owns it” is not ownership. It’s bystander effect with a compliance layer. “Julia owns p99 on join-a-match” means Julia gets pinged. Julia either fixes it or explicitly hands it off. The number has a face.
3. Alerting fit. You can set up a good alert for one number (a sketch of what that can look like follows this list). You cannot set up a good alert for forty correlated numbers: you either over-fire or under-fire, and you won’t know which until the pager is already a folk tax. (See my last post on why folk-taxing a pager kills your rotation.)
4. Social proof during breaches. When the number breaches, people recognise the number. “p99 on join-a-match just popped to 380” is a sentence people nod at. “The platform latency SLO is yellow in quadrant 3 of the Grafana dashboard” is a sentence that hides. You cannot mobilise around a sentence that hides.
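For what it’s worth, the one alert doesn’t need an alerting framework. Here’s a hedged sketch of the shape it can take, assuming a Prometheus-style metrics backend and a Slack webhook; the metric name, label, and URLs are all hypothetical:

```python
import requests

PROM_URL = "https://prom.example.com/api/v1/query"      # hypothetical
SLACK_WEBHOOK = "https://hooks.slack.com/services/..."  # hypothetical
P99_BUDGET_MS = 250.0

# p99 over the last 5 minutes for the one endpoint
# (metric and label names are placeholders).
QUERY = (
    'histogram_quantile(0.99, sum(rate('
    'http_request_duration_seconds_bucket{handler="join_match"}[5m]'
    ')) by (le))'
)

def check_the_number() -> None:
    resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    if not result:
        return  # no data is a different alert's problem
    p99_ms = float(result[0]["value"][1]) * 1000  # seconds to ms
    if p99_ms > P99_BUDGET_MS:
        requests.post(SLACK_WEBHOOK, json={
            "text": f"p99 on join-a-match just popped to {p99_ms:.0f}ms "
                    f"(budget: {P99_BUDGET_MS:.0f}ms)"
        }, timeout=10)

if __name__ == "__main__":
    check_the_number()  # run from cron every few minutes
```

One threshold, one message, one channel. When it fires, everybody already knows what it means.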
The question that fixed our meeting hygiene
After we adopted the one-number approach, we kept a single recurring question in every weekly: “Is the number red, yellow, or green?”
- Green: skip it. No status theatre.
- Yellow: the owner says one sentence about what they’re doing.
- Red: we drop the agenda and make a plan.
This replaced maybe 30 minutes of “performance status updates” per week with a 60-second check-in. It also meant that when things were actually bad, we noticed within a week instead of within a quarter.
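We never wrote the bands down formally, but if you want the check to be mechanical instead of vibes, it’s a few lines. The 10% yellow margin below is a made-up threshold; pick one that matches how fast your number moves:

```python
def traffic_light(p99_ms: float, budget_ms: float,
                  yellow_margin: float = 0.10) -> str:
    """Classify the number for the weekly check-in.

    The 10% margin is a hypothetical choice: yellow means under budget
    but close enough that the owner should say a sentence about it.
    """
    if p99_ms > budget_ms:
        return "red"     # drop the agenda, make a plan
    if p99_ms > budget_ms * (1 - yellow_margin):
        return "yellow"  # owner says one sentence
    return "green"       # skip it, no status theatre

# e.g. traffic_light(231.0, 250.0) -> "yellow"
```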
What the one number is not
- It’s not the whole picture. There are dozens of other metrics you still care about. They live in dashboards, in alerts, in post-mortems. The one number is the spine; everything else is muscle.
- It’s not a target for every endpoint. 90% of your endpoints don’t deserve a budget. A budget is expensive to maintain — give it to the ones that actually matter to the product, then forget about the rest until they show up in your error logs.
- It’s not static. You will raise and lower it. Both are fine. A budget that never moves is a budget nobody’s looking at.
How to start
If you’re considering writing a proper performance SLO document right now: don’t. Instead, tonight, before you forget:
- Pick the single most user-visible endpoint in your product.
- Pull its p99 for the last 30 days (one query in most metrics backends; see the sketch after this list).
- Round up. That’s your starting number.
- Put one person’s name next to it. Probably yours.
- Pin the dashboard URL in your team’s most-used channel.
- Add “what’s the number?” to your weekly.
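If your metrics live in a Prometheus-style backend, pulling the starting number is one query. A sketch, with placeholder metric and endpoint names; rounding up to the next 25ms is an arbitrary choice that just makes the number easier to say out loud:

```python
import math
import requests

PROM_URL = "https://prom.example.com/api/v1/query"  # hypothetical

# p99 over the last 30 days for your most user-visible endpoint
# (metric and label names are placeholders).
QUERY = (
    'histogram_quantile(0.99, sum(rate('
    'http_request_duration_seconds_bucket{handler="YOUR_ENDPOINT"}[30d]'
    ')) by (le))'
)

resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=30)
resp.raise_for_status()
p99_ms = float(resp.json()["data"]["result"][0]["value"][1]) * 1000

# Round up to the next 25ms: 218.4 becomes 225. That's your starting number.
budget_ms = math.ceil(p99_ms / 25) * 25
print(f"observed 30-day p99: {p99_ms:.1f}ms, starting budget: {budget_ms}ms")
```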
That’s week one. Week two, you either notice the number is wrong, or you notice it’s fine. Either way, you now own it. That’s further than most SLO documents ever get.
Peacock moment
I’ll say the quiet part: this approach also made me look very good very fast. “Mo cut our p99 by 63%” is a headline my manager could say out loud in a performance review. “Mo maintained the SLO document” is not. Pick the quantified number, claim the quantified improvement. Your manager cannot promote a document.