Designing SLOs That Survive Contact With Production

Most SLOs are written once, pasted into a wiki, and never looked at again. Here is how to define objectives that actually change what engineers do.

UVExcel Tech · 28 Apr 2026 · 10 min read

Service level objectives have a credibility problem. Many organizations have them; almost none use them. They are written during an architecture review, set at a comfortable 99.9% because the number looks responsible, and then forgotten until an executive asks why the dashboard is red. An SLO that never influences a decision is not a reliability tool; it is decoration. This piece describes SLOs that survive contact with production: ones that map to real user experience and earn their place in engineering conversations.

Start from the user's experience, not the server's

The foundational mistake is measuring what is easy rather than what matters. CPU utilization, host uptime, and process liveness are convenient to collect and almost irrelevant to whether a customer is having a good time. A service level indicator — the measurement underneath an SLO — should reflect something a user would actually notice: did the request succeed, and did it return quickly enough to be useful. Availability and latency, measured at the edge of the service from the consumer's perspective, are worth more than a wall of infrastructure metrics.

Concretely, a good SLI is a ratio of good events to total events: the proportion of requests that completed successfully under a latency threshold, over a rolling window. The threshold is a product decision, not a technical one. If users abandon a page after a second, then a request that takes three seconds is a failure even though it returned a 200, and your SLI should count it as such.
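A minimal sketch of that ratio, assuming each request is recorded with an HTTP status and an edge-measured latency. The `Request` type, the 5xx-only failure rule, and the one-second threshold are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code returned to the caller
    latency_s: float   # latency observed at the edge, in seconds


def is_good(req: Request, threshold_s: float) -> bool:
    """A request is good only if it succeeded AND was fast enough."""
    succeeded = req.status < 500  # assumption: only 5xx counts against the service
    return succeeded and req.latency_s <= threshold_s


def sli(requests: list[Request], threshold_s: float) -> float:
    """Ratio of good events to total events over the given window."""
    if not requests:
        return 1.0  # assumption: no traffic means nothing failed
    good = sum(is_good(r, threshold_s) for r in requests)
    return good / len(requests)


window = [
    Request(200, 0.2),
    Request(200, 3.0),  # returned a 200, but too slow to be useful
    Request(503, 0.1),
    Request(200, 0.8),
]
print(sli(window, threshold_s=1.0))  # 2 good out of 4 -> 0.5
```

The slow 200 counts as a failure here, which is the point: the threshold encodes the product decision, and the status code alone does not.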

The error budget is the whole point

An SLO of 99.9% over thirty days is not really a statement about uptime — it is a statement that you are allowed to be unavailable for about forty-three minutes in that window. That allowance is the error budget, and it is the mechanism that turns reliability from an argument into a number. When the budget is healthy, teams should feel free to ship aggressively, run risky migrations, and experiment. When the budget is nearly exhausted, the same teams should slow down, freeze risky changes, and spend their effort on stability. The budget converts the perennial tug-of-war between feature velocity and reliability into a shared, data-driven rule.
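The arithmetic behind that allowance is short enough to write down. The sketch below uses hypothetical function names and shows both the time-based budget and, for request-based SLIs, the fraction of budget still unspent.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of full unavailability the SLO permits over the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget left, from good/total event counts.

    1.0 means untouched, 0.0 means exhausted, negative means overspent.
    """
    allowed_bad = (1.0 - slo_target) * total
    actual_bad = total - good
    if allowed_bad == 0:
        return 1.0 if actual_bad == 0 else 0.0
    return 1.0 - actual_bad / allowed_bad


print(round(error_budget_minutes(0.999, 30), 1))            # 43.2
print(round(budget_remaining(0.999, 999_500, 1_000_000), 3))  # 0.5
```

In the second call, 500 failed requests out of a million against a 99.9% target spend half of the 1,000 failures the budget allows.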

If a blown error budget never changes a release decision, you do not have an SLO — you have a chart. The budget must have teeth that everyone, including product, has agreed to in advance.

Set the target where it actually matters

More nines are not better. Each additional nine cuts the allowed downtime by a factor of ten, which makes it far more expensive to achieve and frequently invisible to users. The right target is the point at which additional reliability stops changing user behavior. If your dependency chain (DNS, CDN, the user's own network) already imposes more unreliability than your service does, chasing an extra nine inside your service is wasted money. Pick the loosest objective your users will not notice, then hold yourself to it ruthlessly. A 99.9% target you honor is worth more than a 99.99% target you quietly miss.
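To see why the dependency chain caps what users experience, here is a rough sketch that multiplies availabilities in series. It assumes failures are independent, and the dependency figures are invented for illustration rather than measured.

```python
from math import prod


def end_to_end_availability(components: list[float]) -> float:
    """Availability of components in series, assuming independent failures."""
    return prod(components)


# Hypothetical figures for DNS, CDN, and the user's own network.
dependencies = [0.9999, 0.9995, 0.995]

at_three_nines = end_to_end_availability(dependencies + [0.999])
at_four_nines = end_to_end_availability(dependencies + [0.9999])

print(f"service at 99.9%  -> user sees {at_three_nines:.3%}")
print(f"service at 99.99% -> user sees {at_four_nines:.3%}")
print(f"gain from the extra nine: {at_four_nines - at_three_nines:.3%}")
```

Under these assumptions the user never sees better than the dependencies alone allow, and the extra nine buys back less than a tenth of a percentage point while the larger losses upstream remain.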

Make them living documents

SLOs decay. Traffic patterns shift, new endpoints appear, a dependency changes its behavior, and the threshold that made sense last quarter no longer reflects reality. Review SLOs on a cadence — monthly is reasonable for active services — and treat persistent over-achievement as a signal too. A service that has not touched its error budget in six months is either over-provisioned or running too conservatively, and that is a conversation worth having.
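One way to make over-achievement visible at review time is a simple check over recent budget consumption. This is a sketch only: the function name, the six-month lookback, and the 5% cutoff for "has not touched its budget" are assumptions to tune per service.

```python
def needs_target_review(
    monthly_budget_used: list[float],
    months: int = 6,
    untouched_below: float = 0.05,
) -> bool:
    """Flag a service whose error budget has gone essentially unused.

    monthly_budget_used holds the fraction of budget spent each month,
    oldest first (0.0 = none spent, 1.0 = fully spent).
    """
    recent = monthly_budget_used[-months:]
    if len(recent) < months:
        return False  # not enough history to draw a conclusion
    return all(used < untouched_below for used in recent)


print(needs_target_review([0.0, 0.01, 0.0, 0.02, 0.0, 0.0]))  # True
print(needs_target_review([0.0, 0.0, 0.0, 0.4, 0.0, 0.0]))    # False
```

A `True` here does not mean the target is wrong; it means the over-provisioning conversation is due.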

A practical starting sequence

1. Identify the two or three user journeys that matter most for each critical service.
2. Define SLIs as good-events-over-total-events for the availability and latency of those journeys, measured from the consumer's side.
3. Set an initial SLO target based on current performance and user tolerance, not aspiration.
4. Wire the error budget into alerting and into your release policy, with agreed consequences when it burns too fast.
5. Review and adjust on a fixed cadence, and retire indicators that no longer reflect user experience.
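Step 4 is the one most often left vague, so here is a hedged sketch of what wiring the budget into alerting and release policy could look like. It uses burn rate, the observed error rate divided by the rate the SLO allows, so that a burn rate of 1 spends the budget exactly over the window. The paging threshold of 14.4, the two-window rule, and the 10% freeze line are illustrative defaults, not values from this article.

```python
def burn_rate(bad: int, total: int, slo_target: float) -> float:
    """How fast the budget is being spent relative to the allowed pace."""
    if total == 0:
        return 0.0
    return (bad / total) / (1.0 - slo_target)


def should_page(short_window_rate: float, long_window_rate: float,
                threshold: float = 14.4) -> bool:
    """Page only when a short and a long window both burn too fast.

    Requiring both filters out brief spikes that have already recovered.
    """
    return short_window_rate >= threshold and long_window_rate >= threshold


def release_allowed(budget_remaining: float, freeze_below: float = 0.10) -> bool:
    """Agreed consequence: freeze risky changes when little budget is left."""
    return budget_remaining > freeze_below


# 20 failures in 1,000 requests against a 99.9% target: roughly 20x the allowed pace.
rate = burn_rate(bad=20, total=1000, slo_target=0.999)
print(round(rate, 1))                                         # 20.0
print(should_page(short_window_rate=rate, long_window_rate=15.0))  # True
print(release_allowed(budget_remaining=0.05))                 # False
```

The exact numbers matter less than the fact that they are written down and agreed with product before the budget burns.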

Done this way, an SLO stops being a number on a slide and becomes the quiet arbiter of dozens of daily decisions: whether to ship, whether to page, whether to invest in resilience or in features. That is the test of a good objective — not how impressive it looks, but how often it changes what your team actually does.

Key takeaways

Measure user-visible success and latency, not host-level metrics that customers never feel.
The error budget is the mechanism — it must carry agreed consequences for release decisions, or it is meaningless.
Pick the loosest target users won't notice and honor it, rather than chasing expensive, invisible nines.
Review SLOs on a cadence; persistent over-achievement is as informative as persistent failure.
