What is the importance of defining good SLIs, SLOs, and SLAs?

“Can you take a look at the service for me?”

If you’ve ever heard that in the middle of a sprint, incident, or code review… you know how frustrating that sentence can be.

Check what? Based on what? Just because the API returned 200 and the front-end buttons rendered, does that really mean everything is healthy?

We need to stop guessing. And start defining what a healthy system actually means.

That’s where the famous (and often ignored) concepts come in:

SLI, SLO and SLA


What are SLI, SLO and SLA? (With real examples)

SLI – Service Level Indicator

It’s the raw metric you’re observing. Example:

| SLI Name | What it measures | Practical example |
|---|---|---|
| Error rate | % of HTTP requests returning errors | Requests to /checkout with 5xx status |
| Availability | % of time an endpoint is accessible | Availability of /api/auth/login |
| Latency | Response time of requests | Time to receive a response from /payment/process |
| Throughput | Volume of processed requests | Number of requests/sec on /orders/list |
| Saturation | Usage of critical system resources | Kafka order queue above 80% occupancy |
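In code terms, an SLI is just arithmetic over raw measurements. Here's a minimal sketch of the first two indicators above — the function names and counters are illustrative, not tied to any specific monitoring tool:

```python
def error_rate(errors_5xx: int, total_requests: int) -> float:
    """SLI: fraction of HTTP requests that returned a 5xx status."""
    if total_requests == 0:
        return 0.0  # no traffic means no observed errors
    return errors_5xx / total_requests


def availability(uptime_seconds: float, window_seconds: float) -> float:
    """SLI: fraction of the measurement window the endpoint was reachable."""
    return uptime_seconds / window_seconds
```

For example, 12 errors out of 100,000 requests gives `error_rate(12, 100_000)`, i.e. 0.012% — well under a 0.1% target.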

SLO – Service Level Objective

It’s the internal technical agreement about the minimum acceptable value for that SLI. Example:

| SLI | Suggested SLO |
|---|---|
| Error rate of /checkout | Less than 0.1% 5xx errors over 30 days |
| Availability of /auth/login | 99.95% monthly uptime |
| Latency on /payment/process | 95% of responses under 300ms |
| Throughput on /orders/list | At least 100 sustained RPS without errors |
| Kafka payment queue saturation | Queue must never exceed 80% for more than 5 consecutive minutes |
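Every availability SLO implies an error budget: the slice of the window you're allowed to fail. The arithmetic is simple — here's a sketch assuming a 30-day window like the examples above:

```python
def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Error budget for an availability SLO, in minutes over the window.

    slo is expressed as a fraction, e.g. 0.9995 for 99.95%.
    """
    total_minutes = window_days * 24 * 60
    return (1 - slo) * total_minutes
```

A 99.95% monthly SLO leaves roughly 21.6 minutes of allowed downtime per 30 days — that's the budget your alerts and deploy decisions should be protecting.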

SLA – Service Level Agreement

It’s the formal agreement with a client or business team. Example:

“If monthly availability of /auth/login drops below 99.5%, there will be a contractual penalty of X dollars.”

We won’t go deep into SLAs here, because the focus is on helping you build real visibility and strong technical agreements within your team using SLI + SLO.


A Practical Example (With Template)

Imagine you have a payment API running in containers on Kubernetes. Below is a table with real SLI/SLO examples you can use as a base to implement observability + alerts.

| Metric (SLI) | Objective (SLO) | Where to measure | Alert Type | Priority |
|---|---|---|---|---|
| HTTP 5xx error rate on /payment/checkout | < 0.1% over last 7 days | API Gateway / APM (Datadog, Prometheus, etc.) | Alert if > 0.1% for 15 min | High |
| Latency on /payment/process | 95% < 300ms (p95) | Distributed tracing + Logs | Alert if p95 > 300ms for 5 min | High |
| Availability of payment-api container | 99.95% monthly | Kubernetes healthcheck + Prometheus | Alert if container crashing/restarting | High |
| CPU usage of container | < 70% sustained | Prometheus / Grafana | Alert if > 70% for 10 min | Medium |
| Memory usage of container | < 75% sustained | Prometheus / Grafana | Alert if > 75% for 10 min | Medium |
| Kafka payment queue usage | < 80% buffer | Kafka Exporter + Prometheus | Alert if > 80% for 10 min | High |
| Kafka backlog | < 1,000 delayed messages | Kafka metrics | Alert if backlog grows for 10 min | High |
| Database availability | > 99.9% weekly | DB Proxy or APM monitoring | Alert if unavailable for > 1 min | High |
| Auth request error rate (/auth) | < 0.2% over 7 days | API Gateway or Auth Service | Alert if spike > 0.2% for 15 min | High |
| Total service throughput | Sustain > 100 RPS stably | APM + Load Balancer metrics | Alert if abrupt drop in throughput | High |
| Internal job queue time (e.g., invoice generation) | < 1s average | Job runner metrics or Prometheus | Alert if average > 1s for 5 min | Medium |
| p99 latency on /refunds | < 500ms | APM or tracing | Alert if p99 > 500ms for 10 min | Medium |
| External call errors (e.g., payment gateway) | < 0.5% | Circuit breaker + logs | Alert on spike or constant timeout | High |
| Automatic retry rate | < 2% of requests | Retry middleware | Alert if > 2% for 15 min | Low |
| Event deserialization/parse errors | 0 invalid events | Kafka consumer + logs | Alert if invalid event received | High |
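Notice the "for N minutes" clauses in the alert column: they exist so one noisy sample doesn't page anyone. A minimal sketch of that debouncing logic, assuming p95 latency values sampled once per minute (all names here are illustrative):

```python
from statistics import quantiles


def p95(latencies_ms: list[float]) -> float:
    """p95 latency from a sample of request latencies, in milliseconds."""
    return quantiles(latencies_ms, n=100)[94]


def should_alert(
    p95_series_ms: list[float],
    threshold_ms: float = 300.0,
    consecutive: int = 5,
) -> bool:
    """Fire only if p95 stayed above the SLO threshold for N consecutive
    evaluation intervals (the 'for 5 min' in the table above)."""
    recent = p95_series_ms[-consecutive:]
    return len(recent) == consecutive and all(v > threshold_ms for v in recent)
```

In Prometheus this same idea is expressed declaratively with the `for:` clause on an alerting rule; the point is the same — a breach has to persist before it counts.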

Now it’s on you:

The table above is just an example. Every system has its own characteristics, critical points, and specific needs.

What you can (and should) do now:

  1. Pick one microservice from your system
  2. List its main endpoints and responsibilities
  3. Define 5 to 10 real SLIs that represent what “health” means there
  4. For each SLI, define an SLO
  5. Configure monitoring and alerts based on those SLOs
  6. Share it with the team (it’s useless if only you know it)
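One way to make step 6 stick is to keep the SLOs themselves in version control, next to the service, instead of in someone's head. A hypothetical sketch — the fields and values below are illustrative, not a standard schema:

```python
from dataclasses import dataclass


@dataclass
class Slo:
    sli: str          # what we measure
    objective: str    # the target, human-readable
    alert_after: str  # how long a breach must last before paging
    priority: str


# Hypothetical starting point for a payments microservice
CHECKOUT_SLOS = [
    Slo("5xx rate on /payment/checkout", "< 0.1% over 7 days", "15 min", "High"),
    Slo("p95 latency on /payment/process", "< 300ms", "5 min", "High"),
    Slo("Kafka payment queue occupancy", "< 80%", "10 min", "High"),
]
```

Reviewed in a pull request like any other code, this file becomes the shared definition of "healthy" the whole team can point to.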

To wrap up: what if this became culture?

Now imagine this…

You have all your SLIs clearly defined, your SLOs visible on a dashboard, alerts configured with clear criteria — and the whole team knowing exactly what a healthy system looks like.

It would be much easier to make decisions, right?

  • Knowing when it’s time to act (and when it’s not)
  • Having clarity about system health without relying on assumptions
  • Avoiding repetitive, meaningless work (the famous toil from SRE principles)
  • Focusing on what really matters: delivering value with stability

SLO is not meant to become a forgotten spreadsheet. It’s meant to be used every single day as the compass for reliability.


Want to learn MORE?

Honestly? I don’t think you should be doing all of this manually anymore.

I’m about to launch a new training called “SRE Efficient: How AI Transforms Reliability Engineering”, and in one of the classes I show exactly how you can use LLMs to help you define SLIs, SLOs — and even generate incident summaries in minutes.

Yes. Minutes.

If you want to see how AI can amplify your reliability practice instead of just generating alerts and dashboards, stay tuned.

More details coming very soon. 🚀
