{"id":2655,"date":"2026-03-11T09:30:00","date_gmt":"2026-03-11T09:30:00","guid":{"rendered":"https:\/\/mugnos-it.com\/?p=2655"},"modified":"2026-03-03T12:58:54","modified_gmt":"2026-03-03T12:58:54","slug":"what-is-the-importance-of-defining-good-slis-slos-and-slas","status":"publish","type":"post","link":"https:\/\/mugnos-it.com\/pt\/what-is-the-importance-of-defining-good-slis-slos-and-slas\/","title":{"rendered":"What is the importance of defining good SLIs, SLOs, and SLAs?"},"content":{"rendered":"<div data-elementor-type=\"wp-post\" data-elementor-id=\"2655\" class=\"elementor elementor-2655\" data-elementor-post-type=\"post\">\n\t\t\t\t<div class=\"elementor-element elementor-element-38ca1016 e-flex e-con-boxed e-con e-parent\" data-id=\"38ca1016\" data-element_type=\"container\">\n\t\t\t\t\t<div class=\"e-con-inner\">\n\t\t\t\t<div class=\"elementor-element elementor-element-40056fff elementor-widget elementor-widget-text-editor\" data-id=\"40056fff\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p>\u201cCan you take a look at the service for me?\u201d<\/p>\n\n\n\n<p>If you\u2019ve ever heard that in the middle of a sprint, incident, or code review\u2026 you know how frustrating that sentence can be.<\/p>\n\n\n\n<p>Check what? Based on what? Just because the API returned <code>200<\/code> and the front-end buttons rendered, does that really mean everything is healthy?<\/p>\n\n\n\n<p>We need to stop guessing. And start defining <strong>what a healthy system actually means<\/strong>.<\/p>\n\n\n\n<p>That\u2019s where the famous (and often ignored) concepts come in:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>SLI, SLO and SLA<\/p>\n<\/blockquote>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">What are SLI, SLO and SLA? 
(With real examples)<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">SLI \u2013 <strong>Service Level Indicator<\/strong><\/h3>\n\n\n\n<p>It\u2019s the <strong>raw metric<\/strong> you\u2019re observing. Examples:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>SLI Name<\/th><th>What it measures<\/th><th>Practical example<\/th><\/tr><\/thead><tbody><tr><td>Error rate<\/td><td>% of HTTP requests returning errors<\/td><td>Requests to <code>\/checkout<\/code> with 5xx status<\/td><\/tr><tr><td>Availability<\/td><td>% of time an endpoint is accessible<\/td><td>Availability of <code>\/auth\/login<\/code><\/td><\/tr><tr><td>Latency<\/td><td>Response time of requests<\/td><td>Time to receive response from <code>\/payment\/process<\/code><\/td><\/tr><tr><td>Throughput<\/td><td>Volume of processed requests<\/td><td>Number of requests\/sec on <code>\/orders\/list<\/code><\/td><\/tr><tr><td>Saturation<\/td><td>Usage of critical system resources<\/td><td>Occupancy of the Kafka payment queue (e.g., above 80%)<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">SLO \u2013 <strong>Service Level Objective<\/strong><\/h3>\n\n\n\n<p>It\u2019s the <strong>internal technical agreement<\/strong> about the acceptable threshold for that SLI. 
Example:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>SLI<\/th><th>Suggested SLO<\/th><\/tr><\/thead><tbody><tr><td>Error rate of <code>\/checkout<\/code><\/td><td>Less than 0.1% 5xx errors over 30 days<\/td><\/tr><tr><td>Availability of <code>\/auth\/login<\/code><\/td><td>99.95% monthly uptime<\/td><\/tr><tr><td>Latency on <code>\/payment\/process<\/code><\/td><td>95% of responses under 300ms<\/td><\/tr><tr><td>Throughput on <code>\/orders\/list<\/code><\/td><td>At least 100 sustained RPS without errors<\/td><\/tr><tr><td>Kafka payment queue saturation<\/td><td>Queue must never exceed 80% for more than 5 consecutive minutes<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">SLA \u2013 <strong>Service Level Agreement<\/strong><\/h3>\n\n\n\n<p>It\u2019s the <strong>formal agreement with a client or business team<\/strong>. Example:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>\u201cIf monthly availability of \/auth\/login drops below 99.5%, there will be a contractual penalty of X dollars.\u201d<\/p>\n<\/blockquote>\n\n\n\n<p>We won\u2019t go deep into SLAs here, because the focus is helping you build real visibility and strong technical agreements within your team using SLI + SLO.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">A Practical Example (With Template)<\/h2>\n\n\n\n<p>Imagine you have a <strong>payment API<\/strong> running in containers on Kubernetes. 
Below is a table with real SLI\/SLO examples you can use as a base to implement <strong>observability + alerts<\/strong>.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Metric (SLI)<\/th><th>Objective (SLO)<\/th><th>Where to measure<\/th><th>Alert Type<\/th><th>Priority<\/th><\/tr><\/thead><tbody><tr><td>HTTP 5xx error rate on <code>\/payment\/checkout<\/code><\/td><td>&lt; 0.1% over last 7 days<\/td><td>API Gateway \/ APM (Datadog, Prometheus, etc.)<\/td><td>Alert if &gt; 0.1% for 15 min<\/td><td>High<\/td><\/tr><tr><td>Latency on <code>\/payment\/process<\/code><\/td><td>95% &lt; 300ms (p95)<\/td><td>Distributed tracing + Logs<\/td><td>Alert if p95 &gt; 300ms for 5 min<\/td><td>High<\/td><\/tr><tr><td>Availability of <code>payment-api<\/code> container<\/td><td>99.95% monthly<\/td><td>Kubernetes healthcheck + Prometheus<\/td><td>Alert if container crashing\/restarting<\/td><td>High<\/td><\/tr><tr><td>CPU usage of container<\/td><td>&lt; 70% sustained<\/td><td>Prometheus \/ Grafana<\/td><td>Alert if &gt; 70% for 10 min<\/td><td>Medium<\/td><\/tr><tr><td>Memory usage of container<\/td><td>&lt; 75% sustained<\/td><td>Prometheus \/ Grafana<\/td><td>Alert if &gt; 75% for 10 min<\/td><td>Medium<\/td><\/tr><tr><td>Kafka payment queue usage<\/td><td>&lt; 80% buffer<\/td><td>Kafka Exporter + Prometheus<\/td><td>Alert if &gt; 80% for 10 min<\/td><td>High<\/td><\/tr><tr><td>Kafka backlog<\/td><td>&lt; 1,000 delayed messages<\/td><td>Kafka metrics<\/td><td>Alert if backlog grows for 10 min<\/td><td>High<\/td><\/tr><tr><td>Database availability<\/td><td>&gt; 99.9% weekly<\/td><td>DB Proxy or APM monitoring<\/td><td>Alert if unavailable for &gt; 1 min<\/td><td>High<\/td><\/tr><tr><td>Auth request error rate (<code>\/auth<\/code>)<\/td><td>&lt; 0.2% over 7 days<\/td><td>API Gateway or Auth Service<\/td><td>Alert if spike &gt; 0.2% for 15 min<\/td><td>High<\/td><\/tr><tr><td>Total service throughput<\/td><td>Sustain &gt; 100 RPS 
stably<\/td><td>APM + Load Balancer metrics<\/td><td>Alert if abrupt drop in throughput<\/td><td>High<\/td><\/tr><tr><td>Internal job queue time (e.g., invoice generation)<\/td><td>&lt; 1s average<\/td><td>Job runner metrics or Prometheus<\/td><td>Alert if average &gt; 1s for 5 min<\/td><td>Medium<\/td><\/tr><tr><td>p99 latency on <code>\/refunds<\/code><\/td><td>&lt; 500ms<\/td><td>APM or tracing<\/td><td>Alert if p99 &gt; 500ms for 10 min<\/td><td>Medium<\/td><\/tr><tr><td>External call errors (e.g., payment gateway)<\/td><td>&lt; 0.5%<\/td><td>Circuit breaker + logs<\/td><td>Alert on spike or constant timeout<\/td><td>High<\/td><\/tr><tr><td>Automatic retry rate<\/td><td>&lt; 2% of requests<\/td><td>Retry middleware<\/td><td>Alert if &gt; 2% for 15 min<\/td><td>Low<\/td><\/tr><tr><td>Event deserialization\/parse errors<\/td><td>0 invalid events<\/td><td>Kafka consumer + logs<\/td><td>Alert if invalid event received<\/td><td>High<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Now it\u2019s on you:<\/h2>\n\n\n\n<p>The table above is <strong>just an example<\/strong>. 
Every system has its own characteristics, critical points, and specific needs.<\/p>\n\n\n\n<p><strong>What you can (and should) do now:<\/strong><\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Pick one microservice from your system<\/li>\n\n\n\n<li>List its main endpoints and responsibilities<\/li>\n\n\n\n<li>Define 5 to 10 real <strong>SLIs<\/strong> that represent what \u201chealth\u201d means there<\/li>\n\n\n\n<li>For each SLI, define an <strong>SLO<\/strong><\/li>\n\n\n\n<li>Configure monitoring and <strong>alerts based on those SLOs<\/strong><\/li>\n\n\n\n<li>Share it with the team (it\u2019s useless if only you know it)<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">To wrap up: what if this became culture?<\/h2>\n\n\n\n<p>Now imagine this\u2026<\/p>\n\n\n\n<p>You have all your <strong>SLIs clearly defined<\/strong>, your <strong>SLOs visible on a dashboard<\/strong>, alerts configured with clear criteria, and the whole team knowing <strong>exactly what a healthy system looks like<\/strong>.<\/p>\n\n\n\n<p>It would be much easier to make decisions, right?<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Knowing when it\u2019s time to act (and when it\u2019s not)<\/li>\n\n\n\n<li>Having clarity about system health without relying on assumptions<\/li>\n\n\n\n<li>Avoiding repetitive and meaningless work (the famous <em>toil<\/em> from SRE principles)<\/li>\n\n\n\n<li>Focusing on what really matters: delivering value with stability<\/li>\n<\/ul>\n\n\n\n<p><strong>An SLO is not meant to become a forgotten spreadsheet.<\/strong> It\u2019s meant to be <strong>used every single day<\/strong> as the compass for reliability.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Want to learn MORE?<\/h2>\n\n\n\n<p>Honestly? 
I don\u2019t think you should be doing all of this manually anymore.<\/p>\n\n\n\n<p>I\u2019m about to launch a new training called <strong>\u201cSRE Efficient: How AI Transforms Reliability Engineering\u201d<\/strong>, and in one of the classes I show exactly how you can use LLMs to help you define SLIs and SLOs, and even generate incident summaries in minutes.<\/p>\n\n\n\n<p>Yes. Minutes.<\/p>\n\n\n\n<p>If you want to see how AI can amplify your reliability practice instead of just generating alerts and dashboards, stay tuned.<\/p>\n\n\n\n<p>More details coming very soon. \ud83d\ude80<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t<div class=\"elementor-element elementor-element-cdb9143 e-flex e-con-boxed e-con e-parent\" data-id=\"cdb9143\" data-element_type=\"container\">\n\t\t\t\t\t<div class=\"e-con-inner\">\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<\/div>","protected":false},"excerpt":{"rendered":"<p>\u201cCan you take a look at the service for me?\u201d If you\u2019ve ever heard that in the middle of a sprint, incident, or code review\u2026 you know how frustrating that sentence can be. Check what? Based on what? 
Just because the API returned 200 and the front-end buttons rendered, does that really mean everything is [&hellip;]<\/p>","protected":false},"author":3,"featured_media":2656,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[1],"tags":[],"class_list":["post-2655","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"aioseo_notices":[],"jetpack_featured_media_url":"https:\/\/mugnos-it.com\/wp-content\/uploads\/2026\/03\/Vem-conferir-minha-pagina-de-treinamentos-2.png","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/mugnos-it.com\/pt\/wp-json\/wp\/v2\/posts\/2655","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mugnos-it.com\/pt\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mugnos-it.com\/pt\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mugnos-it.com\/pt\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/mugnos-it.com\/pt\/wp-json\/wp\/v2\/comments?post=2655"}],"version-history":[{"count":4,"href":"https:\/\/mugnos-it.com\/pt\/wp-json\/wp\/v2\/posts\/2655\/revisions"}],"predecessor-version":[{"id":2660,"href":"https:\/\/mugnos-it.com\/pt\/wp-json\/wp\/v2\/posts\/2655\/revisions\/2660"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/mugnos-it.com\/pt\/wp-json\/wp\/v2\/media\/2656"}],"wp:attachment":[{"href":"https:\/\/mugnos-it.com\/pt\/wp-json\/wp\/v2\/media?parent=2655"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mugnos-it.com\/pt\/wp-json\/wp\/v2\/categories?post=2655"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mugnos-it.com\/pt\/wp-json\/wp\/v2\/tags?post=2655"}],"
curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}