{"id":2729,"date":"2026-04-15T09:30:00","date_gmt":"2026-04-15T09:30:00","guid":{"rendered":"https:\/\/mugnos-it.com\/?p=2729"},"modified":"2026-03-24T12:29:34","modified_gmt":"2026-03-24T12:29:34","slug":"leveraging-ai-to-define-slis-and-slos","status":"publish","type":"post","link":"https:\/\/mugnos-it.com\/pt\/leveraging-ai-to-define-slis-and-slos\/","title":{"rendered":"Leveraging AI to define SLIs and SLOs"},"content":{"rendered":"<div data-elementor-type=\"wp-post\" data-elementor-id=\"2729\" class=\"elementor elementor-2729\" data-elementor-post-type=\"post\">\n\t\t\t\t<div class=\"elementor-element elementor-element-739fda14 e-flex e-con-boxed e-con e-parent\" data-id=\"739fda14\" data-element_type=\"container\">\n\t\t\t\t\t<div class=\"e-con-inner\">\n\t\t\t\t<div class=\"elementor-element elementor-element-1d4bfd3e elementor-widget elementor-widget-text-editor\" data-id=\"1d4bfd3e\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p>Following up on last week&#8217;s analysis of Evernote&#8217;s transition to a Site Reliability Engineering (SRE) model, today we address the most common bottleneck teams face when adopting these practices: the &#8220;blank page syndrome.&#8221;<\/p>\n\n\n\n<p>It is easy to understand the theory behind Error Budgets, but translating a complex system architecture into precise, mathematically sound Service Level Indicators (SLIs) and Objectives (SLOs) is a time-consuming engineering challenge. This is where Artificial Intelligence, specifically Large Language Models (LLMs) orchestrated through programmatic pipelines, becomes a powerful operational tool.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p><strong>The SRE Bottleneck<\/strong><\/p>\n\n\n\n<p>Defining SLIs\/SLOs requires a deep understanding of the user&#8217;s critical journey and the underlying infrastructure. 
Engineers often spend weeks analyzing architecture diagrams, API specifications, and historical incident reports just to map out the Four Golden Signals (Latency, Traffic, Errors, and Saturation).<\/p>\n\n\n\n<p>Instead of doing all of this manually, you can use AI to parse system documentation and produce a first draft of reliability metrics.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p><strong>How to Reduce the Bottleneck<\/strong><\/p>\n\n\n\n<p>You can build your own agent to generate SLIs\/SLOs for your architecture components. With tools like LangChain, such agents can return more precise results.<\/p>\n\n\n\n<p>However, what we need most of the time is not an agent that does the whole job, but an initial draft or starting point to fight the blank page syndrome.<\/p>\n\n\n\n<p>A simple yet effective way to help with SLI\/SLO generation is prompt engineering: you write a prompt rich in context about your situation so that you get a more precise answer.<\/p>\n\n\n\n<p>Prompt engineering works with any AI tool, such as ChatGPT, Gemini, or Grok.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p><strong>But what is the context needed?<\/strong><\/p>\n\n\n\n<p>An LLM predicts each next token from the context it is given. If you don\u2019t make the details clear, such as the components, code, and observability tools you are using, there\u2019s a higher chance of getting AI hallucinations or unsatisfactory results.<\/p>\n\n\n\n<p>When you add context and details about your infrastructure and tie them to a desired output, such as a set of SLIs\/SLOs, the answer quality improves considerably.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p><strong>A Practical Example: Prompting for Reliability<\/strong><\/p>\n\n\n\n<p>Instead of asking a generic question like &#8220;What should my SLOs 
be?&#8221;, you must structure your prompt to enforce SRE constraints. Here is a prompt template you can use in your pipeline:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p><strong>ROLE<\/strong>: You are a Senior Site Reliability Engineer with deep experience in observability and production systems.<\/p>\n\n\n\n<p><strong>GOAL<\/strong>: Define the Service Level Indicators (SLIs) and Service Level Objectives (SLOs) based strictly on the Four Golden Signals (Latency, Traffic, Errors, Saturation). For each SLI, specify the exact telemetry to collect, the measurement mechanism, and suggested thresholds for WARNING and ALERT.<\/p>\n\n\n\n<p><strong>CONTEXT<\/strong>: Analyze the following system architecture:<\/p>\n\n\n\n<p>A simple event-driven application composed of three components:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Producer Service<\/strong> \u2013 generates events and sends messages to a queue. (Language: Go)<\/li>\n\n\n\n<li><strong>Queue Service (AWS SQS)<\/strong> \u2013 buffers and stores messages until they are processed.<\/li>\n\n\n\n<li><strong>Consumer Service<\/strong> \u2013 reads messages from the queue and processes them. (Language: Go)<\/li>\n<\/ul>\n\n\n\n<p>The producer publishes messages continuously to SQS.<\/p>\n\n\n\n<p>The consumer polls the queue and processes each message.<\/p>\n\n\n\n<p>Failures or delays in any component may affect message processing latency and system reliability.<\/p>\n\n\n\n<p><strong>CONSTRAINTS<\/strong>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Only use metrics that can realistically be collected from common telemetry sources (logs, metrics, traces, queue metrics, infrastructure metrics).<\/li>\n\n\n\n<li>Prefer percentiles for latency (e.g., p95, p99).<\/li>\n\n\n\n<li>Define thresholds that would indicate real user or system impact.<\/li>\n\n\n\n<li>Avoid generic or abstract indicators.<\/li>\n<\/ul>\n\n\n\n<p><strong>OUTPUT 
FORMAT<\/strong>: Create a table with the following columns:<\/p>\n\n\n\n<p>Component | Golden Signal | SLI | Telemetry \/ Metric | Measurement Method | Warning Threshold | Alert Threshold | Why It Matters<\/p>\n<\/blockquote>\n\n\n\n<p>With this structured prompt, the AI stops guessing and starts engineering. It pushes the LLM to map your specific architecture to industry-standard metrics and to propose warning and alert thresholds for your environment.<\/p>\n\n\n\n<p>Treat this as a template and add more information for better results. If you don\u2019t, you will have a standard case of \u201c<strong>Garbage In, Garbage Out<\/strong>\u201d.<\/p>\n\n\n\n<p>Still, even this example, which carries little information, already produces an output that can serve as a starting point for defining what needs to be monitored (SLIs and SLOs).<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img fetchpriority=\"high\" decoding=\"async\" width=\"1024\" height=\"557\" src=\"https:\/\/mugnos-it.com\/wp-content\/uploads\/2026\/03\/image-9-1024x557.png\" alt=\"\" class=\"wp-image-2730\" srcset=\"https:\/\/mugnos-it.com\/wp-content\/uploads\/2026\/03\/image-9-1024x557.png 1024w, https:\/\/mugnos-it.com\/wp-content\/uploads\/2026\/03\/image-9-300x163.png 300w, https:\/\/mugnos-it.com\/wp-content\/uploads\/2026\/03\/image-9-768x418.png 768w, https:\/\/mugnos-it.com\/wp-content\/uploads\/2026\/03\/image-9-1536x835.png 1536w, https:\/\/mugnos-it.com\/wp-content\/uploads\/2026\/03\/image-9-2048x1114.png 2048w, https:\/\/mugnos-it.com\/wp-content\/uploads\/2026\/03\/image-9-18x10.png 18w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>For this prompt, we used GPT-5-AUTO. It answered with many relevant metrics, and for the architecture provided in the prompt, its suggestions were quite accurate.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p><strong>The 
Outcomes<\/strong><\/p>\n\n\n\n<p>By orchestrating AI for this specific task, engineering teams can standardize reliability definitions across microservices. It reduces subjective debates over what should be monitored and helps ensure that every new service ships with a baseline set of SLIs and alerting thresholds grounded in the Four Golden Signals. If you request the output as JSON rather than a table, it can then be integrated into provisioning tools to automatically create Datadog or Prometheus dashboards.<\/p>\n\n\n\n<p>If you&#8217;re enjoying this content, we&#8217;d love your feedback. Stay tuned, because there&#8217;s much more to come!<\/p>\n\n\n\n<p>See you!!<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t<div class=\"elementor-element elementor-element-7f2656f e-flex e-con-boxed e-con e-parent\" data-id=\"7f2656f\" data-element_type=\"container\">\n\t\t\t\t\t<div class=\"e-con-inner\">\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<\/div>","protected":false},"excerpt":{"rendered":"<p>Following up on last week&#8217;s analysis of Evernote&#8217;s transition to a Site Reliability Engineering (SRE) model, today we address the most common bottleneck teams face when adopting these practices: the &#8220;blank page syndrome.&#8221; It is easy to understand the theory behind Error Budgets, but translating a complex system architecture into precise, mathematically sound Service Level 
[&hellip;]<\/p>","protected":false},"author":3,"featured_media":2731,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[1],"tags":[],"class_list":["post-2729","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"aioseo_notices":[],"jetpack_featured_media_url":"https:\/\/mugnos-it.com\/wp-content\/uploads\/2026\/03\/ChatGPT-Image-24-de-mar.-de-2026-09_08_24.png","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/mugnos-it.com\/pt\/wp-json\/wp\/v2\/posts\/2729","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mugnos-it.com\/pt\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mugnos-it.com\/pt\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mugnos-it.com\/pt\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/mugnos-it.com\/pt\/wp-json\/wp\/v2\/comments?post=2729"}],"version-history":[{"count":4,"href":"https:\/\/mugnos-it.com\/pt\/wp-json\/wp\/v2\/posts\/2729\/revisions"}],"predecessor-version":[{"id":2735,"href":"https:\/\/mugnos-it.com\/pt\/wp-json\/wp\/v2\/posts\/2729\/revisions\/2735"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/mugnos-it.com\/pt\/wp-json\/wp\/v2\/media\/2731"}],"wp:attachment":[{"href":"https:\/\/mugnos-it.com\/pt\/wp-json\/wp\/v2\/media?parent=2729"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mugnos-it.com\/pt\/wp-json\/wp\/v2\/categories?post=2729"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mugnos-it.com\/pt\/wp-json\/wp\/v2\/tags?post=2729"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}