{"id":2749,"date":"2026-05-06T09:30:00","date_gmt":"2026-05-06T09:30:00","guid":{"rendered":"https:\/\/mugnos-it.com\/?p=2749"},"modified":"2026-03-24T12:45:20","modified_gmt":"2026-03-24T12:45:20","slug":"what-are-the-four-golden-signals-quality-monitoring","status":"publish","type":"post","link":"https:\/\/mugnos-it.com\/pt\/what-are-the-four-golden-signals-quality-monitoring\/","title":{"rendered":"What are the &#8220;Four Golden Signals&#8221;? Quality Monitoring"},"content":{"rendered":"<div data-elementor-type=\"wp-post\" data-elementor-id=\"2749\" class=\"elementor elementor-2749\" data-elementor-post-type=\"post\">\n\t\t\t\t<div class=\"elementor-element elementor-element-17edfc7f e-flex e-con-boxed e-con e-parent\" data-id=\"17edfc7f\" data-element_type=\"container\">\n\t\t\t\t\t<div class=\"e-con-inner\">\n\t\t\t\t<div class=\"elementor-element elementor-element-21e14ba0 elementor-widget elementor-widget-text-editor\" data-id=\"21e14ba0\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p><\/p>\n\n\n\n<p>What are the &#8220;Four Golden Signals&#8221;? Quality Monitoring<\/p>\n\n\n\n<p><\/p>\n\n\n\n<p>You&#8217;re drowning in metrics. Your monitoring dashboard has 47 panels. You&#8217;re getting paged for CPU spikes at 3 AM. And when something actually breaks, you still don&#8217;t know where to start looking.<\/p>\n\n\n\n<p>Sound familiar?<\/p>\n\n\n\n<p>This is exactly the problem the &#8220;Four Golden Signals&#8221; solve. These four metrics\u2014<strong>Latency<\/strong>, <strong>Traffic<\/strong>, <strong>Errors<\/strong>, and <strong>Saturation<\/strong>\u2014form the foundation of effective SRE monitoring. They&#8217;re not fancy. They&#8217;re not new. 
But they work because they answer the question every operator asks: <em>&#8220;What&#8217;s actually broken?&#8221;<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The Four Golden Signals: Your Monitoring North Star<\/h2>\n\n\n\n<p>The Four Golden Signals come from Google&#8217;s Site Reliability Engineering book and represent the core metrics you <strong>must monitor<\/strong> in any system, regardless of complexity. They&#8217;re not about perfecting your observability stack\u2014they&#8217;re about starting with what matters.<\/p>\n\n\n\n<p>Think of them as the four vital signs of your infrastructure.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Signal 1: Latency \u2014 How Long Are Things Taking?<\/h2>\n\n\n\n<p><strong>Latency<\/strong> is the time your system takes to respond to a request. It&#8217;s the most visible metric from a user&#8217;s perspective: How long does it take to load a page? How long to execute a database query? How long for your API to respond?<\/p>\n\n\n\n<p>Latency is tricky because it&#8217;s not uniform. You have:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Internal latency<\/strong>: Function calls, database queries, cache lookups<\/li>\n\n\n\n<li><strong>External latency<\/strong>: Network calls to third-party services, downstream APIs<\/li>\n\n\n\n<li><strong>Request path latency<\/strong>: The end-to-end time from request arrival to response delivery<\/li>\n<\/ul>\n\n\n\n<p>From an SRE perspective, latency connects directly to <strong>SLOs<\/strong>. If your SLO says &#8220;p99 latency &lt; 200ms,&#8221; you need to monitor that p99 continuously. If you&#8217;re consistently hitting 250ms, you&#8217;re burning through your error budget fast\u2014and users notice.<\/p>\n\n\n\n<p><strong>How to start<\/strong>: Pick your critical user journeys (login, search, checkout). Measure their latency at the p50, p95, and p99. 
Tools like AWS X-Ray, Datadog APM, or New Relic give you this granularity without guessing.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Signal 2: Traffic \u2014 How Much Load Are You Handling?<\/h2>\n\n\n\n<p><strong>Traffic<\/strong> measures the volume flowing through your system: How many requests per second? How many database inserts? How many messages in your queue?<\/p>\n\n\n\n<p>Traffic is your early warning system. A sudden spike in traffic can explain a cascade of failures:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>High latency? Check traffic. Maybe you&#8217;re getting 10x normal load.<\/li>\n\n\n\n<li>Database struggling? Check traffic. Maybe an agent started generating garbage requests.<\/li>\n\n\n\n<li>Memory usage climbing? Check traffic. Maybe someone misconfigured a retry loop.<\/li>\n<\/ul>\n\n\n\n<p>From an SRE perspective, traffic patterns tell you whether you&#8217;re <strong>within capacity<\/strong>. If your system can handle 1,000 req\/s but you&#8217;re now seeing 2,000 req\/s, saturation (your fourth signal) will follow. Understanding traffic lets you <strong>scale before you break<\/strong>.<\/p>\n\n\n\n<p><strong>How to start<\/strong>: Instrument key flows (API requests, database transactions, message publishing). Use rate-based metrics from Prometheus, CloudWatch, or Datadog. Alert on anomalies: &#8220;Request rate jumped 50% in 5 minutes.&#8221;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Signal 3: Errors \u2014 How Many Requests Are Failing?<\/h2>\n\n\n\n<p><strong>Errors<\/strong> are requests that didn&#8217;t succeed. They come in several forms:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>HTTP 4xx (client error) and 5xx (server error) responses<\/li>\n\n\n\n<li>Failed database transactions<\/li>\n\n\n\n<li>Exceptions in logs<\/li>\n\n\n\n<li>Timeouts on downstream services<\/li>\n<\/ul>\n\n\n\n<p>Errors are the most direct signal that something is broken. But here&#8217;s the catch: not all errors are visible in HTTP status codes. 
You need to monitor:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>API error rates (4xx and 5xx)<\/li>\n\n\n\n<li>Application exception rates (check your logs)<\/li>\n\n\n\n<li>Failed transactions in your database<\/li>\n\n\n\n<li>Timeout rates on external service calls<\/li>\n<\/ul>\n\n\n\n<p>From an SRE perspective, error rates feed directly into your <strong>SLI<\/strong> (Service Level Indicator). If your SLO is 99.9% availability and you&#8217;re measuring 99.95%, half of your error budget is already spent. Catching this early means you can act before breaching the SLO.<\/p>\n\n\n\n<p><strong>How to start<\/strong>: Query your logs for error patterns. Set up alerts: &#8220;Error rate &gt; 1% for 5 minutes&#8221; or &#8220;5 errors in 1 minute on the checkout endpoint.&#8221; Use structured logging to make errors searchable.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Signal 4: Saturation \u2014 How Full Is Your System?<\/h2>\n\n\n\n<p><strong>Saturation<\/strong> measures how much of your resources you&#8217;re using: CPU, memory, disk I\/O, database connections, queue depth.<\/p>\n\n\n\n<p>Saturation is the sneakiest signal because it&#8217;s a <em>leading indicator<\/em> of problems. High saturation doesn&#8217;t mean you&#8217;re broken yet\u2014it means you&#8217;re about to be. When CPU hits 90%, latency climbs. When memory swaps, everything slows. When your connection pool maxes out, requests queue.<\/p>\n\n\n\n<p>From an SRE perspective, saturation is tied to <strong>resilience<\/strong>. A system at 100% saturation has zero headroom to handle traffic spikes, failovers, or unexpected load. This violates the SRE principle of <strong>static stability<\/strong>\u2014your system should degrade gracefully under load, not catastrophically.<\/p>\n\n\n\n<p><strong>How to start<\/strong>: Monitor the basics on every instance: CPU, memory, disk usage. Monitor application-level saturation: database connection pools, queue lengths, thread pools. 
Alert at 70-80% saturation, not 95%.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Putting It Together: Troubleshooting With the Four Signals<\/h2>\n\n\n\n<p>Here&#8217;s how they work together. A user complains: <em>&#8220;The site is slow!&#8221;<\/em><\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Check Latency<\/strong>: p99 latency is 800ms (normal is 200ms). \u2713 Confirmed slow.<\/li>\n\n\n\n<li><strong>Check Traffic<\/strong>: Requests per second are 2x normal. \u2713 We&#8217;re under load.<\/li>\n\n\n\n<li><strong>Check Errors<\/strong>: Error rate is 0.1% (normal). \u2713 No widespread failures.<\/li>\n\n\n\n<li><strong>Check Saturation<\/strong>: Database CPU is 85%, memory is 90%. \u2717 The database is saturated.<\/li>\n<\/ol>\n\n\n\n<p><strong>Root cause<\/strong>: Database can&#8217;t keep up with query volume. <strong>Action<\/strong>: Scale the database, optimize slow queries, or add caching.<\/p>\n\n\n\n<p>Without these four signals, you&#8217;re guessing. With them, you have a repeatable troubleshooting process.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Why These Four? Why Not More?<\/h2>\n\n\n\n<p>You might think: <em>&#8220;But what about P50 latency? Memory leaks? Network drops?&#8221;<\/em><\/p>\n\n\n\n<p>The Four Golden Signals work because they&#8217;re <strong>universal<\/strong>. Every system has latency, traffic, errors, and saturation. They&#8217;re simple enough that operators can remember them during incidents. And critically\u2014they&#8217;re <strong>correlated<\/strong>. Most problems show up in multiple signals at once.<\/p>\n\n\n\n<p>This aligns with SRE&#8217;s principle of <strong>simplicity<\/strong>: you don&#8217;t need a perfect observability stack to be effective. You need the right metrics, measured well, with clear thresholds.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Getting Started: A Minimal Implementation<\/h2>\n\n\n\n<p>You don&#8217;t need enterprise monitoring tools. 
Start here:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Latency<\/strong>: Add timing logs (<code>console.time()<\/code>, <code>start = time.perf_counter()<\/code>) to critical paths<\/li>\n\n\n\n<li><strong>Traffic<\/strong>: Count requests with a simple counter (<code>requests_total<\/code> in Prometheus)<\/li>\n\n\n\n<li><strong>Errors<\/strong>: Parse logs for errors and count them (<code>errors_total<\/code>)<\/li>\n\n\n\n<li><strong>Saturation<\/strong>: Collect host metrics (CPU, memory) from your OS<\/li>\n<\/ul>\n\n\n\n<p>Set alerts on each signal at sensible thresholds for your system. When an alert fires, check all four before acting.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>The Four Golden Signals aren&#8217;t revolutionary. They&#8217;re not the latest monitoring trend. They&#8217;re durable because they answer the fundamental question of SRE: <em>&#8220;Is my system healthy, and if not, where do I start looking?&#8221;<\/em><\/p>\n\n\n\n<p>From an SRE perspective, these signals connect to everything:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Latency<\/strong> feeds into SLOs and user satisfaction<\/li>\n\n\n\n<li><strong>Traffic<\/strong> determines whether you&#8217;re within capacity<\/li>\n\n\n\n<li><strong>Errors<\/strong> show direct impact on reliability<\/li>\n\n\n\n<li><strong>Saturation<\/strong> reveals resilience and headroom for failure<\/li>\n<\/ul>\n\n\n\n<p>Start measuring these four. Build alerts on them. Practice troubleshooting with them. 
You&#8217;ll find that most of your incidents become solvable with just these four metrics.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p><strong>Stay tuned for deeper dives into building monitoring that actually helps you sleep at night.<\/strong><\/p>\n\n\n\n<p>Cheers,<\/p>\n\n\n\n<p>Douglas Mugnos<\/p>\n\n\n\n<p>MUGNOS-IT \ud83d\ude80<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t<div class=\"elementor-element elementor-element-f66a67d e-flex e-con-boxed e-con e-parent\" data-id=\"f66a67d\" data-element_type=\"container\">\n\t\t\t\t\t<div class=\"e-con-inner\">\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<\/div>","protected":false},"excerpt":{"rendered":"<p>What are the &#8220;Four Golden Signals&#8221;? Quality Monitoring You&#8217;re drowning in metrics. Your monitoring dashboard has 47 panels. You&#8217;re getting paged for CPU spikes at 3 AM. And when something actually breaks, you still don&#8217;t know where to start looking. Sound familiar? This is exactly the problem the &#8220;Four Golden Signals&#8221; solve. 
These four metrics\u2014Latency, [&hellip;]<\/p>","protected":false},"author":3,"featured_media":2750,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[1],"tags":[],"class_list":["post-2749","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"aioseo_notices":[],"jetpack_featured_media_url":"https:\/\/mugnos-it.com\/wp-content\/uploads\/2026\/03\/ChatGPT-Image-24-de-mar.-de-2026-09_44_06.png","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/mugnos-it.com\/pt\/wp-json\/wp\/v2\/posts\/2749","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mugnos-it.com\/pt\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mugnos-it.com\/pt\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mugnos-it.com\/pt\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/mugnos-it.com\/pt\/wp-json\/wp\/v2\/comments?post=2749"}],"version-history":[{"count":4,"href":"https:\/\/mugnos-it.com\/pt\/wp-json\/wp\/v2\/posts\/2749\/revisions"}],"predecessor-version":[{"id":2754,"href":"https:\/\/mugnos-it.com\/pt\/wp-json\/wp\/v2\/posts\/2749\/revisions\/2754"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/mugnos-it.com\/pt\/wp-json\/wp\/v2\/media\/2750"}],"wp:attachment":[{"href":"https:\/\/mugnos-it.com\/pt\/wp-json\/wp\/v2\/media?parent=2749"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mugnos-it.com\/pt\/wp-json\/wp\/v2\/categories?post=2749"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mugnos-it.com\/pt\/wp-json\/wp\/v2\/tags?post=2749"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}
]}}