Static Stability: Is Your Infrastructure Truly Resilient?

Today I want to bring you a reflection that, if you haven’t had it yet, you probably will soon — especially if you work (or want to work) with distributed systems, high availability, or cloud environments.

After all… does using one, two, or even three data centers really make your application more resilient and therefore increase your SLA?

And what if I told you that, in practice, it can actually increase the risk of downtime if it’s not well planned?

Sounds strange, right? But that’s exactly the core idea behind the concept of static stability. I talked about this in a YouTube video some time ago and saw that it was a new concept for a lot of people.


What Is Static Stability?

It’s the ability of your system to keep operating even when part of it fails. And more than that: to keep operating in a stable way, without collapsing the rest of the infrastructure.

This concept comes from “traditional” engineering — like civil or automotive engineering. Imagine a car designed to keep running with three wheels if one blows out. The system (the car) adapts and keeps moving, even if with limitations.

Now think with me:

If I build a car with four wheels, I also create four failure points. And if one tire blows? The car may need to reduce from 100 km/h to 20 km/h — or even stop completely until the tire is replaced.

In that case, wouldn’t a tricycle actually be more resilient? It can go 100 km/h like the car, but with a quarter fewer failure points: three wheels instead of four 👀

Sounds crazy, but this is exactly the kind of logic we need to apply when designing distributed systems. More components don’t always mean more safety.

Now think: can your software system actually do this?


The Classic “Fake Resilient” Mistake

People love to draw that beautiful diagram:

Users → Load Balancer → 3 AZs

And proudly say: “We’re running in three availability zones. We’re resilient!”

Is that really true? Let’s bring it to reality…

Imagine you have 3 data centers (AZ-A, AZ-B, and AZ-C), and each one is operating at 75% of its resource capacity.

(By the way, a quick note: an AZ, or Availability Zone, is an isolated location within a region. In practice, it may contain more than one physical data center, but architecturally you should treat it as a single failure domain. After all, those data centers are geographically close and often share power, network, or even environmental dependencies. 🌩️)

Going back to the car analogy: in this case, you have a tricycle with 3 possible failure points.

Now what happens if AZ-C goes down? How do you fit the 75% load that was running there into the other two AZs?

Let’s do the math:

  1. AZ-A and AZ-B were already at 75%.
  2. Now each must absorb another 37.5 percentage points (half of AZ-C’s 75%).
  3. Result? AZ-A and AZ-B jump to 112.5% utilization.
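
The same arithmetic as a tiny Python sketch. It assumes all zones are the same size and that traffic is spread evenly across the survivors; the function name is just something I made up for illustration.

```python
def utilization_after_failure(zones: int, utilization: float, failed: int = 1) -> float:
    """Per-zone utilization after `failed` zones go down.

    Assumes equal-sized zones and an even spread of traffic
    across the surviving zones.
    """
    survivors = zones - failed
    if survivors <= 0:
        raise ValueError("at least one zone must survive")
    total_load = zones * utilization  # load expressed in "zones' worth" of capacity
    return total_load / survivors


print(utilization_after_failure(3, 0.75))  # 1.125, i.e. 112.5%
```

Anything above 1.0 means the surviving zones are being asked for more than they have.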

And then what happens?

💥 Downtime.

🔥 Cascading failures.

And if you also implemented automatic retries “to be resilient,” that’s when chaos really spreads — the system starts drowning in itself.
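
To get a feel for how retries multiply load, here is a rough sketch. It assumes every attempt fails independently with the same probability and that clients retry immediately, with no backoff. Real systems are messier, and the function name is hypothetical.

```python
def expected_attempts(failure_rate: float, max_retries: int) -> float:
    """Average number of attempts a client makes per request.

    Assumes each attempt fails independently with probability
    `failure_rate` and is retried immediately, up to `max_retries` times.
    """
    # Attempt i only happens if the previous i attempts all failed.
    return sum(failure_rate ** i for i in range(max_retries + 1))


print(expected_attempts(0.5, 3))  # 1.875
print(expected_attempts(1.0, 3))  # 4.0
```

In other words, under these assumptions, when an overloaded zone starts failing every request, clients with 3 retries send it four times the traffic, exactly when it can least afford it.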

(Spoiler: I have a nearly 1-hour class just on retries in one of my architecture trainings, because if they’re not implemented properly, they can create massive problems.)


But Douglas, can’t we just scale?

Yes… if there’s time.

But what I usually see in real-world architectures is that scaling out takes longer than shifting traffic from the failed zone to the remaining ones.

If traffic redirected to the other AZs is intense and your autoscaling takes time to react, overload hits first. CPU spikes to 100%, threads block, queues explode.
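
A back-of-the-envelope sketch of that race. All the numbers here are hypothetical (1,000 requests/s of capacity per zone, 1,125 requests/s arriving after failover, 3 minutes for new capacity to come online), and the model ignores retries and timeouts, which would make things worse.

```python
def backlog_before_scale_out(capacity_rps: float, offered_rps: float,
                             scale_out_seconds: float) -> float:
    """Requests that pile up in one zone while it waits for new capacity.

    Simplified model: constant arrival rate, constant capacity,
    and nothing is dropped or retried in the meantime.
    """
    excess_rps = max(0.0, offered_rps - capacity_rps)
    return excess_rps * scale_out_seconds


# Hypothetical zone: 1,000 rps capacity, 1,125 rps offered, 180 s to scale out.
print(backlog_before_scale_out(1000.0, 1125.0, 180.0))  # 22500.0
```

Even with this optimistic model, each surviving zone is tens of thousands of requests behind before the first new instance is ready.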

You’ve seen this movie before, right?


So what is a statically stable system?

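It’s a system where the capacity needed to survive a failure is already in place before the failure happens. The surviving parts (for example, the remaining AZs) can carry the load without having to react or scale first.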

Meaning: if one zone goes down, the remaining ones can absorb the total load without exceeding 100%.

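So, if you have 3 AZs, make sure each one operates at a maximum of 60%. The hard ceiling is about 66.7%, since two survivors have to carry three zones’ worth of load, and 60% leaves some headroom below it. That way, if one fails, the other two go up to 90% and can start scaling additional resources in the healthy AZs.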
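
If you want to run the same check for a different number of zones, here is a minimal sketch. It assumes equal-sized zones and evenly spread traffic; the function name is mine, not from any library.

```python
def max_safe_utilization(zones: int, tolerated_failures: int = 1) -> float:
    """Highest per-zone utilization at which the survivors stay at or
    below 100% after `tolerated_failures` zones go down.

    Assumes equal-sized zones and an even spread of traffic.
    """
    survivors = zones - tolerated_failures
    if zones <= 0 or survivors <= 0:
        raise ValueError("need at least one surviving zone")
    return survivors / zones


print(max_safe_utilization(2))            # 0.5
print(round(max_safe_utilization(3), 3))  # 0.667
print(max_safe_utilization(4))            # 0.75
```

Treat the result as a ceiling, not a target: running a bit below it (like the 60% above) leaves room for traffic spikes while new capacity comes online.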

That’s static stability.


To conclude

Static stability is just one of many critical resilience concepts — along with load shedding, backpressure, retry strategies, graceful degradation, capacity planning, and proper SLO design.

As SREs, Staff, or senior engineers, we need to deeply understand these principles if we truly want to design and operate reliable systems — not just architectures that look good on diagrams.

Resilience is not about adding more zones or more components. It’s about understanding limits, failure modes, and system behavior under stress.

If you want to go deeper into these topics and truly master modern reliability engineering, stay tuned.

More content coming soon. 🚀
