The SRE has reliability right there in the name. So is that all it’s about?
For anyone who’s been following this newsletter for a while, you already know my take: the SRE has always had a broader scope than the title suggests. It’s one of the roles with the most comprehensive, generalist vision in engineering — capable of adapting across very different contexts and problems. I jokingly call it the platypus of tech. A strange, hybrid creature that doesn’t fit neatly into any single category.
And now people are asking: should the SRE also be responsible for managing costs? well, “Managing costs” sounds strong when you say it out loud. but optimize costs looks fair, doesn’t it ?
The SRE Already Has the Vision
If you look at SRE principles, the role already covers observability, embrace risk, and a deep understanding of architecture. That comprehensive vision — understanding what’s running, how much it’s consuming, where the risks are and that is exactly what you need to spot cost optimization opportunities.
But here’s where things get confused: If you mention resilience in a meeting someone will immediately picture bigger machines, more instances, more capacity new datacenters… But that’s not how it works. Real resilience comes from automation, scalable applications, and smart design — not from throwing hardware at the problem. And the same effort that makes a system more resilient often makes it leaner. Reducing cost is, in many cases, a natural outcome of doing SRE work well.
Think about it this way: what’s better — an active/active cluster with two powerful servers, or several smaller servers behind an autoscaling group? With autoscaling, you get elasticity. The fleet of small servers grows when demand spikes and shrinks when traffic drops. At 3am on a Sunday, you’re not paying for capacity nobody is using. That’s resilience and cost efficiency coming from the same decision.
What the SRE Can Actually Do
Let me get specific, because this is where it gets practical.
Capacity forecasting is already an SRE practice. Understanding how much compute a service needs today, and how that changes with new demand, is core to the role. That same analysis also tells you if you’re running 40% over-provisioned right now — and paying for it every hour.
From there, it’s a natural step to ask: could we use a cheaper disk for this workload? Do we actually need an Intel x86 instance, or would an AMD give us the same performance at a lower price? Are we on AWS? Graviton (ARM-based) instances are significantly cheaper for the right workloads — and many services run perfectly on them.
The SRE can also work with the team on capacity commitments. When you have visibility into capacity trends, you can confidently commit to a certain level of compute and unlock 20-60% in discounts for 1-3 years. That conversation starts with data the SRE already has.
And then there’s backup/DR/Contingency strategy. Based on your RTO and RPO, what’s the cheapest approach that still meets the requirement? Are you paying for hourly snapshots on a service where 24-hour recovery is perfectly acceptable? Are old backups sitting around with no retention policy? These are questions the SRE is already positioned to answer — they just require being looked at through a cost lens as well.
Of course, the SRE still needs business context to make the right calls. But the technical foundation is already there.
Recently I was at the AWS User Group meetup in Campinas talking about cost reduction. And the part that stuck with me wasn’t what I said — it was the questions. Engineers weren’t asking “isn’t this a FinOps thing?” They were asking “how do I start this conversation at my company?” and “where do I even begin?” That told me everything. SREs already have the mindset and the technical tools to tackle this. What’s missing is the confidence to own it.
This Is Only Getting More Relevant
Since the early days of cloud adoption, cost efficiency has been a growing priority. And now, with AI agents simplifying infrastructure orchestration and teams doing more with fewer people, the question of “how much are we optimizing?” is coming up more and more.
The SRE who can connect reliability decisions to cost outcomes — who can say “this architecture choice improves resilience and reduces the bill” — is the one who gets listened to beyond the incident room.
You don’t need a dedicated FinOps team to get started. Start with your own systems and ask yourself:
- Are you using the right compute, or are you oversized?
- Are there resources provisioned for old projects that were never decommissioned?
- Do you have a retention policy, or is old backup data just accumulating?
- Have you evaluated other processor architecture (intel vs amd vs arm) for your workloads?
- Do you know what your top cost drivers are right now?
- Are you paying for a proprietary database engine where an open-source alternative would do the job?
- Does your workload actually need VMs, or would containers or a serverless solution be a better fit?
- Are you applying efficient elasticity to your resources, or is your infrastructure static regardless of demand?
- Do you have cost alerts configured at different granularities — per service, per resource, usage trends, and budget forecasts?
If you can’t answer those questions confidently, that’s your starting point.
So, What’s My Take?
Cost and resilience aren’t competing priorities — they’re two sides of the same decision. The best infrastructure recovers fast, scales on demand, and doesn’t burn budget when no one is watching. You already have the technical foundation to think about both. All it takes is looking at your systems with that second lens open.
According to the Flexera State of the Cloud Report 2026, 29% of cloud spend is wasted — and after five years of decline, that number went back up, driven by growing cost complexity from AI and new IaaS and PaaS services. That’s not a FinOps problem. That’s an engineering problem. And it’s sitting inside the same systems you’re already responsible for.
Basically… the waste is just there waiting for you to get fixed.
I’d love to hear your take on this. Do you think the SRE should avoid absorbing cost optimization and just hand it off to someone who has no idea how your application actually works? Or does it make more sense to keep that responsibility close to the people who truly understand the system?
Reply and share your thoughts — I read every response.
Cheers,
Douglas Mugnos
MUGNOS-IT 🚀