The Two Modes Of FinOps

  • December 12, 2023
There are two modes of FinOps that I like to reference when discussing FinOps as a concept.
The first is the retroactive model. This is easy. Log in to your infrastructure backplane and look at your baseline. What do you have? Are you using all the CPU and RAM, or can you right-size some instances? No logical changes to the architecture are going to happen here. Just figure out how big everything is supposed to be and how much you're supposed to pay for it. This is the baseline.
The second is the proactive design of the infrastructure with total cost of ownership in mind. I wanted to put together a few cases where the two models might play off of each other, or transition from one to the other.
For example, the ECS/Fargate vs K8s question. We know that ECS on Fargate is a quick, dead simple, cheap way to start running containers. You can stand up an Elastic Container Registry, stand up ECS Fargate, and launch 10 Docker containers and you're good. We also know that right-sizing applies here: if you're running 50 Fargate tasks, you might be able to limit the amount of CPU and RAM they use. Right-sizing your auto-scaling would work, and you would save money. However, you could potentially save a lot more by taking a more active approach, which is to get involved in the architecture and its design. That means knowing that once you're running more than 24 Fargate tasks, it becomes cheaper to run EKS. And this is not something that any tool is going to be able to tell you.
It's not like you hit 24 workloads and immediately the AWS bill magically becomes unreasonable. But there is a point where that core architectural decision needs to be made.
Keep in mind that total cost of ownership includes labor, but here it still takes about the same number of hours for someone to build and maintain the infrastructure either way.
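The break-even point described above can be sketched with a little arithmetic. This is a minimal sketch, not a pricing calculator: every dollar figure and the `tasks_per_node` packing density are hypothetical placeholders, so substitute numbers from your own bill. It models Fargate as a flat per-task cost and EKS as a fixed control-plane fee plus whole worker nodes, which is why the crossover behaves like a staircase rather than a single sharp line.

```python
import math


def fargate_monthly_cost(tasks: int, cost_per_task: float) -> float:
    """Fargate bills per task, so cost grows linearly with task count."""
    return tasks * cost_per_task


def eks_monthly_cost(tasks: int, control_plane_cost: float,
                     node_cost: float, tasks_per_node: int) -> float:
    """EKS is modeled as a fixed control-plane fee plus whole worker nodes."""
    nodes = math.ceil(tasks / tasks_per_node)
    return control_plane_cost + nodes * node_cost


def stable_break_even(cost_per_task: float, control_plane_cost: float,
                      node_cost: float, tasks_per_node: int,
                      max_tasks: int = 500):
    """Smallest task count from which EKS stays cheaper, up to max_tasks.

    Returns None if Fargate is still at least as cheap at max_tasks.
    """
    last_fargate_win = 0
    for tasks in range(1, max_tasks + 1):
        fargate = fargate_monthly_cost(tasks, cost_per_task)
        eks = eks_monthly_cost(tasks, control_plane_cost,
                               node_cost, tasks_per_node)
        if fargate <= eks:
            last_fargate_win = tasks
    if last_fargate_win == max_tasks:
        return None
    return last_fargate_win + 1


# Hypothetical monthly prices, for illustration only. Not real AWS rates.
ASSUMED = dict(cost_per_task=30.0, control_plane_cost=73.0,
               node_cost=120.0, tasks_per_node=8)

if __name__ == "__main__":
    print(stable_break_even(**ASSUMED))
```

With these placeholder inputs the crossover lands at 11 tasks, not 24. The point is the shape of the calculation, not the specific number: your own task sizes, node types, and packing density decide where the line actually falls.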
The more challenging cases are when an environment is running a metric ton of Node.js or Python code in Lambdas, with 5000 of them executing ad hoc. That's where there isn't much potential right-sizing to be done. But there is a point where, if the only thing you're looking at is cost, rethinking your core architecture becomes a reasonable proposition.
Lambda cost calculations are fun. Take your Lambda bill and think of it as paying for milliseconds of execution, 24 hours a day, seven days a week. There's a certain cost for that, and at some point serving that same number of calls from an EC2 instance becomes cheaper. In some cases, significantly so.
There's a little bit of a challenge in terms of that sort of re-architecture because you're going to need somebody to do it. It's a little bit more difficult to split that up. But that conversion point is still there. It just happens a little later. It's not a one-for-one labor cost change.
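The Lambda conversion point can be sketched the same way. Lambda bills for duration multiplied by configured memory (GB-seconds) plus a per-request charge, so a per-invocation cost falls out directly. The rates, the EC2 monthly cost, and the `amortized_labor` figure below are all hypothetical placeholders, and the sketch assumes the EC2 instance can actually absorb the load.

```python
def lambda_cost_per_invocation(avg_duration_ms: float, memory_mb: float,
                               price_per_gb_second: float,
                               price_per_million_requests: float) -> float:
    """Cost of one invocation: compute (GB-seconds) plus the request fee."""
    gb_seconds = (avg_duration_ms / 1000.0) * (memory_mb / 1024.0)
    return (gb_seconds * price_per_gb_second
            + price_per_million_requests / 1_000_000)


def break_even_invocations(ec2_monthly_cost: float,
                           per_invocation_cost: float,
                           amortized_labor: float = 0.0) -> float:
    """Monthly invocations at which Lambda spend equals the EC2 alternative.

    amortized_labor is the monthly share of the re-architecture effort;
    adding it pushes the break-even point later, it does not remove it.
    """
    return (ec2_monthly_cost + amortized_labor) / per_invocation_cost


if __name__ == "__main__":
    # Hypothetical inputs: 100 ms average duration at 512 MB, assumed rates.
    per_call = lambda_cost_per_invocation(100, 512, 0.00002, 0.20)
    print(break_even_invocations(60.0, per_call))
    print(break_even_invocations(60.0, per_call, amortized_labor=60.0))
```

Below the break-even volume Lambda is the cheaper option. Above it, the fixed-cost instance wins, and the labor term shows why the conversion point just happens a little later.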
And then there are examples where the labor cost changes. For instance, say your DBAs are managing a three-availability-zone MariaDB Galera cluster, and they've built it to a certain size because they're bouncing Power BI off of it and doing reports. Chances are, they're throwing a lot of money at it to keep your master servers up and running. It might be cheaper for you to switch to a two-availability-zone cluster on a much smaller instance, and then have one large Spot instance that is read-only, just for reporting.
That becomes a lot cheaper if your database cluster size is being driven by an asynchronous reporting need. You can save a lot of money there as well. But a static analysis of the infrastructure via a FinOps tool isn't going to tell you that. It will tell you that you're only using 48% of the CPU, but it won't tell you that you could go down to 20% of the CPU if you offload the reporting workload. And the same DBAs who run one cluster today can still run one cluster. It just means changing the configuration of one of the nodes so that it's a read-only replica instead of an active master.
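To make the database comparison concrete, here is a rough sketch of the arithmetic. The hourly rates are invented for illustration (including the Spot rate for the large reporting node), and 730 hours per month is the usual approximation, so treat the output as a template rather than a quote.

```python
HOURS_PER_MONTH = 730  # common approximation for an average month


def cluster_monthly_cost(nodes) -> float:
    """nodes is a list of (count, hourly_rate) pairs."""
    return sum(count * rate for count, rate in nodes) * HOURS_PER_MONTH


# Hypothetical layouts and hourly rates, for illustration only.
current = [(3, 2.00)]               # three large nodes sized for reporting
proposed = [(2, 0.50), (1, 0.60)]   # two small nodes + one large Spot replica

if __name__ == "__main__":
    savings = cluster_monthly_cost(current) - cluster_monthly_cost(proposed)
    print(f"Estimated monthly savings: ${savings:,.2f}")
```

A utilization report alone would never suggest the `proposed` layout, because it requires knowing that reporting, not transactional load, is what drives the cluster size.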
Whichever way you go, developing a workable FinOps model requires the architecture and engineering skills to land on the balance of tools that gives you the most cost-effective solution. That should always be the first step.
There are some great FinOps tools that do analysis, anomaly detection, right-sizing, and all those things. They're good, but there are cases when you are like a mechanic using a very expensive laptop to flash a Toyota Camry to try to get it to go 200 miles an hour. I'm sure that's possible. There's probably someone out there that makes software that will help you do it. But in reality, go buy a Porsche. Stop trying to modify your Camry.
I look at FinOps as the finishing detail work that gets you the most out of what you've got. Regardless of what you've got as a FinOps tool, we'll figure out how to price it so that you can get the most bang for the buck on that technology. But if you're using FinOps to tighten up a poorly designed infrastructure, you're still going to be wildly overpaying.
If you'd like to hear more of my thoughts on FinOps, check out the podcast I was on at re:Invent 2023 where I discuss the topic with Jon Myer and Steve Robinson from Hyperglance.