Cloud Costs Optimization Guide
This article is a good starting point if you want to optimize your cloud costs, be it on Microsoft Azure, Google Cloud Platform, Amazon Web Services, or another cloud provider. Unlike the authors of most similar articles, I’m not employed by a cloud or software vendor, so I don’t have a conflict of interest and am not going to promote any specific cloud service or tool. I do provide my own consulting services you might consider using, but the article can also be used as an end-to-end guide for your own efforts in cloud cost optimization.
💡 Understand where costs come from
What do you pay for?
In the cloud you pay for the right to use resources provided by someone else. These resources can take different forms, but generally we are speaking about the following types.
Software and added-value services either provided by the cloud vendor or third parties.
How do you pay for it?
There are different pricing models, but commonly you pay for one of these.
Software licenses being included or not;
Features available or different pricing tier (Basic, Standard, Premium etc);
Conditions of use (SLA, Spot instances, Reserved instances etc).
You might also pay on different terms. For example:
Pay as you go;
Upfront monetary commitment plans.
Obviously, consuming fewer vCPUs or less RAM, storage, or traffic might let you pay proportionally less or even less than that, but other details are also important. For example, you might be able to switch your Windows Server virtual machine from “license included” pricing to the “bring your own license” model and from pay-as-you-go to a reserved instance and make your bill, say, 4 times smaller without changing the size of the VM.
Fine-tuning your cloud spending starts with gaining a solid understanding of how exactly your costs are generated.
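To make the license and reservation example above more tangible, here is a minimal sketch of the arithmetic. All rates are hypothetical placeholders, not real prices from any provider, and they were picked so that the ratio comes out at 4. The function name and its parameters are my own illustration.

```python
# Hypothetical hourly rates in USD; not real prices from any provider.
HOURS_PER_MONTH = 730  # common approximation: 8760 hours / 12 months


def monthly_cost(compute_rate: float, license_rate: float = 0.0,
                 commitment_discount: float = 0.0) -> float:
    """Monthly cost of one VM that runs all month.

    compute_rate: pay-as-you-go price of the bare VM per hour
    license_rate: hourly surcharge for an included software license
    commitment_discount: fraction taken off compute for a reservation
    """
    discounted_compute = compute_rate * (1.0 - commitment_discount)
    return (discounted_compute + license_rate) * HOURS_PER_MONTH


# Same VM size, two ways of paying for it.
license_included_payg = monthly_cost(compute_rate=0.20, license_rate=0.20)
byol_reserved = monthly_cost(compute_rate=0.20, commitment_discount=0.50)

ratio = license_included_payg / byol_reserved
print(f"{license_included_payg:.2f} vs {byol_reserved:.2f}, ratio {ratio:.1f}x")
```

Replace the placeholder rates with the numbers from your own price sheet to see which lever matters most for you. Note that a "bring your own license" setup still requires you to own that license, which this sketch does not price in.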
🔍 Make the costs visible
The goal here is to understand exactly which costs and charges come from where. It’s easier if you start implementing your cloud with this visibility requirement in mind, and it might be more tedious if you let things drift by themselves for a significant amount of time or if you have a bigger cloud environment.
Understand how the bill is structured
Commonly, you receive an invoice from your cloud provider monthly. This bill has line items that might map to parts of your cloud environment or the terms of your contract with the provider. Also check whether all your recurring costs appear in the monthly bill. Some services might have a longer billing period, or there might be upfront costs you need to be aware of to judge how much you actually spend on the cloud.
Usually you have a higher-level abstraction that is used to aggregate costs for resources: subscriptions for Azure and projects for GCP. For AWS the closest is an account. Consider optimizing the high-level structure of your invoice to make cost control easier, e.g., by mapping Azure subscriptions or GCP projects to cost centers. With AWS you can use separate accounts, consolidate their billing under AWS Organizations, and use the management account to oversee the costs.
You might also encounter resource-based charges or some other “service” charges like premium licenses for Azure Active Directory. You will need to research who is responsible for all these “services” costs and maintain your own records on this. A simple spreadsheet will do.
When it comes to “resources”, which will most often form the bulk of your costs, I explain some approaches below.
Figure out who consumes what
Having just a total amount, regardless of how big or small it is, doesn't help in figuring out what the money is actually spent on. The good news is that all the major cloud vendors provide ways to view costs based on some common characteristics. Here are the ones you can find pretty much everywhere.
“Charge type” e.g. upfront fee vs monthly consumption based etc;
“Dimensions” like location or resource type might already give you some general idea of what you spend money on, but this might not be enough. Suppose, using an AWS example, all you know is that it’s EC2 and US East, and this is the only region you use. To really get control over spending and understand your costs in terms that make sense to your business, you need to leverage tags.
Ode to Tags
All major cloud providers also support a mechanism that actually lets you track what you spend on, in your own terms. It’s called tags in Azure and AWS and labels in GCP, but in a nutshell they work the same for billing purposes. First you “tag” resources with your own custom tags, then you are able to see your expenditure split by tags. Good tag examples would be:
Environment (e.g. Development, Staging, Production);
Cost center (in business sense);
Task-based resource groups (like all the resources supporting an application).
Of course, you can come up with many more useful tag ideas to both track costs and help you govern your resources using automation or policies.
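The idea of splitting expenditure by tag can be sketched in a few lines. The line items below are hypothetical and far simpler than a real billing export; the field names (`resource`, `cost`, `tags`) and the `cost_by_tag` helper are assumptions made for illustration, not any provider's schema.

```python
from collections import defaultdict

# Hypothetical billing line items; real exports have many more columns.
line_items = [
    {"resource": "vm-web-1", "cost": 120.0,
     "tags": {"environment": "production", "cost_center": "sales"}},
    {"resource": "vm-web-2", "cost": 80.0,
     "tags": {"environment": "staging", "cost_center": "sales"}},
    {"resource": "db-main", "cost": 300.0,
     "tags": {"environment": "production", "cost_center": "finance"}},
    {"resource": "disk-old", "cost": 15.0, "tags": {}},  # untagged resource
]


def cost_by_tag(items, tag_key, untagged_label="(untagged)"):
    """Sum costs per value of one tag; untagged items get their own bucket."""
    totals = defaultdict(float)
    for item in items:
        value = item["tags"].get(tag_key, untagged_label)
        totals[value] += item["cost"]
    return dict(totals)


by_environment = cost_by_tag(line_items, "environment")
by_cost_center = cost_by_tag(line_items, "cost_center")
print(by_environment)
print(by_cost_center)
```

Keeping a separate bucket for untagged resources is a deliberate choice: its size tells you how much of your bill you still cannot attribute to anyone.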
Use the Tools
The tools provided by vendors are often good and let you filter your costs based on the categories above; at least this is true for Azure, AWS, and GCP. Azure has “Cost Management + Billing”, AWS has the Billing Console, and GCP has Cloud Billing Reports.
Out-of-the-box tools are usually enough and additional tools are often not needed unless you know why you need them exactly.
🔨 Act to minimize the costs
We now know where the money is going and what we want to minimize. Below, we will cover some actions you can take to achieve it, but before we dig into the details I want to remind you about the 80/20 rule, which is very important for optimization tasks. The problem is that with a range of resources in your cloud environment, there is usually a wide range of actions that can be taken and strategies that can be employed. The rule implies that you can usually get 80% of the results with 20% of the effort. So the solution is to figure out what brings the best results for the time and money invested, and do just that.
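One simple way to apply the 80/20 rule is to rank candidate actions by estimated savings per hour of effort and take the best ones that fit into the time you have. The sketch below does this with a greedy heuristic (not an optimal selection); the actions, savings, and effort figures are invented for illustration.

```python
# Hypothetical candidate actions: estimated monthly savings (USD)
# and estimated effort (hours). Replace with your own estimates.
actions = [
    {"name": "delete unused disks", "savings": 400.0, "effort_hours": 2.0},
    {"name": "right-size VMs", "savings": 900.0, "effort_hours": 8.0},
    {"name": "buy reservations", "savings": 1500.0, "effort_hours": 4.0},
    {"name": "rearchitect app", "savings": 2000.0, "effort_hours": 200.0},
    {"name": "tune free tier usage", "savings": 30.0, "effort_hours": 10.0},
]


def prioritize(candidates):
    """Order actions by savings per hour of effort, best first."""
    return sorted(candidates,
                  key=lambda a: a["savings"] / a["effort_hours"],
                  reverse=True)


def top_actions(candidates, effort_budget_hours):
    """Greedily pick the best-ratio actions that fit the effort budget."""
    chosen, spent = [], 0.0
    for action in prioritize(candidates):
        if spent + action["effort_hours"] <= effort_budget_hours:
            chosen.append(action)
            spent += action["effort_hours"]
    return chosen


plan = top_actions(actions, effort_budget_hours=20)
print([a["name"] for a in plan])
```

With these made-up numbers, the plan uses 14 of the 224 total hours and still captures 2,800 of the 4,830 dollars in estimated monthly savings.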
That’s it. You look around and find out what’s not used at all: unused virtual machines, orphaned disks and backups, sometimes entire environments that are no longer needed or are not needed at the moment, and similar. Delete them and you stop paying for them.
Do you really need 4 virtual cores and 16 gigabytes of RAM for a small application? Maybe 800 MB and a shared core will be enough? Inefficiencies like this commonly happen for two reasons. First, right-sizing a resource is a tough task to do ahead of time, when actual usage patterns are not exactly clear. Seriously, if you have never reviewed and resized your resources, do it right now; it will likely save you a lot of money. Second, environments and configurations drift over time, and sometimes you find yourself with capacity overprovisioned for a need that is no longer there. A noteworthy special case of right-sizing is creating or fine-tuning scale-down rules for the resources that support autoscaling.
Do you really need the “Premium” tier, or will the “Basic” pricing tier work? Consuming a tier too expensive for the job is generally harder to spot because it is often not about the amount of resources you consume but about the features you do not necessarily use or can do without. I would recommend going through your detailed bill items and asking yourself whether you need the extra features or not.
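A right-sizing review can be reduced to a simple rule: take the observed peak usage, add some headroom, and pick the cheapest size that still fits. Here is a minimal sketch under that assumption. The size catalog, prices, and the 30% headroom default are hypothetical, and real decisions should also look at disk, network, and burst behavior.

```python
# Hypothetical size catalog, ordered from cheapest to most expensive.
SIZES = [
    {"name": "small", "vcpus": 1, "ram_gb": 2, "monthly_usd": 15.0},
    {"name": "medium", "vcpus": 2, "ram_gb": 8, "monthly_usd": 60.0},
    {"name": "large", "vcpus": 4, "ram_gb": 16, "monthly_usd": 120.0},
]


def right_size(peak_vcpus, peak_ram_gb, headroom=0.3):
    """Return the cheapest size covering observed peaks plus headroom.

    Returns None when nothing in the catalog is big enough.
    """
    needed_vcpus = peak_vcpus * (1 + headroom)
    needed_ram_gb = peak_ram_gb * (1 + headroom)
    for size in SIZES:  # the catalog is sorted by price, so first fit wins
        if size["vcpus"] >= needed_vcpus and size["ram_gb"] >= needed_ram_gb:
            return size
    return None


# A small application that peaks at half a core and 0.8 GB of RAM.
suggestion = right_size(peak_vcpus=0.5, peak_ram_gb=0.8)
print(suggestion["name"], suggestion["monthly_usd"])
```

The peaks should come from monitoring data collected over a representative period, not from a single quiet afternoon.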
Changing resource locations
Data centers are different. Because of this, cloud providers might charge different prices for exactly the same resource depending on the location. Sometimes this is also a trade-off. For example, you can place your virtual machines in a cheaper location, but you will pay more for bandwidth. To come up with the right decisions, you need to do some “napkin math” and experiment with the pricing calculator of your cloud provider (Azure, GCP, AWS), and also consider other things like latency.
Scheduling & automation
Do you actually need to run that virtual machine 24/7? How much can you save by turning it on and off? If the prospective savings justify it, you can use a script to do it automatically. You can also consider scaling resources vertically and horizontally on a schedule. Speaking about virtual machines specifically, cloud providers also support “spot instances”, which can be evicted at any moment and are much cheaper. When used together with “restart automation” (there are many out-of-the-box and custom options), this might be an extremely cost-effective solution for a range of workloads like ETL, batch processing, training machine learning models, and similar long-running workloads.
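The “napkin math” for the location trade-off might look like the sketch below. The two regions and all prices are hypothetical; the point is only that the cheaper region for compute can become the more expensive one once outbound traffic grows.

```python
HOURS_PER_MONTH = 730  # common approximation: 8760 hours / 12 months

# Hypothetical per-region prices in USD; not real provider prices.
REGIONS = {
    "region-a": {"vm_hourly": 0.10, "egress_per_gb": 0.05},
    "region-b": {"vm_hourly": 0.08, "egress_per_gb": 0.12},
}


def monthly_total(region, vm_count, egress_gb):
    """Compute plus outbound traffic cost for one month in one region."""
    prices = REGIONS[region]
    compute = vm_count * prices["vm_hourly"] * HOURS_PER_MONTH
    traffic = egress_gb * prices["egress_per_gb"]
    return compute + traffic


def cheapest_region(vm_count, egress_gb):
    """Name of the region with the lowest combined monthly cost."""
    return min(REGIONS, key=lambda r: monthly_total(r, vm_count, egress_gb))


# Same four VMs, light versus heavy outbound traffic.
print(cheapest_region(vm_count=4, egress_gb=100))
print(cheapest_region(vm_count=4, egress_gb=2000))
```

Latency, data residency, and service availability are not modeled here and can easily outweigh the price difference.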
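To judge whether a start/stop schedule is worth automating, it helps to estimate the share of compute hours it removes. A minimal sketch, assuming the machine is billed per hour of running time and costs nothing for compute while stopped (attached disks are usually still billed and are ignored here):

```python
HOURS_PER_WEEK = 24 * 7  # 168


def scheduled_savings_fraction(hours_per_day, days_per_week):
    """Fraction of always-on compute hours saved by running on a schedule."""
    if not (0 <= hours_per_day <= 24 and 0 <= days_per_week <= 7):
        raise ValueError("schedule must fit into a week")
    running_hours = hours_per_day * days_per_week
    return 1 - running_hours / HOURS_PER_WEEK


# A development VM needed 10 hours a day on working days only.
fraction = scheduled_savings_fraction(hours_per_day=10, days_per_week=5)
print(f"compute hours saved: {fraction:.0%}")
```

In this example the machine runs 50 of 168 hours, so roughly 70% of the compute hours go away.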
Reserved instances, savings plans and discounts
All major cloud providers offer a better price if you commit to more. If you do need your resources for a prolonged period of time and can anticipate the utilization, this might enable significant savings, sometimes as large as 70% of the “pay as you go” price.
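Whether a commitment pays off depends on how much of the committed capacity you will actually use. Under a simplified model where you pay for the commitment around the clock at a discounted rate, while pay-as-you-go charges only for the hours used, the commitment wins once expected utilization exceeds one minus the discount. The sketch below encodes that rule; the model and function names are my own simplification, not any provider's pricing formula.

```python
def break_even_utilization(discount):
    """Utilization above which a commitment beats pay-as-you-go.

    discount: fraction off the pay-as-you-go rate, e.g. 0.4 for 40%.
    Commitment cost per hour slot: (1 - discount) * rate, paid always.
    Pay-as-you-go cost per hour slot: utilization * rate.
    """
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1 - discount


def cheaper_option(expected_utilization, discount):
    """Pick the cheaper billing mode for the expected utilization."""
    if expected_utilization > break_even_utilization(discount):
        return "commitment"
    return "pay-as-you-go"


# With a 40% discount the resource must be in use more than 60% of the time.
print(cheaper_option(expected_utilization=0.9, discount=0.4))
print(cheaper_option(expected_utilization=0.3, discount=0.4))
```

The deeper the discount, the lower the utilization you need to justify the commitment, but the longer you are usually locked in.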
Using another resource type or billing mode
Sometimes you can rehost your application on another resource which is either cheaper or billed differently, e.g., on a consumption basis instead of paying for capacity. An example is going from hosted Kubernetes (AKS, GKE, EKS) to a serverless offering supporting containers (Azure Container Apps, Google Cloud Run, AWS Fargate).
If your resources have different usage patterns or just don’t require much, they might be grouped together into a single Platform as a Service offering. Using database resources as an example, for Azure this would be SQL Elastic Pool. This strategy can also be used with Infrastructure as a Service resources.
Minimizing licenses consumed
This is especially important for licenses which are included with cloud services. Sometimes you might use your own license; sometimes you might avoid purchasing an extra one. Paying close attention to your licensing terms alone can easily result in saving thousands per month. Following up on the previous Azure Active Directory example, you might ask yourself whether you need this exact number of, say, Azure AD Premium P2 licenses. This might save a few dozen dollars a month here and there. In a few years, you might end up with thousands saved.
Optimizing for requirements
YAGNI - You Ain’t Gonna Need It. This is about re-thinking non-functional requirements for your applications and setting priorities. The cookie-cutter approach sure saves time, but do you really need the same resources you use for production in your development environment? Do you need a QA environment at all when you are not actively testing anything? Does this system actually need to be highly available, or was it designed that way as a result of blindly following “best practices”? What is the target SLA for your application? Are the resources and approaches used justified? Do you really need geographically distributed storage and premium network-level optimization? Understanding what you are actually doing is a great way to avoid spending too much.
Taking advantage of free tiers
All major providers want you to try different services, so they offer some limited use of selected resources for free. You can consider structuring your resources so that a significant portion fits below the free threshold. This savings strategy is probably the hardest to manage and has limited applicability, which is why I list it last.
📉 Keep the costs down
OK, so you reviewed your spending and took measures to minimize it, but how do you make sure it doesn’t go back up? Let’s get this straight. Controlling your costs is a management task, and it cannot possibly be fully solved by a tool. You will need to have certain practices in place and dedicate your time to implementing them. If you need training on how to do it, you can use the resources cloud providers give you (Azure, GCP, AWS), or ask me for custom training for your team. Here are things you can do to implement these practices yourself.
Maintain cost visibility
Keeping costs visible requires having certain things in place.
Make cost tracking a part of your formal cloud strategy.
Assign owners responsible for parts of your cloud environment. Use permissions to restrict who can spend money and where.
Create resources in well defined scopes.
Make sure resources are properly tagged/labeled, so that the costs could be told apart later.
Set up automated notifications and actions
Some automation is already available, but you can also use your own custom scripts in more sophisticated scenarios.
Use alerts to get notifications about expenditure above a certain threshold. Use multiple alerts triggered at different levels.
Set up budgets to get alerted and to shut down resources in an automated way when possible.
Limit autoscale unless you know exactly what you are doing. This is especially important when using serverless offerings as they are generally designed to scale in an explosive manner and can start to spend massive amounts of funds in a matter of minutes.
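The multi-level alerting idea from the list above can be prototyped in a custom script before you wire it into a provider's budget feature. This is a minimal sketch: the threshold levels, the function names, and the returned action strings are assumptions for illustration, and the actual notification or shutdown step is left out.

```python
def crossed_thresholds(spend, budget, thresholds=(0.5, 0.8, 1.0)):
    """Return the budget fractions already reached by current spend."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = spend / budget
    return [t for t in sorted(thresholds) if ratio >= t]


def next_action(spend, budget):
    """Map the highest crossed threshold to a hypothetical action."""
    crossed = crossed_thresholds(spend, budget)
    if not crossed:
        return "ok"
    highest = crossed[-1]
    if highest >= 1.0:
        return "shut down non-critical resources"
    return f"notify owners: {round(highest * 100)}% of budget used"


# Month-to-date spend checked against a monthly budget of 1000 USD.
print(next_action(spend=100, budget=1000))
print(next_action(spend=850, budget=1000))
print(next_action(spend=1200, budget=1000))
```

Running a check like this on a schedule against month-to-date spend gives you an early warning at the lower levels and a last line of defense at the top one.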
Resources are provisioned by people, and costs should be reviewed by people. This means that you might need to schedule regular reviews. You need to establish a FinOps process, agree on exact benchmarks and tools, and use them over and over again as a part of your business operations.
💭 What’s next?
So, you minimized the costs and have a plan to keep them down. How can you optimize your cloud costs even further? There are several ways.
Consider migrating to/from on-premises or to another cloud
Sometimes the grass is actually greener on the other side. If your expected savings justify the consulting project, you can actually move all or part of your workloads to where it’s cheaper to run them. It’s okay to be multi-cloud. It’s also okay to run something on-premises if it’s cheaper this way. All the major cloud providers support hybrid scenarios.
Rearchitect to unlock new hosting options
Let’s suppose you have a .NET Framework application and you run it in an Azure App Service or on a Windows virtual machine. If you rearchitect it to .NET Core (or .NET 5+), you will be able to host it in a Linux container, using a cheaper Kubernetes hosting option or a serverless offering like Azure Container Apps. Additionally, you will be able to host more applications using the same capacity, and rearchitecting a single application like this might easily result in savings of hundreds or even thousands of dollars per month.
Optimize your applications
Sometimes you can refactor your application to lower the usage of
Backend or database resources.
Apart from just fixing inefficiencies, you can:
Eliminate same work being done in multiple places;
Use more effective protocols;
Change your containers’ base image to a smaller and more efficient one.
Automate managing capacity beyond out-of-the-box
Some resources are charged on a usage basis but don’t have sufficient automated downscaling mechanics provided by the vendor. Think about things like shrinking capacity during off-hours for PaaS databases where possible and using spot instances with robust restart mechanics. EKS, AKS, and GKE can use spot VMs as nodes and automate their life cycle.
Optimize for other important factors
Your costs don’t necessarily come only from the cloud provider bill. You can take a more holistic view and try to minimize your operations costs in general.
Optimize for lower maintenance/human labor consumption.
Minimize potential losses due to risks. Take into account the SLAs you have with your own clients and their financial implications.
Minimize the scope of your project.
Replace a part of your environment with SaaS.
Train your personnel in FinOps to keep your costs controlled.
Save money on spending too much time optimizing your environment! 😉
Now you know how to make your cloud costs visible, proactively minimize them, and keep them under control in the long run. There is no course of action that is good for everyone. You might consider doing everything by yourself, starting with educating your employees, or using professional services. Regardless of what you decide to do, I wish you good luck in your savings journey. Do let me know if this guide proves to be useful for you.