Multicloud Series: 8 Best Practices for Reducing Cloud Spend – AWS, Google Cloud, and Azure Part III of IV

Make sure to come back every Wednesday for more tech shorts, how-tos, and deep dives into engineering tools and processes.

Digitalization and the need to adapt rapidly to changing market demand have caused a rise in the requirements and expectations that are placed on businesses. Many companies find it challenging to accommodate and adapt to these trends by using existing infrastructure and processes.

At the same time, IT departments find themselves under scrutiny and pressure to improve product performance, improve cost-effectiveness, and meet user demands, making it difficult to justify additional investments to extend and modernize systems and tools.

A hybrid cloud strategy provides a pragmatic solution.

  • By using the public cloud, you can extend the capacity and capabilities of your organization without any upfront investment.
  • By adding more than one cloud (multi/hybrid) to your existing infrastructure, you preserve your existing investment, and increase agility, resilience, security, flexibility & scalability.

A hybrid cloud strategy gives you the flexibility to modernize applications and processes incrementally as your resources allow. In practice, however, organizations often move to multiple clouds without implementing the management, governance, and automation measures that actually lead to cost savings. Another reason for overspending in the cloud is the misconception that you only pay for what you use; in reality, you pay for what you provision. If an organization over-provisions resources with more capacity than it needs, or fails to de-provision resources once it has finished using them, it will continue to pay for everything that remains provisioned, whether or not it is actually used.

There are many scenarios where organizations can overspend in the cloud, so we strongly recommend having a senior DevOps engineer or managed services provider closely monitor your cloud environments and identify places where efficiency can be improved.

We will go over the 8 most common scenarios and the best practices organizations can adopt to reduce cloud spend, regardless of whether it's AWS, Azure, or Google Cloud.

1. Delete unattached disk storage

When a Virtual Machine (VM) is launched, disk storage is usually attached to act as the local block storage for the application. When you terminate a VM, the disk storage isn't deleted by default; this is a safety precaution against data loss. However, because it is not deleted, it remains active and continues to incur a full-price charge even though it is no longer used - and this applies to both Google Cloud and Azure.

Pro Tip

We recommend deleting disk storage once it has been detached for two weeks. At that point it has usually been "forgotten about," and it's unlikely the same storage will be used again.
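The two-week rule can be automated. Below is a minimal Python sketch of the filtering logic, assuming you have already exported an inventory of disks; the field names, disk names, and dates are hypothetical, and in practice the records would come from your provider's CLI or API:

```python
from datetime import datetime, timedelta, timezone

def find_deletable_disks(disks, max_detached_days=14, now=None):
    """Return disks that are unattached and have been detached for longer
    than max_detached_days (the two-week "forgotten about" threshold)."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_detached_days)
    return [d for d in disks
            if d["attached_to"] is None and d["detached_at"] < cutoff]

# Hypothetical inventory, as you might export from your provider's tooling.
disks = [
    {"name": "old-data", "attached_to": None,
     "detached_at": datetime(2024, 5, 1, tzinfo=timezone.utc)},
    {"name": "recently-detached", "attached_to": None,
     "detached_at": datetime(2024, 5, 25, tzinfo=timezone.utc)},
    {"name": "in-use", "attached_to": "vm-1", "detached_at": None},
]
# Only "old-data" has been detached for more than two weeks.
deletable = find_deletable_disks(
    disks, now=datetime(2024, 6, 1, tzinfo=timezone.utc))
```

The review-before-delete step still matters: the function only produces candidates, and a human (or a tagging policy) should confirm before anything is actually removed.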

2. Delete aged snapshots and images

Many teams use snapshots and images to create a point-in-time recovery point in case of data loss or disaster. The storage of these recovery points on its own isn't costly, but costs can quickly get out of control if their number isn't monitored. It's easy to lose track because users can configure settings to automatically schedule snapshots and images hourly or daily without also scheduling the deletion of older ones.

Pro Tip

Set a standard policy for how many snapshots or images should be retained per object and for how long (anything more than two years old is a good candidate for deletion), and keep in mind that the majority of the time, recovery will occur from the most recent snapshot.
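A retention policy like this is easy to express in code. The following Python sketch keeps the most recent snapshots per object and flags only the aged remainder for deletion; the keep-count, age threshold, and snapshot names are illustrative assumptions:

```python
from datetime import datetime, timedelta, timezone

def snapshots_to_delete(snapshots, keep_latest=7, max_age_days=730, now=None):
    """Per-object retention policy: always keep the keep_latest most recent
    snapshots; of the rest, flag anything older than max_age_days (~2 years).
    Snapshots are (name, created_at) tuples."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    ordered = sorted(snapshots, key=lambda s: s[1], reverse=True)
    # Everything past the keep window is a candidate, but only aged
    # snapshots are actually flagged for deletion.
    return [name for name, created in ordered[keep_latest:] if created < cutoff]

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
snaps = [(f"daily-{i}", now - timedelta(days=i)) for i in range(10)]
snaps.append(("ancient", datetime(2021, 1, 1, tzinfo=timezone.utc)))
doomed = snapshots_to_delete(snaps, now=now)  # only "ancient" is flagged
```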

3. Terminate Zombie Assets

Zombie assets are infrastructure components that are running in your cloud environment but are not being used for any purpose. They can come in many shapes and sizes like:

  • Storage volumes
  • Aged snapshots
  • Compute infrastructure
  • Databases
  • Disassociated IPs
  • VMs that are no longer in use but were never turned off
  • Zombie VMs - VMs that failed during launch or deprovisioning
  • Idle Load Balancers
  • Idle SQL Databases

As an example, imagine you want to save engineers time, so you create a daily process that loads an anonymized production database into a cloud database for testing and verification in a safe environment. You improve engineering velocity, but no one ever makes a plan for cleanup (oh no). Now each day a new database VM is created, with attached resources, and then abandoned, resulting in a large number of zombie resources.

Regardless of the type of asset and why it was created, you will be charged as long as it is in a running state. Zombie assets must be isolated, evaluated, and immediately terminated if they no longer serve a purpose.

Pro Tip

Start by identifying VMs with a max CPU below 5% over the past 30 days. This doesn't always mean a VM is a zombie resource, but it's an excellent place to start investigating.
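The CPU heuristic above can be sketched as a simple filter over exported metrics. The VM names and readings below are hypothetical; in practice the samples would come from your provider's monitoring service:

```python
def flag_zombie_candidates(cpu_samples, max_cpu_threshold=5.0):
    """Flag VMs whose max CPU never exceeded the threshold over the
    sampling window (e.g. 30 days of daily max-CPU readings, in percent).
    Flagged VMs are candidates for investigation, not automatic termination."""
    return sorted(vm for vm, samples in cpu_samples.items()
                  if samples and max(samples) < max_cpu_threshold)

# Hypothetical 30-day max-CPU readings per VM.
candidates = flag_zombie_candidates({
    "web-1": [40.0, 55.0, 61.0],
    "batch-runner": [2.0, 3.5, 1.0],
    "report-gen": [4.0, 4.5, 2.0],
})
# "batch-runner" and "report-gen" never exceed 5% - worth investigating.
```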

4. Stay up to date on VM generations

Every so often, cloud providers release the next generation of VMs, or new versions of existing generations, with better price-for-performance or additional functionality. These releases often come with performance improvements that may enable you to run fewer VMs and reduce costs.

Microsoft retired Azure Service Manager (ASM) and completely replaced it with Azure Resource Manager (ARM). Any Azure customers still using classic (ASM) assets should migrate to ARM to avoid potential business impacts.

It's important to note that you can't change the generation of a VM after it's created. If you need to switch generations, you must delete the VM and recreate it in the new generation. However, with Azure SQL databases and SQL managed instances, you can select the hardware generation at creation time or change it later.

Pro Tip

Upgrading and staying current with VM generation types can save you a significant amount of money per year.

5. Rightsize infrastructure

Rightsizing is an optimization initiative that has a direct impact on performance and costs. It's common for engineers to create new VMs that are substantially larger than necessary, either to give themselves extra headroom or because they don't know the performance requirements of the new workload. Without rightsizing, costs will steadily increase.

To rightsize, you can:

  • Downsize
    This is recommended for underutilized resources that achieve the same core performance, even with a downsized workload.
  • Terminate
    This is recommended for zombie resources, which are assets that are running in your account but are not in use.
  • Upgrade
    Upgrade if your workloads are consistently under high utilization.

It’s important to consider CPU, memory, disk, and network in/out utilization, and to review trended metrics over time. Always use data to guide your decisions around reducing the size of the VM without hurting the performance of the app.

For example, if memory utilization, network utilization, and/or disk use is above 50% of the provisioned capacity, downsizing a VM to half its current capacity will likely hurt workload performance. In this situation, change the VM family from General Purpose to Compute Intensive or Memory Intensive, or deploy the workload in a VM Scale Set, which not only helps reduce spending but also increases application resiliency.

Disk storage can also be rightsized: factor in capacity, IOPS, and throughput to select the right disk from the Standard SSD, Standard HDD, Premium SSD, and Ultra disk tiers.

  • Standard SSD
    A cost-effective option optimized for workloads that need consistent performance at lower IOPS. Good for web servers, lightweight apps, and dev/test workloads.
  • Standard HDD
    Delivers reliable, low-cost disk support for VMs running latency-insensitive workloads. Suitable for backup, non-critical, and infrequently accessed workloads.
  • Premium SSD
    Delivers high-performance, low-latency disk support for VMs with IO-intensive workloads. Suitable for production and performance-sensitive workloads.
  • Ultra Disks
    Deliver high throughput, high IOPS, and consistently low-latency disk storage. Suitable for data-intensive workloads such as SAP HANA, top-tier databases, and transaction-heavy workloads.

Pro Tip

A good starting place is to look for VMs that have an average CPU < 5% and a max CPU < 20% over 30 days. VMs that meet these criteria are viable candidates for rightsizing or termination. Also note that Premium storage is billed based on total disk size, regardless of consumption.
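The thresholds above translate into a simple classification rule. A Python sketch follows; the high-utilization cutoff is an illustrative assumption of ours, not a provider recommendation:

```python
def rightsizing_action(avg_cpu, max_cpu, high_util_threshold=80.0):
    """Classify a VM from 30-day CPU stats (percent): avg < 5% and
    max < 20% makes it a candidate for downsizing or termination;
    consistently high utilization suggests an upgrade; anything else
    is left alone pending a look at memory, disk, and network too."""
    if avg_cpu < 5 and max_cpu < 20:
        return "downsize-or-terminate"
    if avg_cpu > high_util_threshold:
        return "consider-upgrade"
    return "keep"
```

Remember that CPU alone is not enough; as noted above, trended memory, disk, and network metrics should confirm any decision before a VM is resized.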

6. Buy reservations

This is an extremely cost-effective technique that applies to more than 15 different Azure, Google Cloud, and AWS services, including select VM, storage, and database services.

You can view these here:

  • Google Cloud
  • AWS
  • Azure

With reserved VM instances, you make a one- or three-year commitment to a predetermined level of VM utilization. In return, you get a discount on compute costs compared to pay-as-you-go pricing. Another advantage is that you don't have to pay upfront for the committed period: there is an option to pay monthly, and if your business situation changes and you no longer need the reservation, there are options to refund outstanding prepayments.

As a rule of thumb, the size of the reservation should be based on the total amount of compute used by the existing (or soon-to-be-deployed) workloads within a specific region, using the same performance tier and hardware generation.
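To sanity-check whether a reservation pays off, you can compare its monthly cost against pay-as-you-go for the hours you expect to run. A minimal sketch; the prices here are illustrative assumptions, not real provider quotes:

```python
def reservation_breakeven_hours(on_demand_hourly, reserved_monthly):
    """Hours per month a VM must run before a monthly-billed reservation
    beats pay-as-you-go at the given on-demand rate."""
    return reserved_monthly / on_demand_hourly

# Hypothetical pricing: $0.25/hour on demand vs. a $45/month reservation.
breakeven = reservation_breakeven_hours(0.25, 45.0)  # 180 hours
# A 24/7 VM runs roughly 730 hours a month, so for always-on workloads
# the reservation wins comfortably; for a VM that only runs a few hours
# a week, pay-as-you-go stays cheaper.
```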

7. Stop and start VMs on a schedule

Providers bill for a VM as long as it is running; once it is stopped (deallocated, in Azure's case), there is no compute charge for that VM. As an extreme but illustrative example from VMware: if your VMs run 24/7, your cloud provider will bill you roughly 672 to 744 hours per VM per month. However, if you schedule a VM to shut off between 5pm and 9am on weekdays, as well as on weekends and holidays, you would save around 488-592 VM-hours per month. This is an extreme breakdown and not always realistic - with today's flexible work schedules, we can't simply power down every VM outside normal working hours - but outside of production, you'll likely find many VMs that don't need to run 24/7/365. The most cost-efficient environments dynamically stop and start VMs on a set schedule, and each cluster of VMs can be treated differently.

Pro Tip

Set a target for weekly hours that non-production systems should run. 
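The arithmetic behind scheduled shutdowns is simple enough to sketch. Assuming a 30-day month with 22 weekdays and a 9am-5pm weekday schedule (holidays ignored for simplicity):

```python
def monthly_vm_hours_saved(days_in_month=30, weekdays=22,
                           on_hours_per_weekday=8):
    """Rough VM-hours saved per month by running a VM only during weekday
    working hours instead of 24/7."""
    always_on = days_in_month * 24               # 720 hours in a 30-day month
    scheduled = weekdays * on_hours_per_weekday  # 176 hours actually needed
    return always_on - scheduled

saved = monthly_vm_hours_saved()  # 544 VM-hours per month, per VM
```

Multiply the saved hours by a VM's hourly rate and by the number of non-production VMs, and the case for a stop/start schedule usually makes itself.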

8. Move object data to lower-cost tiers

Cloud providers offer several tiers of storage at different price points and performance levels. The best cost-management practice is to move data between tiers depending on its usage. You can adjust two things when it comes to storage: redundancy (how many copies are stored across how many locations) and access tier (how often the data is accessed). You can mix and match both options to create the right solution for your business.

  • As an example:
    • Cold locally redundant storage (LRS) is ideal for longer-term storage, backups, recovery
    • Cold geographically redundant storage (GRS) is ideal for archival

Pro Tip

Any objects residing in a hot tier that are older than 30 days should be moved to a cool tier.
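This 30-day rule is straightforward to express as a tiering decision. A Python sketch follows; the cutoff and tier names are assumptions to illustrate the policy, and in practice you would encode this in your provider's lifecycle-management rules rather than in application code:

```python
from datetime import datetime, timedelta, timezone

def tier_for_object(last_accessed, now=None, cool_after_days=30):
    """Pick an access tier from last-access age: objects untouched for
    30+ days move from hot to cool. (Archive tiers would extend this
    with a second, longer threshold.)"""
    now = now or datetime.now(timezone.utc)
    age = now - last_accessed
    return "cool" if age >= timedelta(days=cool_after_days) else "hot"

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
stale = tier_for_object(datetime(2024, 4, 1, tzinfo=timezone.utc), now=now)
fresh = tier_for_object(datetime(2024, 5, 20, tzinfo=timezone.utc), now=now)
```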

It’s important to remember that these best practices are not meant to be one-time activities, but ongoing processes. Because of the dynamic and ever-changing nature of the cloud, cost optimization activities should ideally take place continuously. 

Is cloud security on your to-do list? Check out this checklist to help you get started.

