
AWS cost optimisation: 30% savings


Six months ago, I pulled up our AWS Cost Explorer and felt my stomach drop. Our monthly bill had crept up to a figure that would make any CFO nervous - and the worst part was that I could not explain where half of it was going.

Sound familiar?

If you are running workloads on AWS, there is a very good chance you are overspending. According to multiple industry surveys, the average organisation wastes between 25% and 35% of its cloud spend. Not because the cloud is expensive - because it is easy to provision and very easy to forget about. I covered the broader strategic picture in my earlier piece on FinOps and controlling cloud costs - this post goes deeper into the tactical AWS-specific wins.

I manage the IT infrastructure for an e-commerce company supporting over 250 users. Our AWS footprint includes everything from production web servers and databases to development environments, data pipelines, and a handful of legacy services that nobody wants to touch. Over the course of three months, I reduced our monthly AWS bill by 30% without downgrading a single production resource. Here is exactly how.

Start With Visibility: You Cannot Optimise What You Cannot See

The single biggest mistake I see organisations make with cloud costs is not knowing where the money goes. AWS billing can be genuinely opaque - a single line item might aggregate dozens of services, and without proper tagging, you are flying blind.

Cost Allocation Tags

Before touching anything else, I implemented a comprehensive tagging strategy. Every resource got tagged with:

  • Environment - production, staging, development, sandbox
  • Team - which team owns this resource
  • Project - which project or product it supports
  • CostCentre - maps to our internal accounting codes

AWS lets you activate these as cost allocation tags, which means they appear in Cost Explorer and billing reports. Within a week of enforcing tags, we could see exactly which teams, projects, and environments were driving spend.

The results were immediately revealing. Our development and staging environments were responsible for 28% of our total bill - running 24/7 despite only being used during business hours. That single insight led to our first quick win.
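Enforcing the tags is the harder half, and a small audit script helps. Here is a minimal sketch, assuming boto3 and the four tag keys above; the region default and reporting format are illustrative, not our actual tooling:

```python
"""Find EC2 instances missing the required cost-allocation tags."""

REQUIRED_TAGS = {"Environment", "Team", "Project", "CostCentre"}


def missing_tags(resource_tags):
    """Return the set of required tag keys absent from a resource."""
    return REQUIRED_TAGS - set(resource_tags)


def untagged_instances(region="eu-west-2"):
    """Yield (instance_id, missing_keys) for non-compliant instances."""
    import boto3  # imported lazily so missing_tags() stays testable offline

    ec2 = boto3.client("ec2", region_name=region)
    for page in ec2.get_paginator("describe_instances").paginate():
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
                gaps = missing_tags(tags)
                if gaps:
                    yield instance["InstanceId"], gaps


if __name__ == "__main__":
    for instance_id, gaps in untagged_instances():
        print(f"{instance_id} missing: {', '.join(sorted(gaps))}")
```

The same pattern extends to RDS, S3, and any other taggable service via the Resource Groups Tagging API.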

AWS Cost Explorer and Budgets

Cost Explorer is free and surprisingly powerful once you have good tags. I set up:

  • Monthly budgets per environment and team with alerts at 80% and 100% thresholds
  • Anomaly detection to flag unexpected spend spikes
  • Weekly cost reports emailed to team leads

The psychology of this matters as much as the tooling. When engineers can see that their development cluster costs £2,400 per month, they start making different decisions. Visibility drives accountability. If you want to take this further, financial acumen is a critical skill for IT leaders - being able to translate technical decisions into financial outcomes is what separates good IT management from great.
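As a sketch of what that visibility looks like programmatically, per-team spend can be pulled straight from the Cost Explorer API. The `Team` tag key matches the strategy above; the date range and granularity are illustrative assumptions:

```python
"""Summarise spend per Team tag via the Cost Explorer API."""


def spend_by_group(results_by_time):
    """Fold a Cost Explorer ResultsByTime list into {group: total_cost}."""
    totals = {}
    for period in results_by_time:
        for group in period["Groups"]:
            key = group["Keys"][0]
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            totals[key] = totals.get(key, 0.0) + amount
    return totals


def team_spend(start="2024-05-01", end="2024-06-01"):
    import boto3  # lazy import keeps spend_by_group() testable offline

    ce = boto3.client("ce")
    response = ce.get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "TAG", "Key": "Team"}],
    )
    return spend_by_group(response["ResultsByTime"])
```

A weekly run of this, formatted into the report emailed to team leads, is all the tooling the visibility step really needs.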

Quick Wins: The Low-Hanging Fruit

Before diving into complex optimisation, I focused on changes that required minimal effort but delivered significant savings.

Scheduling Non-Production Environments

Our development and staging environments ran 24 hours a day, 7 days a week. Our developers work roughly 10 hours a day, 5 days a week. That meant these environments were idle for 70% of the time - and we were paying full price for every idle hour.

I implemented AWS Instance Scheduler to automatically stop non-production EC2 instances and RDS databases outside business hours. The configuration was straightforward - a CloudFormation template that creates Lambda functions to start and stop tagged resources on a schedule.
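To illustrate the idea (we used the off-the-shelf Instance Scheduler rather than this exact code), a hand-rolled version of the Lambda is only a few lines. The `Schedule=office-hours` tag value and the 08:00-18:00 Monday-Friday window are assumptions for the sketch:

```python
"""Stop/start EC2 instances tagged Schedule=office-hours on a schedule."""

from datetime import datetime, time


def in_business_hours(now):
    """True Monday-Friday between 08:00 and 18:00 local time."""
    return now.weekday() < 5 and time(8) <= now.time() < time(18)


def lambda_handler(event, context):
    import boto3  # lazy import keeps in_business_hours() testable offline

    ec2 = boto3.client("ec2")
    action = ec2.start_instances if in_business_hours(datetime.now()) else ec2.stop_instances
    filters = [{"Name": "tag:Schedule", "Values": ["office-hours"]}]
    instance_ids = [
        i["InstanceId"]
        for page in ec2.get_paginator("describe_instances").paginate(Filters=filters)
        for r in page["Reservations"]
        for i in r["Instances"]
    ]
    if instance_ids:
        action(InstanceIds=instance_ids)
```

An EventBridge rule firing hourly is enough to drive it; starting an already-running instance is a harmless no-op.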

Savings: 65% reduction in non-production compute costs.

For development databases, I went further and switched to Aurora Serverless v2, which scales to zero when idle. Developers get the same experience when they are working, and we pay nothing when they are not.

Deleting Zombie Resources

Every AWS account has them - resources that were provisioned for a project that finished months ago, or test environments that nobody remembered to tear down. I ran a systematic audit looking for:

  • Unattached EBS volumes - We found 47 volumes totalling 2.3TB that were not attached to any instance. These were costing us roughly £180 per month for literally nothing.
  • Old snapshots - Over 200 EBS snapshots older than 90 days, many from instances that no longer existed. Another £120 per month.
  • Unused Elastic IPs - AWS charges for Elastic IPs that are not associated with a running instance (and, since February 2024, for all public IPv4 addresses). We had 12 of them.
  • Idle load balancers - Three ALBs routing zero traffic, left over from decommissioned services.

I wrote a simple Python script using boto3 to identify these resources across all regions. The cleanup took a single afternoon and saved us over £400 per month.
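A sketch along the same lines, covering unattached volumes and idle Elastic IPs. The per-GB price is an assumed illustrative figure, not your region's actual rate:

```python
"""Cross-region audit for unattached EBS volumes and idle Elastic IPs."""


def monthly_volume_cost(size_gib, price_per_gib=0.08):
    """Rough monthly cost of an idle volume (assumed gp3-style price)."""
    return round(size_gib * price_per_gib, 2)


def audit_region(region):
    import boto3  # lazy import keeps monthly_volume_cost() testable offline

    ec2 = boto3.client("ec2", region_name=region)
    volumes = ec2.describe_volumes(
        Filters=[{"Name": "status", "Values": ["available"]}]
    )["Volumes"]
    idle_ips = [
        a["PublicIp"]
        for a in ec2.describe_addresses()["Addresses"]
        if "AssociationId" not in a
    ]
    return volumes, idle_ips


def audit_all_regions():
    import boto3

    regions = [r["RegionName"]
               for r in boto3.client("ec2").describe_regions()["Regions"]]
    for region in regions:
        volumes, idle_ips = audit_region(region)
        for v in volumes:
            print(f"{region}: {v['VolumeId']} {v['Size']} GiB "
                  f"~£{monthly_volume_cost(v['Size'])}/month wasted")
        for ip in idle_ips:
            print(f"{region}: idle Elastic IP {ip}")
```

Old snapshots and idle load balancers follow the same pattern with `describe_snapshots` (filtered to `OwnerIds=["self"]`) and the ELBv2 API.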

S3 Lifecycle Policies

Our S3 buckets had grown to several terabytes, with no lifecycle policies in place. Application logs, old backups, and historical data were all sitting in S3 Standard - the most expensive storage class.

I implemented tiered lifecycle policies:

  • After 30 days - Move to S3 Infrequent Access (roughly 45% cheaper)
  • After 90 days - Move to S3 Glacier Instant Retrieval (roughly 68% cheaper)
  • After 365 days - Move to S3 Glacier Deep Archive (roughly 95% cheaper)
  • After 730 days - Delete (with appropriate compliance checks)

For our log buckets specifically, I set a 90-day expiration policy after confirming that our compliance requirements only mandated 60-day retention for most log types.
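The whole tiered policy can be expressed and applied in one call. This is a minimal sketch assuming boto3; the bucket name is a placeholder and the rule applies to the entire bucket:

```python
"""Build and apply the tiered S3 lifecycle policy described above."""


def tiered_lifecycle_rule(rule_id="tiered-archive"):
    """IA at 30 days, Glacier IR at 90, Deep Archive at 365, delete at 730."""
    return {
        "ID": rule_id,
        "Status": "Enabled",
        "Filter": {"Prefix": ""},  # whole bucket; narrow with a prefix if needed
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 90, "StorageClass": "GLACIER_IR"},
            {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
        ],
        "Expiration": {"Days": 730},  # only after compliance sign-off
    }


def apply_lifecycle(bucket):
    import boto3  # lazy import keeps the rule builder testable offline

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={"Rules": [tiered_lifecycle_rule()]},
    )
```

Remember that Infrequent Access and the Glacier tiers carry minimum storage durations and retrieval charges, so very small or frequently read objects may not benefit.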

Savings: roughly 60% reduction in S3 costs.

Right-Sizing: Paying for What You Actually Use

Right-sizing is the process of matching your instance types to your actual workload requirements. It sounds obvious, but it is remarkable how many organisations are running workloads on instances two or three times larger than they need.

Using AWS Compute Optimizer

AWS Compute Optimizer analyses your CloudWatch metrics and recommends optimal instance types. I enabled it across all accounts and gave it two weeks to collect data.

The recommendations were eye-opening. Of our 34 production EC2 instances:

  • 12 were significantly over-provisioned (average CPU utilisation below 10%)
  • 8 could move to a newer generation instance type for better performance at lower cost
  • 3 were using general-purpose instances for memory-intensive workloads and would benefit from memory-optimised types

We implemented the changes gradually over four weeks, monitoring performance at each step. Not a single application experienced degradation - most actually performed better on the correctly sized instances because newer instance generations offer improved networking and storage throughput.
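Pulling the over-provisioned findings into a review list is straightforward via the API. A sketch, assuming boto3; the summary format is my own:

```python
"""List over-provisioned EC2 instances from Compute Optimizer."""


def summarise(recommendations):
    """Keep only over-provisioned instances with their top suggestion."""
    out = []
    for rec in recommendations:
        if rec.get("finding") == "OVER_PROVISIONED":
            options = rec.get("recommendationOptions", [])
            suggestion = options[0]["instanceType"] if options else None
            out.append((rec["currentInstanceType"], suggestion))
    return out


def over_provisioned_instances():
    import boto3  # lazy import keeps summarise() testable offline

    co = boto3.client("compute-optimizer")
    response = co.get_ec2_instance_recommendations()
    return summarise(response["instanceRecommendations"])
```

Feeding this list into the monthly cost review keeps right-sizing a habit rather than a one-off exercise.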

RDS Right-Sizing

Databases are often the most over-provisioned resources because nobody wants to risk database performance. But running a db.r6g.2xlarge for a database that peaks at 15% CPU utilisation is burning money.

I took a more cautious approach with databases:

  1. Enabled Enhanced Monitoring to get per-second granularity
  2. Analysed two months of metrics to identify true peak usage
  3. Tested the smaller instance type in staging under load
  4. Scheduled the production resize during a low-traffic maintenance window

We moved three of our five RDS instances down one size class and converted two from Multi-AZ to Single-AZ (for non-critical reporting databases where brief downtime is acceptable).
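The final step, scheduling the resize for the maintenance window, looks roughly like this. The one-size-down mapping and identifiers are illustrative assumptions, and `ApplyImmediately=False` is what defers the change:

```python
"""Schedule an RDS instance downsize for the next maintenance window."""

SIZES = ["large", "xlarge", "2xlarge", "4xlarge", "8xlarge"]


def one_size_down(instance_class):
    """db.r6g.2xlarge -> db.r6g.xlarge; raises at the smallest size."""
    prefix, family, size = instance_class.split(".")
    index = SIZES.index(size)
    if index == 0:
        raise ValueError(f"{instance_class} is already the smallest size here")
    return ".".join([prefix, family, SIZES[index - 1]])


def schedule_downsize(identifier):
    import boto3  # lazy import keeps one_size_down() testable offline

    rds = boto3.client("rds")
    current = rds.describe_db_instances(DBInstanceIdentifier=identifier)[
        "DBInstances"][0]["DBInstanceClass"]
    rds.modify_db_instance(
        DBInstanceIdentifier=identifier,
        DBInstanceClass=one_size_down(current),
        ApplyImmediately=False,  # wait for the maintenance window
    )
```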

Savings: roughly 40% reduction in RDS costs.

Reserved Instances vs Savings Plans: Making the Commitment

Once you have right-sized your workloads and eliminated waste, the next step is committing to your baseline usage in exchange for significant discounts.

Understanding the Options

AWS offers two main commitment mechanisms:

Reserved Instances (RIs) are tied to specific instance types, regions, and operating systems. They offer discounts of up to 72% compared to on-demand pricing but are inflexible - if you change instance types, the reservation may not apply.

Savings Plans are newer and more flexible. Compute Savings Plans apply to any EC2 instance family, size, region, or operating system, as well as Fargate and Lambda usage. They offer discounts of up to 66% and automatically apply to your highest-cost usage first.

What I Chose and Why

For our stable, predictable production workloads, I went with a mix:

  • Compute Savings Plans (1-year, no upfront) for the majority of our baseline compute. The flexibility was worth the slightly lower discount compared to RIs. When we right-sized instances later, the savings plan automatically adjusted.
  • EC2 Instance Savings Plans (1-year, partial upfront) for our database servers running on EC2, which are unlikely to change instance family. The higher discount justified the reduced flexibility.
  • No 3-year commitments. The technology landscape changes too quickly for us to confidently predict our needs three years out. The difference between 1-year and 3-year savings did not justify the risk for our organisation.

I used the AWS Cost Explorer Savings Plans recommendations to identify the optimal commitment amount. The key is to commit only to your baseline - the minimum usage you are confident about - and cover peaks with on-demand or spot instances.
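Those recommendations are also available programmatically. A sketch assuming boto3; committing to a fraction of the recommendation is my own safety margin, not AWS guidance:

```python
"""Fetch the Compute Savings Plans recommendation via Cost Explorer."""


def conservative_commitment(recommended_hourly, fraction=0.8):
    """Commit below the recommendation to stay under the true baseline."""
    return round(recommended_hourly * fraction, 2)


def recommended_hourly_commitment():
    import boto3  # lazy import keeps conservative_commitment() testable offline

    ce = boto3.client("ce")
    response = ce.get_savings_plans_purchase_recommendation(
        SavingsPlansType="COMPUTE_SP",
        TermInYears="ONE_YEAR",
        PaymentOption="NO_UPFRONT",
        LookbackPeriodInDays="SIXTY_DAYS",
    )
    summary = response["SavingsPlansPurchaseRecommendation"][
        "SavingsPlansPurchaseRecommendationSummary"]
    return float(summary["HourlyCommitmentToPurchase"])
```

Under-committing slightly costs a little in missed discount but avoids paying for capacity you stop using.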

Savings: roughly 35% reduction in committed compute costs compared to on-demand.

Spot Instances: The Secret Weapon

Spot instances let you use spare AWS capacity at discounts of up to 90% compared to on-demand. The catch is that AWS can reclaim them with two minutes' notice. This makes them unsuitable for stateful or time-sensitive workloads but perfect for:

  • Batch processing and data pipelines - Our nightly ETL jobs run on spot instances. If an instance is reclaimed, the job retries automatically.
  • CI/CD build agents - Build jobs are inherently idempotent. A reclaimed instance just means a build restarts.
  • Development environments - Developers can tolerate occasional interruptions in exchange for massive cost savings.
  • Auto Scaling Groups with mixed instances - We configured our production ASGs to use a mix of on-demand (for the baseline) and spot (for scaling). The ASG automatically replaces reclaimed spot instances.
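The mixed on-demand/spot configuration from the last bullet boils down to one `MixedInstancesPolicy`. A sketch, assuming boto3; the launch template name, instance types, and capacities are placeholders:

```python
"""Build a mixed on-demand/spot policy for an Auto Scaling Group."""


def mixed_instances_policy(template_name, instance_types,
                           on_demand_base=2, spot_percentage=100):
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": template_name,
                "Version": "$Latest",
            },
            # Diversify across types to reduce simultaneous reclamation
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": on_demand_base,
            # 0% on-demand above the base means all scale-out is spot
            "OnDemandPercentageAboveBaseCapacity": 100 - spot_percentage,
            "SpotAllocationStrategy": "price-capacity-optimized",
        },
    }


def update_asg(asg_name, policy):
    import boto3  # lazy import keeps the policy builder testable offline

    autoscaling = boto3.client("autoscaling")
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        MixedInstancesPolicy=policy,
    )
```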

Handling Interruptions Gracefully

The key to using spot instances successfully is designing for interruption:

  • Use multiple instance types in your spot fleet. Diversification across instance types and availability zones dramatically reduces the chance of simultaneous reclamation.
  • Implement graceful shutdown handlers that drain connections and checkpoint work when the two-minute warning arrives.
  • Use spot placement scores to choose instance types and regions with the lowest interruption rates.

We run roughly 40% of our non-production compute on spot instances with an average interruption rate below 5%.
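The graceful shutdown handler mentioned above amounts to polling the instance metadata service for the interruption notice. A sketch using IMDSv2; what you do on drain is workload-specific:

```python
"""Poll IMDSv2 for the spot interruption notice."""

import json
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"


def should_drain(instance_action_json):
    """True when the notice says the instance will stop or terminate."""
    action = json.loads(instance_action_json).get("action")
    return action in ("stop", "terminate")


def poll_for_notice():
    token_request = urllib.request.Request(
        f"{IMDS}/api/token", method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"})
    token = urllib.request.urlopen(token_request).read().decode()
    notice_request = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token})
    try:
        body = urllib.request.urlopen(notice_request).read().decode()
    except urllib.error.HTTPError:
        return False  # 404 means no interruption notice yet
    return should_drain(body)
```

Run this every few seconds; when it returns True, drain connections and checkpoint work before the two-minute window closes.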

Savings: roughly 70% reduction in applicable compute costs.

Networking Costs: The Hidden Expense

Data transfer costs are the silent killer of AWS bills. They are easy to overlook because they do not appear as obvious line items until you dig into the detail.

Key Optimisations

VPC Endpoints - We were paying data transfer charges for traffic between our EC2 instances and S3 because it was routing through a NAT Gateway. Adding a VPC Gateway Endpoint for S3 (which is free) eliminated those charges entirely. We did the same for DynamoDB.
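Creating the gateway endpoint is a one-liner once you know your route tables. A sketch assuming boto3; the VPC and route table IDs are placeholders:

```python
"""Create a free S3 Gateway Endpoint in a VPC."""


def gateway_service_name(region, service):
    """Endpoint service name, e.g. com.amazonaws.eu-west-2.s3."""
    return f"com.amazonaws.{region}.{service}"


def add_gateway_endpoint(vpc_id, route_table_ids, region="eu-west-2",
                         service="s3"):
    import boto3  # lazy import keeps gateway_service_name() testable offline

    ec2 = boto3.client("ec2", region_name=region)
    ec2.create_vpc_endpoint(
        VpcId=vpc_id,
        ServiceName=gateway_service_name(region, service),
        VpcEndpointType="Gateway",
        RouteTableIds=route_table_ids,
    )
```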

NAT Gateway optimisation - NAT Gateways charge per GB of data processed. We reviewed what traffic was flowing through our NAT Gateways and moved several internal service-to-service communications to use private subnets and VPC endpoints instead.

CloudFront for static assets - Serving static assets directly from S3 incurs data transfer charges. Putting CloudFront in front of S3 is often cheaper because CloudFront's per-GB pricing is lower than S3 data transfer pricing, and caching reduces the number of origin requests.

Cross-AZ traffic - AWS charges for data transfer between availability zones. We restructured some of our service communication to prefer same-AZ connections where possible, using AZ-aware service discovery.

Savings: roughly 25% reduction in data transfer costs.

Automation and Governance

Cost optimisation is not a one-time project. Without ongoing governance, costs will creep back up within months. This is where platform engineering practices really shine - building cost guardrails into your developer platform so optimisation happens by default rather than by exception.

What We Automated

  • Tag compliance - An AWS Config rule that flags any resource missing required tags. Non-compliant resources trigger a notification to the owning team.
  • Budget alerts - Automated Slack notifications when any team or environment exceeds 80% of their monthly budget.
  • Zombie resource detection - A weekly Lambda function that identifies unattached volumes, unused IPs, and idle resources, then creates a report.
  • Savings Plans coverage monitoring - Alerts when our Savings Plans coverage drops below 80%, indicating we should review our commitments.
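The budget alerts in the list above map directly onto the AWS Budgets API. A sketch, assuming boto3; the account ID, budget amount, and email address are placeholders, and our actual alerts route via SNS to Slack:

```python
"""Create a monthly cost budget with 80% and 100% actual-spend alerts."""


def threshold_notification(threshold=80.0):
    """Notification config for crossing a percentage of actual spend."""
    return {
        "NotificationType": "ACTUAL",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": threshold,
        "ThresholdType": "PERCENTAGE",
    }


def create_team_budget(account_id, name, monthly_amount, email):
    import boto3  # lazy import keeps threshold_notification() testable offline

    budgets = boto3.client("budgets")
    budgets.create_budget(
        AccountId=account_id,
        Budget={
            "BudgetName": name,
            "BudgetLimit": {"Amount": str(monthly_amount), "Unit": "USD"},
            "TimeUnit": "MONTHLY",
            "BudgetType": "COST",
        },
        NotificationsWithSubscribers=[
            {"Notification": threshold_notification(80.0),
             "Subscribers": [{"SubscriptionType": "EMAIL", "Address": email}]},
            {"Notification": threshold_notification(100.0),
             "Subscribers": [{"SubscriptionType": "EMAIL", "Address": email}]},
        ],
    )
```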

Monthly Cost Reviews

I run a monthly cost review meeting with engineering leads. We review:

  • Total spend vs budget
  • Cost per environment and team
  • Top cost drivers and trends
  • Recommendations from Compute Optimizer
  • Any new services or projects that need cost planning

This 30-minute meeting has become one of the most valuable in our calendar. It keeps costs visible and ensures that cost-consciousness is part of our engineering culture, not just an ops concern.

The Results

After three months of systematic optimisation:

  • Total AWS bill reduced by 30% - A significant six-figure annual saving
  • No performance degradation - Our P95 latency actually improved slightly due to right-sizing to newer instance generations
  • Better visibility - Every team knows what they spend and owns their costs
  • Sustainable governance - Automated checks prevent cost creep

The 30% reduction broke down roughly as follows:

| Category | Saving |
|---|---|
| Non-production scheduling | 8% |
| Zombie resource cleanup | 3% |
| S3 lifecycle policies | 4% |
| Right-sizing | 6% |
| Savings Plans | 5% |
| Spot instances | 3% |
| Networking optimisation | 1% |

Where to Start

If you are staring at your own AWS bill wondering where to begin, here is my recommended order:

  1. Tags first. You cannot optimise what you cannot see. Implement cost allocation tags before anything else.
  2. Kill zombies. Unattached volumes, old snapshots, idle resources - these are free savings.
  3. Schedule non-production. If your dev environments run 24/7, you are wasting 65-70% of that spend.
  4. Right-size. Enable Compute Optimizer and give it two weeks. Then act on the recommendations.
  5. Commit to your baseline. Once you know your steady-state usage, Savings Plans will knock 30-40% off that cost.
  6. Go spot where you can. Batch jobs, CI/CD, and stateless scaling are ideal candidates.
  7. Automate governance. Make sure your optimisations stick.

The cloud is only expensive if you let it be. With systematic optimisation and a culture of cost awareness, you can run performant, reliable infrastructure at a fraction of the list price.


Daniel Glover is an IT leader specialising in cloud infrastructure, cybersecurity, and technology strategy. He writes about practical IT leadership at danieljamesglover.com.
