Getting scientific about over-provisioned AWS instances
Perhaps the most costly of AWS sins is the pervasive habit of running EC2 instances at suboptimal efficiency. Whether you’re leaving instances alive that were intended for short-term use, running dev/test/staging instances outside of business hours, or over-provisioning in production, the effect on your total bill can be massive.
The goal of this post is to use a more scientific method to focus on identifying over-provisioned instances—but chances are, it’ll uncover other suboptimal patterns as well.
Getting past the basics
One practice that many folks turn to when looking for underworked instances is to set thresholds for determining which instances just aren’t being used all that much—in other words, instances that are “underutilized.” This usually means applying fairly crude metrics across all of your EC2 instances, regardless of what they are being used for or what type of instance they are.
This strategy can work well, especially if you set very low thresholds. Using this methodology you’ll find particularly low-hanging fruit, which you can then act on. You can even do this with AWS’s own Trusted Advisor, which focuses on CPU utilization of less than 10%. But this metric is probably not particularly useful for storage optimized instances. It’s also too low a threshold if you want to identify over-provisioned compute optimized instances in general.
The strategy we’re going to discuss here focuses on identifying improper provisioning across your infrastructure by coming up with some metrics and thresholds that are specific to instance categories and types.
Where CPU matters
There are several instance categories where CPU utilization is a good measure for identifying over-provisioning. Certainly if you’ve purchased a compute optimized instance you’d want to use CPU utilization to judge whether you’ve chosen 1) the right size and 2) the right instance category. General purpose instances have a more balanced resource profile, but again CPU utilization is going to be a key metric so long as the threshold isn’t set too high. Finally, CPU utilization will also be key in identifying underutilized GPU instances.
For these three instance categories, classifying over-provisioned instances as having an average CPU utilization of less than 25% seems to make sense. It means that if you choose to vertically scale down one level then your instance(s) will at most average 50% CPU utilization. For purposes of effective reporting, you might want to also consider these guidelines:
- Don’t report on burstable instance families (t2), as these are designed to typically run at a low CPU% for significant stretches of time.
- Focus on very recent data, as you’ll want your reported instances to still be around so they can be actioned.
- Only focus on instances that have run for at least half the time period you are looking at. Looking at a one week period and instances that have run for at least 84 hours works well.
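The 25% threshold and the three guidelines above can be sketched as a simple filter. This is a minimal illustration rather than any particular reporting tool's logic: the `InstanceUsage` fields, the instance IDs, and the cost figures are hypothetical, and the "halving doubles average CPU" projection is the same rough approximation used in the text.

```python
from dataclasses import dataclass

# Thresholds taken from the guidelines above. Field names and sample
# figures are hypothetical, not from a real billing export.
CPU_THRESHOLD_PCT = 25.0
MIN_HOURS_RUN = 84            # half of a one-week (168-hour) window
BURSTABLE_FAMILIES = {"t2"}


@dataclass
class InstanceUsage:
    instance_id: str
    instance_type: str        # e.g. "c3.4xlarge"
    hours_run: float          # hours running within the reporting window
    avg_cpu_pct: float        # average CPU utilization over that window
    est_cost: float           # estimated spend over that window


def family(instance_type: str) -> str:
    """Return the family part of an instance type, e.g. "c3" for "c3.4xlarge"."""
    return instance_type.split(".")[0]


def over_provisioned(usages):
    """Apply the guidelines and order the results by highest estimated spend."""
    candidates = [
        u for u in usages
        if family(u.instance_type) not in BURSTABLE_FAMILIES
        and u.hours_run >= MIN_HOURS_RUN
        and u.avg_cpu_pct < CPU_THRESHOLD_PCT
    ]
    return sorted(candidates, key=lambda u: u.est_cost, reverse=True)


def projected_cpu_after_halving(avg_cpu_pct: float) -> float:
    """Rough estimate: halving capacity doubles average CPU utilization."""
    return min(avg_cpu_pct * 2, 100.0)


sample = [
    InstanceUsage("i-aaa", "c3.4xlarge", 168, 5.13, 140.0),
    InstanceUsage("i-bbb", "t2.medium", 168, 3.0, 9.0),    # burstable: skipped
    InstanceUsage("i-ccc", "m3.xlarge", 40, 0.5, 11.0),    # ran under 84 h: skipped
    InstanceUsage("i-ddd", "m1.xlarge", 168, 17.5, 59.0),
    InstanceUsage("i-eee", "c3.large", 168, 62.0, 18.0),   # above threshold: skipped
]

flagged = over_provisioned(sample)
for u in flagged:
    print(u.instance_id, u.instance_type, projected_cpu_after_halving(u.avg_cpu_pct))
```

Because every flagged instance averages under 25%, the projected figure after dropping one size stays under 50%, which is the headroom argument made above.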
Let’s have a look at a report on some sample AWS usage.
You’ll notice this data is ordered by the instances with the highest estimated spend, which is a great way of prioritizing your potential wins. It’s also important at some point to include the dimensions Account and InstanceID so that you can track down the precise instances.
Here is an action list that could come from this report.
| Observation | Action |
|---|---|
| There are some C3 instances running in production with very low utilization, of particular note a c3.4xlarge instance running at 5.13% | Investigate what these instances are being used for, through tags or instance names. Consider whether compute optimized makes sense. If so, look to downsize instances. |
| There are some m1.xlarge instances running in production with fairly low utilization. | These instances look like they could be halved in size at least. At worst this would bring them to ~35 average CPU%. If halving in size would significantly affect other pertinent resource factors, such as memory, then look to change instance category. As a side note: look to move to new generation kit (m3 or m4) which will boost performance 50-65%. |
| There are a number of m3.xlarge instances running at near to 0% CPU in production. | These look like they could be massively downscaled. If these instances are in an ASG with multiple instances then scale down horizontally, otherwise scale vertically. |
| There is one dev instance, an m3.xlarge, in the top bunch which is running at 0% CPU and is up every hour of the day. | Look to turn off this instance outside of business hours. Probably also over-provisioned; being dev it can safely be downsized. |
Special note: There is a chance a larger instance has been chosen with the intention of getting superior network performance. Add the total bandwidth dimension to the above report if you think that may be the case. Also keep in mind that some jumps in instance size don’t come with a network improvement. If you do have special bandwidth needs, you might want to enable enhanced networking. This raises packet throughput and lowers the CPU overhead of network I/O, and may allow you to drop down an instance size.
Where local disk I/O matters
The storage optimized instance category has two current generation instance families: I2 and D2. The key feature of both of these families is the high IOPS they have with their local ephemeral storage. We can use this metric as the main indicator of whether an instance has been over-provisioned or not.
The nice thing about the CPU% metric is that it’s recorded as a percentage out of the box, which makes it simple to report on. However, this is not the case for disk I/O. Different instance sizes also have different IOPS ratings, so it’s best to tailor a report to each instance type. Start with your most common storage optimized instance; you can work this out with a report like this. For our sample data, this is the i2.8xlarge instance.
Now we can see the Disk I/O figures for an entire week for each instance.
According to Amazon, these instances have a maximum of 365k read IOPS and 315k first-write IOPS. Just to get a sense of scale, this corresponds to 31.5 billion read and 27.2 billion write operations per day. We wouldn’t expect to sustain these kinds of numbers, but you can see above that we’re orders of magnitude below this. It’d be a good idea to focus in on one of these instances and see how I/O and CPU% change over the day, and whether there are peaks we need to account for. Here is an example instance from above.
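The scale figures follow directly from multiplying the rated IOPS by the number of seconds in the period. The sketch below reproduces that arithmetic and turns an observed operation count into a percentage of rated capacity, which gives disk I/O the same "percent of maximum" shape that CPU% has out of the box. The IOPS ratings are the ones quoted here; the observed count is a made-up illustration, and the function names are my own.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Ratings quoted above for the i2.8xlarge.
READ_IOPS = 365_000
WRITE_IOPS = 315_000


def rated_ops(iops: int, seconds: int) -> int:
    """Maximum operations the rating allows over a period of `seconds`."""
    return iops * seconds


def utilization_pct(observed_ops: int, iops: int, seconds: int) -> float:
    """Observed operations as a percentage of the rated maximum for the period."""
    return 100.0 * observed_ops / rated_ops(iops, seconds)


reads_per_day = rated_ops(READ_IOPS, SECONDS_PER_DAY)     # about 31.5 billion
writes_per_day = rated_ops(WRITE_IOPS, SECONDS_PER_DAY)   # about 27.2 billion
reads_per_minute = rated_ops(READ_IOPS, 60)               # about 21.9 million

# Hypothetical observation: 50 million reads recorded over one day.
observed_reads = 50_000_000
print(f"{utilization_pct(observed_reads, READ_IOPS, SECONDS_PER_DAY):.3f}% of rated read capacity")
```

Expressing each instance type's I/O this way means a single percentage threshold can be applied per type, even though the raw ratings differ between sizes.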
Keep in mind, the i2.8xlarge instances are super expensive, and we can see that for many hours of the day, this instance isn’t doing much. Perhaps we can use smaller instances and scale horizontally when workloads increase. It also appears that the bottleneck is CPU% rather than disk I/O. Theoretically, these boxes can handle 20 million+ read operations per minute, whereas we don’t even reach that in a one-hour period. If we can switch these boxes over to c3 compute optimized instances, we’ll be able to achieve this CPU performance for far less money. Perhaps the engineers chose i2 because it has large ephemeral disks. Maybe there are issues with scaling up and down because of the time required to prepare these disks. In any case, it’s worth asking them.
Where memory matters
For memory optimized instances, you’d certainly want to focus on memory usage when evaluating whether you are over-provisioned. Unfortunately, this metric is not available within CloudWatch by default, and hence not immediately available within Cloudability. However, Amazon provides scripts that you can use to publish this data to CloudWatch, which may help. You could also use a very low CPU% threshold to help identify obviously over-provisioned instances.
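To give a feel for what such a script involves, here is a minimal sketch of the idea. It is not Amazon's script. It assumes a Linux host whose `/proc/meminfo` exposes `MemTotal` and `MemAvailable`, and the namespace `Custom/Memory` and metric name `MemoryUtilization` are hypothetical choices. The publish step uses boto3's CloudWatch `put_metric_data` call and needs boto3 plus AWS credentials, so it is only defined here and never invoked.

```python
def memory_used_pct(meminfo_text: str) -> float:
    """Compute percent of memory in use from /proc/meminfo-style text (values in kB)."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    total = fields["MemTotal"]
    available = fields["MemAvailable"]  # assumes a kernel that reports this field
    return 100.0 * (total - available) / total


def build_datum(instance_id: str, used_pct: float) -> dict:
    """Build one CloudWatch metric datum; the metric name is a hypothetical choice."""
    return {
        "MetricName": "MemoryUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Unit": "Percent",
        "Value": used_pct,
    }


def publish(instance_id: str, used_pct: float) -> None:
    """Send the datum to CloudWatch. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the rest of the sketch runs without it

    boto3.client("cloudwatch").put_metric_data(
        Namespace="Custom/Memory",  # hypothetical namespace
        MetricData=[build_datum(instance_id, used_pct)],
    )


# Illustrative input in the /proc/meminfo format (made-up figures).
SAMPLE_MEMINFO = (
    "MemTotal:       16000000 kB\n"
    "MemFree:         2000000 kB\n"
    "MemAvailable:    4000000 kB\n"
)
print(memory_used_pct(SAMPLE_MEMINFO))
```

On a real instance you would read `/proc/meminfo` on a schedule (for example from cron) and call `publish` with the instance's own ID, after which the metric can be reported on like any other CloudWatch metric.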
Focusing on the metrics which are most relevant to your instance types can help you better work out whether you’ve over-provisioned or not. You may discover a number of things. Perhaps you actually purchased within the wrong instance category. Perhaps there is plenty of room to downsize your instances vertically or horizontally. Or perhaps you’ll encounter some other types of waste, such as leaving dev instances running outside of business hours. In any case, there is a good chance you’ll find significant savings.
As always, nothing will be more effective in ensuring proper provisioning than actually knowing what your infrastructure does, being aware of its quirks, and understanding your specific risks. But using the principles described here should give you great insight into what’s going on within your cloud infrastructure and where you can make considerable savings.