
Managing the cost of big data workloads on AWS

By Leah Weitz on September 19, 2014
PLEASE NOTE: As of December 2nd, 2014, there has been a major change to the Reserved Instance model. As a result, some of the information in this blog post may be incomplete. We invite you to read our Reserved Instances 101 for up-to-date information.

----------

As more big data housing and processing workflows move to AWS, the ability to understand and manage the associated costs is becoming increasingly vital. In this blog post, we’ll walk through the most effective strategies for monitoring, controlling, and optimizing the costs of running big data on AWS.

Identify your big data usage pattern, and monitor for disruptions

Managing the costs of your big data on AWS begins with watching them. Depending on your workload structure, your usage, and accordingly your costs, should follow one of three patterns: consistent, spiked sporadically, or spiked at regular intervals.

[Image: Big data usage patterns on AWS]

If your data analysis is ongoing, as in fraud detection monitoring or clickstream analysis, your costs will likely fall into the first category: consistent. Ongoing analysis requires a similar amount of compute at all times; while usage may trend up or down over the long run, relatively static usage equates to relatively static costs.

If you run your analyses on an irregular or as-needed basis, such as in genomics research, your costs will likely fall into the second category: spiked sporadically. When you run an analysis, your usage scales up along with your costs; when the analysis finishes, both scale back down.

If you run your analyses at timed intervals (every hour, day, or week), as in financial modeling and forecasting or keyword analysis, your costs will likely fall into the third category: spiked at regular intervals. As in the second category, your usage and costs scale up and down as your analyses start and stop.

Identifying these patterns and monitoring for disruptions allows you both to determine which cost management strategies to pursue and to ensure that you don't experience any unexpected or inappropriate cost behavior.
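
If you already export daily cost totals (from Cloudability reports or your AWS billing data), a few lines of Python can give you a rough read on which pattern you're in. This is a minimal sketch, not part of any AWS or Cloudability API; the spike threshold and sample numbers are illustrative assumptions.

# Rough classification of a daily cost series into the three patterns above.
# Assumes `daily_costs` is a list of daily dollar totals you have exported
# from your cost reporting tool; the threshold is illustrative, not canonical.

def classify_pattern(daily_costs, spike_factor=2.0):
    mean = sum(daily_costs) / len(daily_costs)
    spikes = [i for i, cost in enumerate(daily_costs) if cost > spike_factor * mean]

    if not spikes:
        return "consistent"

    # Check whether spikes land at (roughly) evenly spaced intervals.
    gaps = [b - a for a, b in zip(spikes, spikes[1:])]
    if gaps and max(gaps) - min(gaps) <= 1:
        return "spiked at regular intervals"
    return "spiked sporadically"

print(classify_pattern([100, 102, 98, 101, 99, 103]))         # consistent
print(classify_pattern([100, 100, 450, 100, 100, 470, 100]))  # spiked at regular intervals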

Tag your clusters to assign value to workflows

Whether your big data costs are static or spiked, you can drill down into the specific sources of those costs by tagging your clusters. Tags allow you to use an app such as Cloudability to report costs by relevant dimensions, such as cost by service or cost by product. This lets you assess how much each of your analyses costs and determine its relative value, ensuring that you aren't running any inordinately expensive workflows. It also grants you ongoing visibility into the success of your efforts to reduce hourly costs, which we'll cover in more depth in the following sections.

While you can tag any cluster on AWS, tagging works somewhat differently for Elastic MapReduce. When you tag an EMR cluster, the constituent machines of that cluster are automatically tagged with the name of the cluster (“aws:elasticmapreduce:job-flow-id=X”), with the machine’s specific role within the cluster (“aws:elasticmapreduce:instance-group-role=X”, where X is “master,” “core,” or “task”), and with any additional tags that you apply to the cluster (such as “Environment=Development”). These automatic tags make it particularly easy to report down to cost per job, which would require manually applied tags for your non-EMR workloads. As with your other workloads, you can look at cost per EMR job to assess the value of your workflows, evaluate the outcome of your cost-saving strategies, and eliminate any exorbitant costs.
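
To illustrate, here is a minimal boto3 sketch that applies tags to an EMR cluster (so its instances inherit them) and to a standalone EC2 instance for comparison; the cluster ID, instance ID, and tag values are placeholders.

import boto3

# Tag an EMR cluster; its constituent instances inherit the tags along with
# the automatic aws:elasticmapreduce:* tags described above.
emr = boto3.client("emr")
emr.add_tags(
    ResourceId="j-XXXXXXXXXXXXX",            # placeholder EMR cluster (job flow) ID
    Tags=[{"Key": "Environment", "Value": "Development"},
          {"Key": "Workflow", "Value": "clickstream-analysis"}],
)

# Non-EMR instances need to be tagged directly.
ec2 = boto3.client("ec2")
ec2.create_tags(
    Resources=["i-0123456789abcdef0"],       # placeholder instance ID
    Tags=[{"Key": "Workflow", "Value": "fraud-detection"}],
)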

Architect your big data resources efficiently

There are several architectural decisions that can affect your big data costs. Specifically, you want to be sure that you’re using the right resources for the job, and that they can interact with each other efficiently.

Choosing the most cost-effective storage option for your business’s needs is a central component of optimizing your big data infrastructure. EBS has generally been considered the superior option compared to S3 due to its greater, tunable speed and the ability to choose between different types of EBS volumes. However, S3 has its own advantages: it’s the less costly option, and because you don’t have to attach and detach volumes to move data around, it has rightly earned a reputation for being the more convenient of the two. If you won’t miss the fine-tunability and extra speed of EBS, S3 may be a perfectly acceptable alternative that can save you some dollars.

Another extremely cost-effective big data storage option is Glacier. Glacier trades retrieval speed for cost efficiency: it takes between three and five hours to retrieve data from Glacier, but storage costs only $0.01 per gigabyte per month. For data that doesn’t require immediate retrieval, such as backups or archives, opting for Glacier in place of S3 or EBS can make a nice dent in your storage costs.
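
For a back-of-the-envelope comparison, the sketch below estimates monthly storage costs for a 10 TB dataset. The Glacier rate is the one quoted above; the S3 and EBS rates are assumptions for illustration only, so check current AWS pricing for your region before acting on the numbers.

# Back-of-the-envelope monthly storage cost for a 10 TB dataset.
# The Glacier rate matches the figure above; the S3 and EBS rates are
# placeholders -- check current AWS pricing for your region.
DATASET_GB = 10 * 1024

RATES_PER_GB_MONTH = {
    "Glacier": 0.01,   # from the post
    "S3": 0.03,        # assumed standard-storage rate
    "EBS": 0.10,       # assumed general-purpose volume rate
}

for service, rate in RATES_PER_GB_MONTH.items():
    print(f"{service}: ${DATASET_GB * rate:,.2f}/month")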

In addition to optimizing your storage, you can further improve the cost efficiency of your big data infrastructure by turning a critical eye to your data transfer costs. If you’re housing your data in one region and processing it in another, you’ll rack up data transfer costs for moving your data back and forth. Consolidating your S3/EBS storage and EC2 instances in the same region eliminates that whole process from your infrastructure, and from your bill. To see if your own big data is racking up transfer costs, generate a Cloudability Cost Allocation Report by product and check whether you’ve accumulated any data transfer fees.
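
One quick way to spot a mismatch is to check whether a bucket and the instances processing it live in the same region. The boto3 sketch below does just that; the bucket name and tag filter are placeholders.

import boto3

# Check that an S3 bucket and the EC2 instances processing it live in the
# same region, so you aren't paying cross-region transfer fees.
s3 = boto3.client("s3")
bucket_region = s3.get_bucket_location(Bucket="my-big-data-bucket")["LocationConstraint"] or "us-east-1"

# Look for the workflow's instances in the bucket's region; an empty result
# suggests they are running somewhere else (or not at all).
ec2 = boto3.client("ec2", region_name=bucket_region)
reservations = ec2.describe_instances(
    Filters=[{"Name": "tag:Workflow", "Values": ["clickstream-analysis"]}]
)["Reservations"]

instance_azs = {
    inst["Placement"]["AvailabilityZone"]
    for r in reservations for inst in r["Instances"]
}
print("Bucket region:", bucket_region)
print("Instance AZs: ", instance_azs or "none found in this region")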

Use Reserved Instances and Spot Instances to lower your hourly cost

Ongoing big data workloads are a prime candidate for saving with AWS Reserved Instances. Any instance with above 50-60% utilization (depending on the specific instance type) will cost less with a Heavy Reserved Instance than it would on-demand, but if you’re running your data analysis constantly, your utilization rates on those instances are likely closer to 100%. At such a high utilization rate, your savings with a Heavy reservation can climb as high as 45% compared to the on-demand price.

At such a steep savings rate, purchasing enough Heavy reservations to fully cover your big data workloads can make a serious dent in your bill. To see precisely how many Heavy reservations you’d need to keep on-demand costs to a minimum, log into Cloudability and try our Reserved Instance Planner.
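
For a rough sense of the arithmetic, the sketch below compares a year of on-demand usage against a one-year Heavy reservation at 100% utilization. All prices are illustrative placeholders; plug in the published rates for your instance type and region.

# Illustrative Heavy Reserved Instance vs. On-Demand comparison for one
# always-on instance over a one-year term. Prices are placeholders.
HOURS_PER_YEAR = 8760
ON_DEMAND_HOURLY = 0.28          # assumed on-demand rate
RI_UPFRONT = 700.00              # assumed one-year Heavy RI upfront fee
RI_HOURLY = 0.07                 # assumed Heavy RI hourly rate

on_demand_cost = HOURS_PER_YEAR * ON_DEMAND_HOURLY
ri_cost = RI_UPFRONT + HOURS_PER_YEAR * RI_HOURLY   # Heavy RIs bill every hour of the term

savings = 1 - ri_cost / on_demand_cost
print(f"On-demand: ${on_demand_cost:,.2f}  Reserved: ${ri_cost:,.2f}  Savings: {savings:.0%}")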

However, keep in mind that this cost-saving strategy is best suited to static usage. When you run your big data workloads intermittently, whether at regular intervals or on an as-needed basis, your usage will tend to look spiky. That spikiness reduces the effectiveness of Reserved Instances, but there’s another savings tool specialized for spikes: Spot Instances.

Spot Instances provide additional compute power when needed at a drastically reduced price versus On-Demand. How much additional compute power they provide depends on their fluctuating price; Spot Instances shut down when the Spot price rises above your bid, so they should generally be used for interruption-tolerant tasks. For example, you can add Spot Instances to a Hadoop job in Elastic MapReduce to provide additional capacity for finishing the job.
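
As a minimal sketch, the boto3 call below adds a Spot-priced task instance group to a running EMR cluster; the cluster ID, instance type, count, and bid price are placeholders.

import boto3

# Add a Spot-priced task instance group to a running EMR cluster to speed up
# a Hadoop job. The group disappears if the Spot price rises above the bid,
# so keep interruption-sensitive work off of it.
emr = boto3.client("emr")
emr.add_instance_groups(
    JobFlowId="j-XXXXXXXXXXXXX",      # placeholder cluster (job flow) ID
    InstanceGroups=[{
        "Name": "spot-task-capacity",
        "InstanceRole": "TASK",
        "Market": "SPOT",
        "BidPrice": "0.05",           # dollars per instance-hour, as a string
        "InstanceType": "m1.large",
        "InstanceCount": 4,
    }],
)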

Spot Instances can potentially save you up to 86% versus on-demand, so AWS users with spiky or irregular usage should take serious advantage of this savings opportunity. To analyze patterns in your own usage spikes and determine which tasks would make good candidates for Spot Instances, log into Cloudability and run a Usage Analytics report.

Introduce unit cost to measure business value

Once you’ve put all of these processes in place, you can assess their impact on your bottom line by mapping the cost of your big data workflows against key business metrics, such as cost per analysis or cost per project. This metric, called unit cost, can be tracked over time to measure how efficiently your big data costs are being managed. That is, even if your big data costs are rising overall, the unit cost of running each daily analysis, for example, may be falling over time as you fine-tune your cost management strategy.
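
A simple way to track it is to divide each period’s big data spend by the number of analyses (or whatever unit matters to your business) completed in that period. The figures in this sketch are made up for illustration.

# Unit cost over time: total big data spend divided by the number of
# analyses completed in the same period. Figures are illustrative only.
monthly = [
    {"month": "2014-06", "big_data_cost": 9200.0,  "analyses_run": 310},
    {"month": "2014-07", "big_data_cost": 10100.0, "analyses_run": 390},
    {"month": "2014-08", "big_data_cost": 11000.0, "analyses_run": 500},
]

# Total spend rises month over month, but cost per analysis falls.
for row in monthly:
    unit_cost = row["big_data_cost"] / row["analyses_run"]
    print(f"{row['month']}: ${unit_cost:.2f} per analysis")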

Keep iterating

As with any cloud service, AWS big data solutions require ongoing management. Whether you’re changing the frequency of your big data workflows, archiving from S3 to Glacier, or simply want everything to keep running as planned, staying attentive to your usage and the accompanying costs will keep your spending controlled and efficient. Keep the following processes in mind:

[Image: Managing the cost of your big data workflows on AWS]

Revisiting the steps above on a regular basis is key to smooth sailing as you navigate the big data sea. Ready to get started? Log in or sign up for a free 14-day trial of Cloudability Pro to find savings opportunities in your Reserved Instance infrastructure, identify Spot Instance opportunities with a Usage Analytics report, assess your EC2 workflow efficiency with a Cost Allocation report, and more.
