The ghost of DevOps past
I was looking through the list of workers connected to our system one day when I noticed something a little troubling: an IP address that I didn’t recognize. We have an elastic infrastructure here, so it wasn’t too weird, but based on the number of workers connected and running on that box, it still didn’t feel right. Was someone listening in on us?
Dealing with Cloud elasticity
This is a pretty common problem for public cloud users. The elasticity of the public cloud allows you to have just the right number of servers for the load your app is taking, so you don't have to over-buy servers in anticipation of more users. However, this ever-changing environment often presents difficulties for Ops teams. Accustomed to controlling the infrastructure, these teams can have a hard time adjusting.
Our infrastructure at Cloudability is pretty elastic, too. In fact, applications can control themselves! Developers and apps can control their own scaling patterns, which lets us take full advantage of the Cloud.
So how do we deal with scenarios such as the aforementioned rogue IP, given that our environment is always changing?
It so happens that we have a DevOps tool that I don’t hear a lot of other people talking about: infrastructure logging, which is inspired by application logging. Every state change of our infrastructure is logged in a database that we can reference later. At Cloudability all of this happens automatically and behind the scenes.
Why would we want this? It creates an audit trail of user actions, functions as an early warning system, and gives you data for post-mortems. As a result, dealing with this rogue IP scenario actually turned out to be a breeze; identifying the source of the IP simply came down to a SQL query.
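As an illustration, with a hypothetical instance_events table (the schema and column names here are assumptions, not our actual production schema), the lookup can be a one-liner:

```python
# Sketch: resolving an unfamiliar IP against an infrastructure event log.
# Uses an in-memory SQLite database with a made-up schema for demonstration.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE instance_events
              (instance_id TEXT, ip TEXT, event TEXT, occurred_at TEXT)""")
db.executemany("INSERT INTO instance_events VALUES (?, ?, ?, ?)", [
    ("i-0abc123", "10.0.0.5",  "launched", "2013-01-01 09:00"),
    ("i-0def456", "10.0.9.99", "launched", "2013-01-02 14:30"),
])

# "Where did this IP come from, and when?" becomes a single query:
row = db.execute(
    "SELECT instance_id, event, occurred_at FROM instance_events WHERE ip = ?",
    ("10.0.9.99",),
).fetchone()
print(row)  # ('i-0def456', 'launched', '2013-01-02 14:30')
```

Because every state change is already in the database, there's no scrambling through consoles or SSH sessions when something looks off.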
So how would you go about building a tool that does this?
Accessing your statistics
To introduce this to your infrastructure, you'll primarily need two things. First, you need to know how many instances are started or running at any given time. Second, you need specific statistics for each: CPU utilization, I/O stats, etc.
The EC2 DescribeInstances API endpoint in AWS gives you the instance data you need. The CLI is a good way to test this stuff out.
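Here's a rough sketch of working with that result set in Python. The live call (via boto3's EC2 client) is shown in a comment; the sample below is a trimmed-down response shaped like the real API output:

```python
# Sketch: pull instance IDs out of a DescribeInstances response.
# A live call would look like: boto3.client("ec2").describe_instances()
# (the boto3 client, credentials, and region configuration are assumed).

def instance_ids(response):
    """Flatten a DescribeInstances response into a list of instance IDs."""
    return [
        inst["InstanceId"]
        for reservation in response["Reservations"]
        for inst in reservation["Instances"]
    ]

# A trimmed-down sample response, shaped like the real API output:
sample = {
    "Reservations": [
        {"Instances": [{"InstanceId": "i-0abc123"}, {"InstanceId": "i-0def456"}]},
        {"Instances": [{"InstanceId": "i-0aaa789"}]},
    ]
}

print(instance_ids(sample))  # ['i-0abc123', 'i-0def456', 'i-0aaa789']
```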
This example just pulls out the instance IDs from the DescribeInstances result set; this gives you a list of all instances you need.
CloudWatch is the system that collects statistics about your AWS resources. If you use RDS or ElastiCache, you can find all kinds of metrics in there, too. GetMetricStatistics is the API endpoint; it retains data for up to two weeks. To access this data, give it an instance ID and a metric type, then supply a start time, an end time, and the number of seconds between samples. With basic monitoring, data points usually arrive every five minutes.
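A minimal sketch of consuming that data: the live call is shown in a comment (parameter names follow the boto3-style API and should be checked against the docs), and the function below just averages the datapoints out of a response-shaped dict:

```python
# Sketch: summarize CPU datapoints from a GetMetricStatistics-style response.
# A live call would look roughly like:
#   cloudwatch.get_metric_statistics(
#       Namespace="AWS/EC2", MetricName="CPUUtilization",
#       Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
#       StartTime=start_at, EndTime=end_at,
#       Period=300, Statistics=["Average"])

def average_cpu(response):
    """Average the 'Average' statistic across all returned datapoints."""
    points = [dp["Average"] for dp in response["Datapoints"]]
    return sum(points) / len(points) if points else None

# A sample response with three five-minute datapoints:
sample = {"Datapoints": [{"Average": 20.0}, {"Average": 40.0}, {"Average": 30.0}]}
print(average_cpu(sample))  # 30.0
```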
Now you’re getting instances, and statistics for those instances, every hour. Time to put it in a database!
Storing and using your data
So, the “algorithm” is basically to update your instance list every hour, then fetch each statistic for each instance. With 8 statistic types per instance, 10 instances means 80 metric calls per hour.
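That hourly loop can be sketched like this. The exact metric names below are an assumption (common EC2 metrics), and fetch_metric stands in for the CloudWatch call:

```python
# Sketch of the hourly collection loop: every instance x every metric.
# fetch_metric is a placeholder for the real CloudWatch call.

METRICS = [  # eight per-instance statistics (names assumed for illustration)
    "CPUUtilization", "DiskReadOps", "DiskWriteOps", "DiskReadBytes",
    "DiskWriteBytes", "NetworkIn", "NetworkOut", "StatusCheckFailed",
]

def collect(instance_ids, fetch_metric):
    """Return one (instance, metric, value) row per metric per instance."""
    rows = []
    for instance_id in instance_ids:
        for metric in METRICS:
            rows.append((instance_id, metric, fetch_metric(instance_id, metric)))
    return rows

# With 10 instances, that's 80 metric calls per hour:
rows = collect([f"i-{n:03d}" for n in range(10)], lambda i, m: None)
print(len(rows))  # 80
```

Each row then gets an hourly timestamp and goes into the database.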
Now that you have the data, what can you do with it?
First, capacity planning. Capacity planning is basically an exercise in looking at what your infrastructure has looked like and projecting what it will look like in the future. To do that, you not only have to know what your infrastructure looked like, but also how it performed. You can do this by comparing your metric data against the metric data of your app.
Next, there’s budgeting and projections. Budgeting is very similar to capacity planning, but with a slightly different twist. If you’re an AWS user, you’ve heard of Reserved Instances, which are one of the key focuses of budgeting for cloud costs. This can be really, really hard without the right information; each RI type, depending on size, Availability Zone, etc., has a different break-even point, and in order to determine which one is most appropriate to your usage, you need to count how many instances you’re running at any given hour.
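As a sketch of that counting step, assuming you've logged launch and termination times for each instance (the record shape here is hypothetical; substitute your own schema):

```python
# Sketch: count running instances per hour from logged (launch, terminate)
# times -- the raw input to a Reserved Instance break-even calculation.
from datetime import datetime, timedelta

def hourly_counts(records, start, end):
    """Map each hour in [start, end) to the number of instances running then."""
    counts = {}
    hour = start
    while hour < end:
        counts[hour] = sum(
            1 for launched, terminated in records
            if launched <= hour and (terminated is None or terminated > hour)
        )
        hour += timedelta(hours=1)
    return counts

records = [
    (datetime(2013, 1, 1, 0), None),                     # still running
    (datetime(2013, 1, 1, 1), datetime(2013, 1, 1, 3)),  # ran for two hours
]
counts = hourly_counts(records, datetime(2013, 1, 1, 0), datetime(2013, 1, 1, 4))
print([counts[h] for h in sorted(counts)])  # [1, 2, 2, 1]
```

Comparing those hourly counts against each RI type's break-even point tells you where reservations would pay off.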
Finally, this data can help with auditing. Auditing and Root Cause Analysis are a lot more detailed, but knowing specifically how your infrastructure behaved at certain points in time can be immensely helpful in these processes. We use Papertrail to do this, but you can correlate your usage data with whatever is in your system of choice.
As you can see, creating a historical record of your infrastructure can be immensely helpful when it comes to dealing with the elasticity of the Cloud. Don’t miss out on the data that can save you valuable time on planning, budgeting, problem-solving and more; keep that data logged!