The Science of Saving with Reserved Instances at re:Invent 2015
But few things can kill a project faster than an empty bank account—which is why at our speaking session, we discussed how to use Reserved Instances to eliminate another blocker: wasteful spending.
In the session recording below, Cloudability Co-Founder and CCO J.R. Storment discusses strategies for using Reserved Instances to optimize spending across a growing infrastructure. Topics include:
- Choosing the right Reserved Instances to buy and modify
- Centralizing and streamlining the buying process
- Tracking key metrics to show ROI
Managing your Reserved Instance portfolio can present challenges—you’ll need the right tools, metrics, and strategies in place to maximize coverage and minimize the risk of your Reserved Instances going to waste. Watch the full talk now and check out the accompanying slides to learn the science of saving with Reserved Instances:
I’m cofounder at Cloudability and today I’m going to be talking about Reserved Instances. So both the science of saving with them, how to be more efficient with your spending, and also the cultural aspect of how to get them working effectively within your company. Cloudability is a platform that basically helps companies who run on AWS and other clouds get better visibility into their spending, be more efficient as they scale, and save money.
It’s through my work at Cloudability and through the team’s work that we’ve had the opportunity to be along for the ride on the cloud journey with a lot of Amazon’s biggest users; folks like these guys, folks like GE and Autodesk. Through the process of working with them as they build and grow on AWS, we’ve been able to glean a lot of best practices from them, and share a lot of best practices between our customers. That’s really the basis of what we’re going to be talking about today: working with the $2 billion of cloud spending that we’ve managed for these types of folks, how we’ve seen them be effective at RIs, and sometimes not effective as well.
So quick level setting, show of hands: how many of you have purchased RIs before? Keep your hands up. How many of you have purchased more than one RI? Multiple at once? Dozens? Hundreds? Thousands? One person? I was hoping for one. So that’s a good mix, that’s generally what I’d expect. For those of you who have done a lot of RI purchasing, the story you’re gonna see in here will probably have some familiarity. For those of you who have not done much at all, there’re gonna be some pitfalls you’ll want to pay attention to and hopefully avoid.
The story I wanted to tell you essentially is the story of an RI purchase that went really badly. And given that we’re in Vegas, I thought that I should call this story The (RI) Hangover, because it’s really about a surprising turn of events after our purchase. If anyone hasn’t seen the movie The Hangover, it’s about a bunch of friends who come to Vegas for their bachelor party, and they have this wild night, and they wake up in the morning and they’re surrounded by a bunch of bad things in their hotel room, and they don’t know how they got there, and they spend the movie trying to unravel what happened.
The main character in this story is actually a Fortune 500 company who made a big RI purchase, and you can kind of guess where this is going from the title. Let’s start with how this company–we’ll call this company ACME Corp–came to be buying RIs. This trend here is one you may recognize if you have been growing on cloud for a while. This is the story of their growth in cloud–so the Y axis is essentially how much they’re spending a month, in dollars, and the X axis is time.
So they started, as most people do, with some dev/test on cloud, and it was great. It was kind of like, in the movie, the first drink of the night. They felt great, they got a little buzz on, this was awesome. Amazon gets you the servers you want right away, they don’t have to be provisioned, good stuff. After that, they had a little involuntary “load test,” where they essentially spun up a bunch of servers, and forgot to turn them off, and spent a bunch more than they meant to. That wasn’t great, but they got it under control, and moved on from there.
A little bit down the road, they got their first POC in place, and it went well! They were excited; this looked like it was going to be a great thing, so they continued forward. They launched their first apps into production. Not long after that, more and more apps followed suit, and this was amazing for them. Suddenly they weren’t bound by long schedules to get hardware. They were launching things as quickly as they needed; they didn’t have to wait for hardware; the cloud was great. It was all thumbs up. Until they finally woke up, one day, with a splitting headache, because this happened.
The exec team started to really care about the bill. They had crossed this invisible threshold where it really started to matter. This company in particular, ACME Corp, got an email one morning from one of their VPs, and he was furious. He said “this thing is out of control, we’ve got to get this in line–what are you guys going to do about it?”
And they started running around–they had meetings, they had emails going back and forth. The first thing they thought of was: let’s turn off a bunch of servers. It’s the cloud, we’ve got the elasticity component; if we’re not using the servers, we’re still going to have to pay for them, so we might as well shut them down. They started to look at that and realized it was going to take a lot of work to do the analysis, they were going to have to pull engineering resources and ops resources–and they had a bunch of things going on: “we don’t have time for that.”
They thought: let’s get rid of some of this. Let’s get rid of the 8xlarges and go down to 2xlarges. Again: “we can’t really put engineering resources into that right now.” So they thought: RIs. Because RIs are great; you don’t need to have any engineering or ops resources touch the infrastructure, you just write a check, and your bill goes down. That seemed great, and they’d seen some graphs that look kind of like this. This is from a webinar we did with Adobe and Amazon on RI buying. Adobe showed that they can save–they had saved–60% off their EC2 bill by buying RIs. In fact they bought 3-year RIs and had a pretty significant savings.
This seemed like a great thing to the company. They said “awesome, let’s go buy RIs” and sent a DevOps person to do analysis, and they bought some RIs, and it was done. It became this. High fives all around. “We did it! We got the RIs!” And this wasn’t a small company, this was a Fortune 500. They didn’t buy a few RIs. They bought tons of RIs. They bought 7 figures of all upfront RIs. This was going to put a huge dent in their spending, you know, this was going to solve the problem. So they all enjoyed a round of drinks and they went back to business.
The next month, we saw this happen. The bill continued to go up, after the 7 figure purchase. They woke up again the next morning to an email from their VP, even more upset than he was before, about what happened. They just spent all this money, how is it possible that the bill went up? They’re sitting there scratching their heads, going “we remember doing the analysis, we remember placing the order, we remember the high fives…how did we end up here?” So that’s what I want to unravel a bit in this talk–what went wrong, and how those of you who haven’t run into these pitfalls can avoid them as well.
The three big areas–and we’ll dig in in more detail on each of these–the three big things they did were first: they involved the wrong people. They didn’t involve all the stakeholders. Second, they used the wrong data. They didn’t use the billing data, which is what the RIs are actually based upon. And third, they looked at the wrong timeframe. They didn’t look at what they were currently using, they looked at future plans.
I’d love to get another show of hands. Who do you think should be responsible for, or drive, that RI purchase? Engineering? Ops? Product Owners, Business Owners? Finance / Procurement? You’re kind of all right. Everybody’s got a say in this for a lot of different reasons. If you were to simplify this into two groups, the first two fall into a tech org; the second two fall into a business org, to use a broad term. They all have a part to play in this process. What happened with these guys, the reason they got stuck, is that it was just a single Ops person who went off and did the analysis, and made the purchase.
With RIs there’s definitely a technology decision to make. You need to know what you’re going to use, what you will be running in your infrastructure–but there’s also a financial one. In this case, and actually in general, you need the tech team to commit to basically using a certain amount of a type of instance, at least until the break even point. When you buy RIs, there’s usually a 4-6 month break even point. Could be 2, could be 8 months, depending on the type, but you need to know you could be running things after that point, and ideally long after that point, so you can start saving money.
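To make the break even point concrete, here is a minimal sketch of the arithmetic for an instance that runs continuously. The helper name `break_even_months` and all of the prices are assumptions made up for illustration; they are not AWS rates or anything shown in the talk.

```python
def break_even_months(upfront, ri_hourly, on_demand_hourly, hours_per_month=730):
    """Months of continuous use before an RI becomes cheaper than on demand.

    Cumulative RI cost is upfront + ri_hourly * hours; cumulative on-demand
    cost is on_demand_hourly * hours. The two are equal at the break even point.
    """
    hourly_saving = on_demand_hourly - ri_hourly
    if hourly_saving <= 0:
        raise ValueError("RI hourly rate must be below the on-demand rate")
    return upfront / (hourly_saving * hours_per_month)


# Hypothetical numbers for illustration only (not real AWS prices).
months = break_even_months(upfront=500.0, ri_hourly=0.05, on_demand_hourly=0.20)
print(round(months, 1))
```

With these placeholder prices the result lands inside the 4-6 month range mentioned above; the tech team's commitment is essentially a promise to keep the instance running past that point.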
On the other side of the house, though, the business also has to commit to spending a certain amount of money for that project, and more importantly they need to decide how they want to deploy the capital. So do they want to put up all the money up front, with an all up front RI buy, or do they want to pay over time, with little or no up front? It’s a decision about how they’re going to pay for that rate. In this case, and this wasn’t the biggest mistake, they bought all upfront RIs, and that both increased the cash outlay, which made the financial org actually be very–I wouldn’t say upset is the word–they were surprised when they saw it hit, they didn’t have a heads up on this coming, and also the all upfront RIs, because you pay for the entire thing at once, all the hours are allocated at once, you lose the ability intra-month to find out what you’re wasting. I’ll go into more detail on that later. The key takeaway with that is that you can’t actually see month to month what is being used.
The second area–and this is probably the most important area where they went wrong–is they didn’t use billing data for RI planning. RIs are a billing construct. This may be confusing, but they are not a technology. You’re not buying “an instance” or a specific instance, you’re literally just buying a coupon to get a better rate for an instance. It’s the coupon you’re buying, not the instance. The tech team at ACME Corp knew they were running RHEL. They knew the infrastructure cold. The problem was, they had brought their own licenses to the boxes, so Amazon was not billing them for RHEL, Amazon was billing them for plain vanilla Linux. Now, what this meant was, when the bill hit, all these RHEL RIs they bought, which was most of the RIs they purchased, weren’t applied. On top of that, all the vanilla Linux boxes that they were actually running were running at on-demand rates, which are more expensive than reserved.
The third thing they did was they bought based on future plans, not their actual spending habits. RIs are applied in each and every hour; Amazon cycles through your infrastructure, they look for a box that matches the coupon, and they apply it. They can be changed really rapidly as your infrastructure changes. In this case, the tech team had expected to move from M3s to C3s in a few weeks. “This big change is about to happen, let’s buy the C3s, because when we get there we’ll have the savings in place.” The issue was, these things happen: there was a delay. They were delayed by several months, and that meant during that entire time, all those C3 reservations that they bought weren’t being used. 7 figures, just sitting there. They couldn’t actually report on the waste either because of the RI purchase they made, the all up front.
On top of that, they had all the M3s that were running, running on demand. So you can see how these things stack up.
On the topic of purchasing “as you need it”: the cloud is really built for “just in time purchasing”. If you recognize this graph, it was on the Amazon homepage, I think for a couple years, and it basically outlines why cloud has some advantages over data centers. In the data center world, you need to buy ahead of capacity; “I know I’m going to need this much capacity, so I need to get servers in place months ahead of time” vs. cloud, which you just spin up as you need it.
Everybody gets this intuitively–this is the fourth year of re:Invent, it’s huge, everyone knows about Amazon, this makes sense–but they don’t apply it to RIs as much. RIs are just about the same. You can buy them very rapidly, there’s an API for it. You can provision quickly and you want to follow the process “what are we using right now? Let’s make sure that’s covered” rather than “what are we using next month? Let’s buy those ahead of time,” and they’re just sitting there.
So, this team at ACME Corp realized they’d made a big mistake and realized they needed to make some drastic changes; stakes were too high. So they made some changes, and they really focused around two things. The first was a new process–they totally changed the way they ran the RI buying, and the second was identifying some new players to be involved in the new process.
The process looked something like this: at a very high level, they had a buy–>measure–>learn approach. You might recognize that from agile methodology, the idea is that you iterate through the process. You make changes through the process, you learn, you repeat that cycle. They attached a number of steps that we’ll go through here in a minute.
On the other side, they introduced a new person. They introduced an RI Czar. The RI Czar is responsible for looking at the billing data each month and identifying opportunities to increase Reserved Instance coverage. This sounds really expensive, right? “How can I put someone on this? I need people to run my business, this isn’t something we can focus on” and that’s kind of what happens. Nobody really dedicates people to the RI process, and as a result it doesn’t happen frequently enough at most companies. It happens quarterly, it happens once every six months, it happens once a year–as the infrastructure is moving wildly every day.
What we’ve found in the companies that implement this type of role is that the key thing is making it somebody’s job: somebody needs to be measured on RI purchasing, and there are some great metrics we’ll get into in a minute that you can track to see “are we making progress, and did the buys that we made actually reduce our bill, or reduce what we would’ve paid?” The math on this actually works pretty well.
Essentially, if you’re spending a lot of money on EC2, and you do RI purchasing properly, you can save 30-60%. If you’re spending a million or more dollars a year, and there are a lot of people at this conference spending a million a month–that can add up to some serious dollars. So there’s a pretty good business case maybe not to make this a full time person, (although I know of companies who have 2 or 3 people who are dedicated to this), but at least make it one of somebody’s primary objectives that they’re focused on.
The interesting thing here I think is the third point, which is that it’s usually not someone at the tech org. At the skilled companies who are purchasing thousands of RIs, who are running 7 figures a month of infrastructure, tens of thousands of instances, it’s usually somebody who falls under a finance, procurement, or vendor management organization, but a very technically minded person in that group. Maybe they have a light CS background, something like that. The reason for that is again, RIs are not a technology, they’re a billing construct. Specifically they’re a rate. If you go back to the business world, there’s going to be one group of people deciding “what are we going to need for servers?” and another group is going to decide “how are we going to finance those? Do we want to pay up front?”
Your tech team is focused on usage “how do we use less servers, less dynamo,” or whatever it is, and this person is focused on “how are we going to reduce the rate” for those things that they’re using.
Before we get into detail on that 3 step process, I wanted to go over a little bit on the basics of RIs, especially folks who haven’t bought them, so we’re all on the same page. Reservations have two parts. You get a cost savings–the main reason people buy RIs is that they want to save money– and you get a capacity reservation. The capacity reservation says if Amazon has a shortage of a certain type of instance in a certain type of region, you’re gonna get first dibs. It’s not a guarantee, but it’s kind of like the VIP line; you’re going to be at the front of it.
Reservations, again, are a coupon. So in a 31 day month, a reservation is going to give you 744 hourly coupons, one for each and every hour of the month. You don’t have to do the application yourself; Amazon every hour is looking to see “I’ve got this type of instance, I’ve got this coupon that matches it, let’s apply it; I’ve got this other instance and no coupon, they’re gonna run on demand.” But it happens every hour and it cycles through, and this is important to realize as you’re doing planning, because you can’t look at total instances for the month, or total instances for the day, or anything to do with instance count; you need to look at the hours. How many of each type of instance are running each hour, so you can figure out how many coupons you need in each hour.
That’s particularly important if you have lots of accounts because that’s aggregated together. You may have 30 or 40 or 100 accounts, all using different types of instances, that are going to count up to see how many different reservations you need. These are the things that make up a reservation: you have an instance family (m1, r3, m3), you have a size within that, you have an operating system–note Linux and RHEL, etc–and you have a term. Most people run 1 year RI terms. If you run the numbers–there are actually great blog posts on our website about this–the 3 year reservations actually generally are the best deal. In the data we ran, I don’t think there has ever been a case where the 3 year wasn’t the better savings rate.
So, don’t be afraid of them, they’re actually something to consider. Adobe said in their webinar, buying at their scale, they’re actually buying 3 year RIs. It’s a good thing to consider. Then you’ve got 3 types: no upfronts, partials, and all upfronts. And again, they’re just different ways to pay for the same thing. A no upfront is kind of like signing a lease. You’re saying “I’m going to use this type of instance for 12 months, but I don’t have to pay anything initially, I’m just going to kind of pay as I go.” The partials are like buying a house with a mortgage: you have a down payment, and then you have a reduced rate, but you keep paying some every month. And then the all upfronts are like paying cash: you’re paying for all 12 or 36 months in one payment.
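A small sketch of why hours matter more than instance counts. The function names and the six-hour usage sample are hypothetical, invented only to illustrate the hourly coupon idea described above.

```python
def hours_in_month(days):
    """Number of hourly coupons one reservation provides in a month."""
    return days * 24


def covered_hours(running_per_hour, reservations):
    """Instance hours that get the reserved rate.

    Each reservation yields one coupon per hour, and a coupon that finds
    no matching instance in its hour is simply lost.
    """
    return sum(min(running, reservations) for running in running_per_hour)


# Hypothetical sample: matching instances running in each of six hours.
running_per_hour = [4, 4, 2, 0, 1, 4]
reservations = 2

total_instance_hours = sum(running_per_hour)
coupons_available = reservations * len(running_per_hour)
used = covered_hours(running_per_hour, reservations)

print(hours_in_month(31))            # coupons per reservation in a 31 day month
print(used, "of", coupons_available, "coupons used")
print(total_instance_hours - used, "instance hours left on demand")
```

In this made-up sample there are 15 instance hours and 12 coupons, yet only 9 coupons are used, because usage dips below two instances in two of the hours. A monthly total would hide that.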
Again though, because that’s all coming as one chunk and it all hits your DBR at once, you do lose this special line item that shows up for the other two, called the injected line item, that tells you how many hours you’ve used within that month for that type. So the all upfronts are the biggest savings, but you’ll notice that the savings is, in this case, 1.3% different, and that’s pretty common. The all upfronts generally have a 1-2% higher savings rate, so not a big amount of additional savings for the cash outlay.
The interesting thing about all 3, and this, I think, changed at the beginning of this year, is that all 3 are a commitment to pay for full utilization. Whether or not you use the full RI hours, you’re going to pay for them. In the case of ACME Corp, 7 figures out, paid for it all, they were never used, dollars still stay on the table. Because of that, it’s really important to think of your reservations and your infrastructure as two sides of the same coin. Your infrastructure on one side is going to be changing all the time, ideally. You’ve got autoscaling happening, ideally, you’ve got instances changing size, you’ve got things changing your infrastructure based on the load, the need. And that’s happening every hour, sometimes every 5 minutes. On the other side, you need reservations to keep up with that, keep up through that process.
One last piece on the level setting around RI basics before we dig into the process, and I think this trips up a lot of people in terms of where to put the RIs. There’s a thing called RI affinity that exists between linked accounts, and if you have an account structure with multiple accounts at Amazon and consolidated billing, you have something that looks like this. You have a master payer at the top, and all the charges from the linked accounts flow up to that master payer. So you can buy RIs at either place, and you get a cost savings and a capacity reservation regardless of where you put it.
But let’s say you bought in the master payer and no instance was running in the master payer that could use that RI coupon. The cost savings is going to flow out to the linked accounts. It will first try to flow out into the account in which the RI was purchased, if not it will go elsewhere. Same thing on linked accounts. If you buy an RI on a linked account, it will first try to apply it there, and then flow out to other accounts to find a match, assuming there is one. Getting into this Buy Measure Learn process here, there are sort of 3 steps you do along the way.
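The affinity behavior described above, where a coupon is tried first in the account that purchased the RI and then flows out to other accounts, can be sketched for a single hour like this. The function `allocate_coupons` and the account names are hypothetical, and the order in which the remaining accounts are tried (alphabetical here) is an assumption; the talk does not say how Amazon orders them.

```python
def allocate_coupons(purchasing_account, running_by_account, coupons):
    """Allocate one hour's RI coupons across a consolidated billing family.

    The purchasing account is served first; leftover coupons flow to the
    other accounts (alphabetical order is an illustrative assumption).
    Returns (coupons applied per account, coupons left unused).
    """
    others = sorted(a for a in running_by_account if a != purchasing_account)
    applied = {}
    for account in [purchasing_account] + others:
        if coupons == 0:
            break
        used = min(coupons, running_by_account.get(account, 0))
        if used:
            applied[account] = used
            coupons -= used
    return applied, coupons


# Hypothetical hour: matching instances running in each account.
running = {"master": 0, "linked-a": 3, "linked-b": 1}
applied, unused = allocate_coupons("linked-b", running, coupons=3)
print(applied, unused)
```

Here three RIs bought in `linked-b` cover its one instance first, and the other two coupons flow out to `linked-a`.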
Buy is pretty obvious: you’re going to make a purchase. Measure: you’re looking at ROI. In the learning stage you’re going to align your measurements to your portfolio. The key here is to minimize the time through this loop. Looking at that coin metaphor, where you’ve got the infrastructure on one side, reservations on the other: your infrastructure is moving all over the place. Ideally, you’re adjusting to the right resources for the types of things you’re doing; you want to make sure this process goes quickly.
We’re going to talk about some ways to do that. The metrics you’re going to look at at pretty much every stage of this process are 2 things we like to call the Green line and the Red line. The Green line is basically the percentage of hours within a period that are covered by reservations. So, if you have 100 hours running, 10 of them are covered by reservations, 10% coverage, basically. Red line is the amount of unused, but paid for (because every RI hour every month is paid for) hours. That one is obviously really important. You want to make sure you’re not wasting the dollars you’ve spent.
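As a rough sketch, both metrics reduce to simple arithmetic over hours. The function names `green_line` and `red_line` are just labels borrowed from the talk, not part of any Cloudability or AWS API, and the 744/700 figures are invented for illustration.

```python
def green_line(reserved_hours_used, total_hours):
    """Coverage: percentage of running hours billed at the reserved rate."""
    if total_hours == 0:
        return 0.0
    return 100.0 * reserved_hours_used / total_hours


def red_line(reserved_hours_paid, reserved_hours_used):
    """Waste: RI hours that were paid for but matched no running instance."""
    return reserved_hours_paid - reserved_hours_used


# The example from the talk: 100 hours running, 10 covered by reservations.
print(green_line(10, 100))

# Hypothetical month: one RI (744 hours paid for) that matched 700 hours.
print(red_line(744, 700))
```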
The Green line looks like this. This is a default one off a Cloudability dashboard. You can lay a bunch of other stuff on it if you want. But the basic premise here is the top line is your total number of hours (in this case, on demand) and the bottom is reservations. So how much of it is running under reservation. So we’re going to be coming back to this one a lot. Generally you want to calculate a reserved utilization or reserved coverage rate. “What’s our percentage?” great way to measure people, if you have a way to measure this. How are they tracking against our goals, how is the coverage rate going up or down.
And then there’s the red line. You guys will probably notice there is no red in this graph. I apologize; badly named. It’s a good meme though. So, the Red line, basically, in this example, is showing month to month, the comparison between our total RI hours we paid for, vs. how many we didn’t use. The tall blue lines are total hours, and the shorter, bluish green lines are unused–the cost of those unused hours. In this example, we’ll see that back in January, we had very little unused hours. February none. March we had a little more, April more, something happened, nobody paid attention, and in the current month, we’re still tracking, when this was taken, they still have that same amount unused.
Using those two metrics, let’s look at the first part of the process: the purchase process. First and foremost, with purchasing, you want to walk before you run. It takes a long time to really get to high coverage without a lot of waste. There are a lot of areas you can get stuck, so I highly recommend your first buy be small and uncontroversial. It’s not a “boil the oceans” type of approach. If you boil the oceans, you’ll probably end up spending more than you need to. So look for things that nobody’s going to be surprised by; “we’re going to be running M3s in US-East-1a.”
You also want to focus your budget on high confidence purchases, so specifically things we know are going to be on all the time, and are going to continue to be on, are a great place to start. Remember there’s a break even point: if break even for an instance is at, say, 50% of hours on, buying something that’s on 52% of the time is probably not a great place to start because you’re not going to save that much. Buying something that’s on 100% of the time means much more savings, so we want to start there.
Tied into that, you want to focus on your highest savings first. It’s not so much about the confidence; it’s more about “I have a bunch of these” or “this is an instance type where there’s a really good discount on RIs”. Some RIs you may only save 20%, some you may save 50%–let’s find the low hanging fruit first.
Within Cloudability, if you’re using this tool to dig in there, you’d basically want to start first looking at which account, because the account really matters where you buy it. So we’d want to ideally plan at the master level, because you want to aggregate together all the potential hours we have to purchase the fewest possible RIs to get the highest coverage. We’d plan at that master level, although we could go into account by account if we wanted. Then we want to pick a date range–key, because you need to buy based on actual usage, not future usage. With the date range, there’s some nuance here. If you have a highly elastic infrastructure or, let’s say, quickly growing infrastructure, it’s really up and to the right–you don’t want to look over the last 3 months, you want to look over the last week, because that’s going to be most representative of now, going forward. Whereas if you’ve got kind of up and down, and it’s pretty stable, you might want to look at a 2 month range. Timing is really key, though, this will drastically affect what you’re looking to buy, modify, and everything else.
Then you’re going to get into what we looked at earlier, those classes of instances. You’re going to buy an i2.xlarge US-east-1a Linux, and we’re going to then look for the lowest hanging fruit: in this case the list is sorted by the biggest savings, so we want to find the single reservation purchase that saves us the most money. This one looks great, it’ll save us 58%.
Taking this a step further, this is actually a new feature (if anyone has used Cloudability in the past, this was just recently added), you can actually set a usage rate. Let’s say I want super high confidence. Instances that are going to be running all the time, that are going to give you the most savings. 100% Let’s start there. Then maybe next month you can dial it down to 90%–again, things that you’re the most confident about.
On the back of that, you’re going to want to do this, which is confirm the math. It basically compares what you would pay on demand vs. what you would save if you made the RI purchase. The two things I want to call out here, what’s really important to look at if you’re using this tool: you want to see what percentage of the hours are the instances running, and you want to compare that to break even. So in this case, it’s 99%, and the break even is 41%, so that’s a really high confidence purchase. “If we make this purchase, we’re going to make a lot of money back if things continue as they are”
The other thing that’s really interesting to look at is the cash flow comparison, which is basically saying, in this example, in about month 5, we’re going to break even. We’re going to start saving money on that. So if we’re really confident this project is going the next 9 months, that’s probably a really good buy to make.
Once we’ve made that small, confident, uncontroversial purchase, we want to loop back into that measurement process. This generally happens monthly: ACME Corp is running this process with a team that comes together once a month and says “how did we do?” If they made a purchase that was effective, the Green line should go up. All things being equal, same amount of hours, Green line increases.
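The usage rate, the break even rate, and the break even month are linked by simple arithmetic. The sketch below is one way to see that, under two assumptions that are not stated in the talk: a 12 month all upfront RI, and a break even rate defined as the RI's total cost divided by what running on demand for the whole term would cost. The function name `break_even_month` is hypothetical.

```python
import math


def break_even_month(break_even_usage, actual_usage, term_months=12):
    """First month in which cumulative on-demand spend reaches the RI's cost.

    Assumes an all upfront RI whose total cost equals `break_even_usage`
    times the cost of running on demand for the full term. Returns None
    if the purchase never pays for itself within the term.
    """
    if not 0 < actual_usage <= 1:
        raise ValueError("actual_usage must be in (0, 1]")
    months = break_even_usage * term_months / actual_usage
    if months > term_months:
        return None
    return math.ceil(months)


# Figures quoted in the talk: 41% break even rate, 99% usage.
print(break_even_month(0.41, 0.99))

# A box that is only on 30% of the time never reaches a 41% break even.
print(break_even_month(0.41, 0.30))
```

Under those assumptions, the 99% usage and 41% break even figures quoted earlier work out to a break even in month 5, which lines up with the cash flow comparison described above.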
We also want to dig back into that Red line. In this case, we want to dig a little deeper. Not just the total cost, but we want to get into what specific kinds of instances. We’d want to get deeper into the AZ, OS, all of that. We want to get into what we’re not using, so we can get into strategies for how we want to start using that waste.
You may, though, run into this: “Wait a minute. Last month, we did all the work, we made a good RI buy, and the bill didn’t go down. What happened?” This happened to our friends at ACME Corp: they did tons of work, tons of analysis, everyone signed on, looks good, made the buy, and the bill went up significantly the next month. They were a little concerned, obviously. The thing they had to remember, and that we started to dig in on, is that an effective RI purchase drives the hourly rates down, but if you increase your usage dramatically over that time, it’s going to overshadow the savings.
This seems kind of obvious if you think about it. “If I use more, the rate’s going to get overtaken” but it’s hard to explain that to your VP all the time after a purchase. So it’s really good to keep that in mind–you’ve got usage, which affects your ultimate bill, and you’ve got rate which affects your ultimate bill, and they’re very different things and you want to measure them separately.
The way we can measure a rate–what we built for these guys after that purchase, and this is a very in depth report with a bunch of filters–but what we basically said is we’re looking at a specific instance type in a certain region, and a certain operating system, and that was the one we bought. We bought these RIs, and we’re comparing there, on the top right, 2 date ranges, the date before the RI purchases vs. after the purchases. You can see that even though they used 6% more on the left here, the usage hours during that period, hours of that type, their spending actually dropped by 26%. We pulled this report together, we saved a bunch of these for their RI purchases, then they were able to go to the VP and say “look, this purchase, even though our total bill went up, was actually very effective. We should make more of these.”
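Separating rate from usage is a one-line calculation: spend is hours times effective hourly rate, so the change in rate falls out of the changes in the other two. This is a minimal sketch using the two percentages quoted above; the function name `rate_change` is hypothetical.

```python
def rate_change(usage_change, spend_change):
    """Fractional change in effective hourly rate between two periods.

    spend = hours * rate, so
    (1 + spend_change) = (1 + usage_change) * (1 + rate_change).
    """
    return (1 + spend_change) / (1 + usage_change) - 1


# From the report described above: 6% more hours, 26% less spend.
change = rate_change(usage_change=0.06, spend_change=-0.26)
print(f"{change:.1%}")
```

So for that instance type the effective hourly rate fell by roughly 30%, a bit more than the 26% drop in spend, because usage grew at the same time.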
We figured out, after the buy, we did some measurement, we see some things went up in the Green line, hopefully the Red line is not too big–we want to then start taking actions to adjust our next purchase, and to adjust our previous purchases to be more effective. The first, and probably the most important action–in my opinion this is the most powerful and also simple thing you can do to save money on Amazon–is making RI “modifications”. Anyone making RI modifications? A few people? Great.
Everybody who put their hand up saying they bought RIs should be putting their hand up for that one too. Because RI modifications are free, you can make them as many times as you want, and they basically let you start using the wasted hours that are unusable because they’re the wrong instance size or in the wrong AZ: you apply them to the on-demand hours you’re actually running and make sure you’re actually saving money on those.
The way you do that is you can modify within a family, which means that if you bought M3s, you bought a 2xlarge, there’s a point system attached, so a 2xlarge is actually 16 points, you can actually split that into two single xlarges, which are each worth 8 points, add up to 16 points. If you look at the Amazon pricing, every size, small, medium, large, is doubling in every case. So you can see how this can trickle down; xlarges go to larges, larges go to mediums, to smalls, and so on and so forth. These also go back the other way; you can take two smalls and make them a medium, two mediums, make them a large.
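The point system described above can be sketched as a simple check. This is an illustrative model only, not an AWS API: the talk gives xlarge = 8 and 2xlarge = 16, and the remaining values are filled in from the "each size doubles" rule, which is an assumption here. The function names are hypothetical.

```python
# Points per size. The talk gives xlarge = 8 and 2xlarge = 16; the other
# values are filled in by the "each size doubles" rule (an assumption here).
SIZE_POINTS = {"small": 1, "medium": 2, "large": 4, "xlarge": 8, "2xlarge": 16}


def total_points(instances: dict[str, int]) -> int:
    """Sum the points for a mapping of size -> instance count."""
    return sum(SIZE_POINTS[size] * count for size, count in instances.items())


def is_valid_modification(current: dict[str, int], target: dict[str, int]) -> bool:
    """Model a same-family modification as valid when total points match."""
    return total_points(current) == total_points(target)


# One 2xlarge split into two xlarges: 16 points on both sides.
print(is_valid_modification({"2xlarge": 1}, {"xlarge": 2}))
# Two smalls combined into one medium: 2 points on both sides.
print(is_valid_modification({"small": 2}, {"medium": 1}))
# One 2xlarge into one xlarge plus one large: 16 vs. 12 points.
print(is_valid_modification({"2xlarge": 1}, {"xlarge": 1, "large": 1}))
```

The first two checks pass and the third fails, since the points on each side have to balance exactly.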
This really de-risks purchasing within a family. If you know you’re going to be using M3, you don’t need to know what size you’re using to make the RI buy confidently, you just need to know the family.
The other way you can modify is within a region. So if you buy RIs in US-east-1a, you can then modify into 1b, 1c, or 1d. You can’t, however, modify into 2a, or into west, or another region entirely. It’s just within the number there, so 1, abc, 2, abc, etc.
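As a rough sketch of that rule, the check below treats an AZ name as a region name plus a single trailing letter, which is an assumption about the naming pattern. The helper names are hypothetical and nothing here calls AWS.

```python
def region_of(az: str) -> str:
    """Strip the trailing zone letter: 'us-east-1a' -> 'us-east-1'."""
    if len(az) < 2 or not az[-1].isalpha() or not az[-2].isdigit():
        raise ValueError(f"unexpected AZ name: {az!r}")
    return az[:-1]


def can_move_between(source_az: str, target_az: str) -> bool:
    """Model an AZ modification as allowed only within the same region."""
    return region_of(source_az) == region_of(target_az)


print(can_move_between("us-east-1a", "us-east-1c"))  # same region
print(can_move_between("us-east-1a", "us-east-2a"))  # different number
print(can_move_between("us-east-1a", "us-west-2a"))  # different region entirely
```

Only the first pair shares a region, so only that move is allowed under this model.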
If you’re using Cloudability, within there there’s going to be a little button at the top that says “let’s do modifications before we do purchases” which is kind of the way you want to do things, because if you have RIs and you’ve already paid for them, you want to make sure you’re using them before you buy more.
If you click that button, you’re going to see a list of recommendations that look like this. On the left side, with just the two rows, we basically have RIs that we own, that we’re not really fully utilizing. They’re being unused, and on the other side, we have instances that are running, that are being used, that we could modify our own RI into. So on the first one, C3, extra large, could be split into additional C3 extra larges in different AZs, and let’s put a 2xlarge in there as well. In the next one we’re going to do a similar thing, move them to different AZs, and combine them together. This is easy to do, you can make these changes and it’s literally 5 or 6 clicks in the Amazon console. Doesn’t require an engineer, doesn’t require anyone with training or certification; you can just log in and make the changes.
The reason that these are impactful, besides the fact that they’re free and easy and all that, is that the savings is twofold. So on the left, you’ve got unused RIs–in this case, this unused one is costing $744 each month, you’re paying for that every month, and on the other side you have unnecessary on-demand charges, that are running that don’t have to be. So this single modification could save $2,800 a month with 6 clicks by anybody with access to the Amazon billing console. So very powerful stuff, and very easy.
You want to look for opportunities as you’re making your RI purchases to increase modification opportunities. Modifications make RI purchasing a lot less risky, and there are some ways you can ensure they happen more often in the future. When you’re doing planning, and again this is a filter in Cloudability, you can build it different ways you want. This one is looking at all our R3 types, also within US-west-2, and then just at RHEL. You can’t change the OS. Everything within this type of list is going to be a much safer buy because it’s interchangeable, you can move it around.
You want to start kind of dialing back and saying “I know you buy RIs for the AZ and I know you buy them for a specific type, but let’s think about our planning process for a specific grouping. Let’s look at the ones that have the most flexibility down the road in order to reduce risk when we buy.”
That’s the basic process. There’s a lot more small fine tuning steps in there, but at a high level that’s what you want to go through. You want to minimize the time through the loop, and also repeat it as often as possible. When we started talking about RIs for the first time, in 2013 at re:Invent, everyone was just kind of buying yearly or not at all, and that was okay, because people weren’t as big as they are now, and they weren’t moving things around as much, but ideally what you want to get into is a monthly process, or better. We know companies that do this weekly, we’ve seen some companies that do modifications and purchases multiple times a week. Monthly is probably fine for you, if you’re not using processes that are extremely elastic.
The process generally looks something like this. You’re going to do modifications, easy to do, get those out of the way. Then you’re going to pull in the recommendations from our tool, to basically say “what should we buy now that we’ve made the modifications?” Have a day or two to confer with the various stakeholders, both on the tech and business side, to confirm that these purchases make sense, then make the purchase.
This schedule obviously can be any day of the month. I do recommend picking a day, picking a period, getting it on the calendar, because (as anybody who has bought RIs has found) it’s really hard to remember to do it frequently enough. You’re worried about keeping your infrastructure up, and if you don’t buy RIs, nothing bad happens, nothing falls over, but next month the bill is going to be higher than it needs to be, so scheduling is important.
In the process, this is how it’s broken down. This is how the company we talked about earlier, ACME Corp, does it. The RI Czar is monitoring billing data, making recommendations for RI buys. Then there’s a Center of Excellence, a COE, a cloud COE team that meets every month, on a scheduled timeframe, that’s cross functional. It includes finance representatives, it includes product / business owners, it includes folks from the tech org. They all get together to review the recommendations, and also look at the efficacy of past purchases. Because you want to bring everybody along for the ride. You may have a month when you buy a bunch of RIs and the bill does not go down. It’s really hard to back all this learning into everybody else, and even, ideally, you can get your VP or whoever’s breathing down your neck on this to be along for the ride on this as well, because everybody’s got to understand how this process works, because they all have a hand in it. So as much as possible, getting everybody into a room is going to be handy.
The tech team, then, is going to have a look at the recommendations to say “okay, you guys are saying we want to spend $100,000 this month on RIs, R3s in US-west-1, but we’ve just decided to stop using R3s. We’re not going to buy those.” That’s distinct from “we’re going to start using R3s, let’s go buy them.” We do want to have the tech team say “we’re not going to use R3s, stop buying them” because we want to hit the break even point.
At that point, then, everybody’s agreed on what to buy. Someone in some part of the business org–finance, procurement, vendor management, depending on the organization, has to decide how they want to deploy the money. Do you want to pay it all up front, some up front, or no up front? And then, the same person, ideally each and every month, this RI Czar, executes that buy, and executes the modifications on an ongoing basis. And that’s the process.
A few things to leave you guys with after re:Invent. RIs are frequently misunderstood. I’ve met really smart people who run really great companies who don’t understand how these things work. There’s a lot of detail, there’s a lot of nuance. Take time to educate everyone on your team on the fundamentals. There are tons of talks at re:Invent, there’s a bunch of stuff on our blog, blog.cloudability.com. Learn the basics, get everybody trained up, it’ll make your life easier down the road.
RI coverage changes constantly. Your Red line and Green line, even if you have monthly meetings, are moving all over the place during the month. So, even if you’re not at a place where you can track every day, every week, just start looking at that. Get those reports pulled up, know where you stand, so when you get the bill every month you know if you’ve done better or worse.
Again, modifications: powerful and free. Get started with those. I highly recommend them; they are the easiest way to save money on Amazon without touching your infrastructure.
It’s really easy to get distracted. RIs are not on the top of most people’s mind, so get somebody in place whose job it is, and make sure that’s scheduled out, ideally those monthly meetings, well in advance, and get all the stakeholders there.
And then manage iteratively. You want to make small, uncontroversial purchases with those future modifications in mind. So, “what can we buy now that’s safe and then later we can modify?”
And, finally, because we’re in Vegas, and it’s been a long week, drink lots of water, please.
Want to learn more about saving with Reserved Instances? Check out our Reserved Instance blog series, then log in or sign up for a free 14-day trial of Cloudability to manage your portfolio with the Reserved Instance Planner today.