There was another AWS outage this weekend. Did you hear about it? Your boss probably did. In fact it was the second outage inside a month and took out some pretty big-name services like Netflix, Instagram and Pinterest … your boss probably heard about that too.
Now it’s the Monday after, and a lot of AWS advocates in IT and ops departments like yours are afraid of getting thrown under the bus. Someone in your management is bound to be a frightening combination of confused and furious about this whole debacle, especially if your company suffered another outage. If you’re someone who’s pushed for the cloud, they’re going to have some questions for you and you’re going to need to have some answers.
Fear not. We’ve been fielding a lot of questions ourselves over the past couple days, and we’ve put together the top five along with some answers that might help quell the fear and put the recent AWS outages in perspective.
Keep in mind, this post is not meant to condone or endorse any particular configuration, and in particular it is not meant as a defense of Amazon. We’re impartial here, people!
First things first, what the heck happened?
The most recent outages were caused by two power-related incidents within weeks of each other that both took out parts of AWS’s US-East-1 region.
As US-East-1 is the original AWS region, a lot of stuff was deployed there and never moved to one of the newer ones. So any problem with it impacts a lot of services.
It’s inexcusable, who are they going to fire?
Before you rush to judgment, we wouldn’t be surprised if the first failure resulted in the buildings losing some capability, maybe due to emergency upgrades to prevent a recurrence. It would be ironic if this were true.
But if this isn’t the case, there was either a design error, human error in the operation of the facilities, or something so extreme that nobody would plan for it.
In any case, Amazon should start to get a lot more transparent about this problem and their capabilities, or risk being seen as another “Roach Motel” (thank you, Larry).
This never happened before the Cloud! Should we go back?
Actually it did. Almost every company outside of health and nuclear power stations has to have a serious operational issue or three before they are ready to make the serious investment required to develop full resilience to losing a whole data center. Thanks to the cloud, companies can grow a lot faster (I’m looking at you Instagram, Pinterest, etc.) before they have their The-Sky-Is-Falling Day, and so their failures are a lot more visible than before.
Why are people still in only one region?
It’s hard to just up sticks and leave, even to another Amazon region. And who says it won’t happen there? Plus many apps are hard to run in multiple locations; it’s easier to have a primary running in one with a backup ready to go in another. But even that takes time and care, and adds to the overhead each time you deploy your software. Until these failures happened, many companies eschewed the investment and opportunity risk required to do this work.
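For the curious, here is a minimal sketch of the primary-plus-backup idea, just to show the decision at its heart. It assumes you already have some way of probing a region's health; the region names and the `choose_region` helper are hypothetical, made up for illustration, and a real setup would also have to deal with data replication, DNS and deploys.

```python
from typing import Callable, Optional

# Hypothetical region names; swap in whatever your deployment actually uses.
PRIMARY = "us-east-1"
BACKUP = "us-west-2"


def choose_region(
    is_healthy: Callable[[str], bool],
    primary: str = PRIMARY,
    backup: str = BACKUP,
) -> Optional[str]:
    """Return the region that should serve traffic, or None if both are down."""
    # Prefer the primary whenever it reports healthy.
    if is_healthy(primary):
        return primary
    # Otherwise fall back to the standby region, if it is up.
    if is_healthy(backup):
        return backup
    # Both regions are failing their health checks.
    return None


if __name__ == "__main__":
    # Simulated health data standing in for a real probe (an assumption).
    status = {"us-east-1": False, "us-west-2": True}
    print(choose_region(lambda region: status.get(region, False)))
```

The logic is the easy part. The time, care and overhead go into keeping the backup region deployed, current and actually ready to take traffic.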
Well something must be done! Right?
You can be sure the bigger services are discussing that this morning. Even taking into account the EBS failure in April 2011, there was still a good business case to be made for not making this a number 1 priority. That must now change. It might take some time though. We’d expect folks like Heroku to already be running multi-region if it was trivial or cheap, and they aren’t, so it probably isn’t.
So, what do I tell the CEO that we’re doing about it?
You have five choices:
We don’t recommend 1 or 2. We think everyone should work across multiple clouds; it’s just common sense. If you put all your eggs in one basket you’re going to get yolk on your face. Not to mention, any options that end in an exclamation point are probably reactionary.
We think your best choices are options 3 through 5, in order of effectiveness and expense. If you’re doing none of these currently, we recommend you start at #5 and go up.
Whatever you do, don’t do nothing. At the very least, figure out how much your preferred option costs, send it to your CEO and keep their rejection email in your safe for the next time the cloud goes down because you’ll need it to cover your *aas.
7/2 UPDATE – Amazon has released an in-depth post-mortem. Quick version: a generator that was tested recently failed to come up in time. Read the report for more details.