“The Cloud is Down”: Single Points of Failure
The Cloud is down this morning and it’s all over the news. Amazon is having problems in one of their regions and of course it’s the most popular one, US-East in Virginia.
Inevitably, pundits all over the web are nodding sagely and opining that yes, the Cloud is all well and good, but you see, this is the problem: when it goes down, you're toast. This line from an article on Business Insider sums up the general tone best:
“…this episode shows the advantages and pitfalls of ‘the cloud.’ On the plus side, it lets startups scale up their infrastructure much more cheaply and efficiently. On the down side, they have to rely on an outside party for the most crucial part of their service — staying up.”
What… utter… nonsense.
Pick any big product that's not run on the Cloud, such as eBay. Imagine it was run out of one location (stay with me). One day the network connections to that location are back-hoed by a utility company fixing a pipe in the street outside. The service is offline for 2-3 days. Uh-oh. Following the earlier reasoning, we might say something like:
“…this episode shows the advantages and pitfalls of ‘the data center.’ On the plus side, it lets you run hardware heavily optimized for your application. On the down side, you have to rely on an outside party to build armored cabling that can resist mechanical diggers — or risk going off the air for days at a time.”
No, nobody would say that. The technoscenti would say, quite rightly, that eBay should have planned for exactly this contingency and had a backup facility somewhere.
The Cloud is no different; reliability requires common sense, planning, and redundancy, just like any other web facility.
So when one cloud is down, why are so many well-known sites affected?
First, the mix of very young versus more established companies is very different on the Cloud than elsewhere. Younger companies have had less time to develop and implement contingency plans. The lack of continuity planning by newer companies is no different today than it was 10 years ago, or will be in 10 years' time.
But mostly it's because the Cloud has been so darned reliable. It's way more reliable than 9 out of 10 internal facilities (and I'm probably being charitable there). In some cases it might still be OK not to have a backup even after today's issues: 4 hours of downtime per year is just over 99.95% uptime, which is good enough for many a young startup.
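That back-of-the-envelope figure is easy to verify. A minimal sketch (the 4-hour figure is just the example used above):

```python
# Uptime as a percentage, given total downtime hours over a year.
HOURS_PER_YEAR = 365 * 24  # 8760

def uptime_percent(downtime_hours: float) -> float:
    """uptime % = 100 * (1 - downtime / total hours in the period)"""
    return 100.0 * (1 - downtime_hours / HOURS_PER_YEAR)

print(round(uptime_percent(4), 3))  # 4 hours down per year -> 99.954
```

The same function answers the inverse question, e.g. how much annual downtime a given number of nines actually permits.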
It must be said that since the Cloud makes a backup facility much, much cheaper than ever before, today's downtime was avoidable for most of the affected sites. Deployment technologies such as Puppet, Chef, Capistrano, et al., plus cost-effective multi-location database setups (such as nightly backups to a second availability zone or cloud provider), bring a full backup option within the price range of even the leanest of startups.
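A nightly off-region backup of the kind described above can be as simple as two commands: dump the database, then ship the dump somewhere outside the primary location. A minimal sketch (the database name `appdb`, bucket `example-backups`, and region `us-west-2` are hypothetical, and the stack here is assumed to be PostgreSQL plus S3):

```python
def backup_commands(db: str, bucket: str, region: str, stamp: str):
    """Build the two shell commands for one nightly backup run:
    a compressed pg_dump, then a copy to a bucket in another region."""
    dump_file = f"{db}-{stamp}.dump"
    return [
        # PostgreSQL compressed custom-format dump of the database.
        ["pg_dump", "-Fc", "-f", dump_file, db],
        # Ship the dump to a bucket outside the primary region.
        ["aws", "s3", "cp", dump_file,
         f"s3://{bucket}/{dump_file}", "--region", region],
    ]

cmds = backup_commands("appdb", "example-backups", "us-west-2", "2011-04-21")
# Each command would be run in order, e.g. subprocess.run(cmd, check=True),
# typically from a nightly cron job.
```

The point is not the specific tools but the shape: the backup must land somewhere that doesn't share the primary site's failure modes.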
Reddit demonstrated this rather nicely. By the time we read the Business Insider article above, the EC2 issues were still ongoing, yet their site was back up and running, albeit with the same issues I experience on Reddit every day.
EDIT: As some commenters have pointed out, you can't post to Reddit, but you can read. Although this sucks, it demonstrates that you can have a contingency plan without building out your whole infrastructure twice.
But I doubt that will stop the same uninformed commentary feeding the usual fear of the unknown that accompanies all new technology. At least everyone agrees the Cloud won’t change that any time soon…
UPDATE (3.31p): Josh Haberman rightly pointed out that even multi-AZ RDS instances in US-East were affected, and points us to Amazon encouraging folks to use multi-AZ RDS to avoid just this issue. We've updated the article to reflect this. Our point stands: engineers should consider all likely failure scenarios when building redundancy, and not assume anyone – even Amazon – can provide 100.0% uptime. Thanks Josh.