Virtual Block Storage Crashed Your Cloud Again :(

You know it's bad when you start writing an incident report with the words "The first 12 hours." You know you need a stiff drink, possibly a career change, when you follow that up with phrases like "this was going to be a lengthy outage...", "the next 48 hours...", and "as much as 3 days".

That's what happened to huge companies like Netflix, Heroku, Reddit, Hootsuite, Foursquare, Quora, and Imgur the week of April 21, 2011. AWS went down for over 80 hours, leaving them and others up a creek without a paddle. The root cause of this cloud-tastrophe echoed loud and clear. Heroku said:

rant

Ynet on AWS. Let's hope we don't have to test their limits.

In Israel, more than in most places, no news is good news. Ynet, one of the largest news sites in Israel, recently posted a case study (at the bottom of this article) on handling large loads by moving their notification services to AWS.

"We used EC2, Elastic Load Balancers, and EBS... Us as an enterprise, we need something stable..."

In my opinion, they are contradicting themselves. EBS and Elastic Load Balancers (ELB) are the two AWS services that fail the most and fail the hardest, with multiple outages spanning multiple days each.

EBS: Conceptually flawed, prone to cascading failures

rant