In an open retrospective, Eden Shochat of Aleph VC asks what they could do better to help the companies they invest in. I really appreciate the openness of the Aleph VC blog, and if they are willing to ask what they could do better, I'm willing to make some suggestions.
You know it's bad when you start writing an incident report with the words "The first 12 hours." You know you need a stiff drink, possibly a career change, when you follow that up with phrases like "this was going to be a lengthy outage...", "the next 48 hours...", and "as much as 3 days".
That's what happened to huge companies like Netflix, Heroku, Reddit, Hootsuite, Foursquare, Quora, and Imgur the week of April 21, 2011. Amazon's AWS went down for over 80 hours, leaving them and others up a creek without a paddle. The root cause of this cloud-tastrophe echoed loud and clear. Heroku said:
In Israel, more than in most places, no news is good news. Ynet, one of the largest news sites in Israel, recently posted a case study (at the bottom of this article) on handling large loads by moving their notification services to AWS.
"We used EC2, Elastic Load Balancers, and EBS... Us as an enterprise, we need something stable..."
In my opinion, they are contradicting themselves. EBS and Elastic Load Balancers (ELB) are the two AWS services that fail most often and fail hardest, with multiple outages spanning multiple days each.
EBS: Conceptually flawed, prone to cascading failures
You know those really annoying popup messages you get on your phone when you're browsing? They aren't easy to ignore like popups on your desktop. They are really, get in your face, make it hard to see the site you wanted to see, annoying. Well, I confess. I helped make those and I'm really sorry.
It was maybe two years ago, and someone came to me with a new gimmick (that's really all of AdTech summarized in one word: gimmick). It wasn't the first time. I'd done a lot of work building affiliate marketing programs and ad servers. It was, however, possibly the most evil thing I have ever done, and I apologize.
As I originally blogged, I was hoping to use EMC snapshots to perform server-less/network-less backups. EMC provides two main tools for managing snapshots in this type of situation:
EMC Replication Manager
EMC PowerSnap NetWorker Module
The PowerSnap Module supposedly automates taking snapshots for the purpose of backups, while Replication Manager supposedly provides a much more robust package.
With Replication Manager you might create a policy to take a snapshot every five minutes, keep the last 10, and use those for backups whenever necessary.
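To make that kind of policy concrete, here is a minimal sketch of the rotation logic in Python. It is not Replication Manager's API or configuration format; the class and method names are made up for illustration, and it only tracks timestamps rather than touching any storage array.

```python
from collections import deque
from datetime import datetime, timedelta

# Values taken from the example policy: every five minutes, keep the last 10.
SNAPSHOT_INTERVAL = timedelta(minutes=5)
KEEP_LAST = 10


class SnapshotRotation:
    """Hypothetical bookkeeping for a 'snapshot every N, keep last M' policy."""

    def __init__(self, interval=SNAPSHOT_INTERVAL, keep_last=KEEP_LAST):
        self.interval = interval
        # A bounded deque drops the oldest entry once keep_last is exceeded.
        self.snapshots = deque(maxlen=keep_last)

    def is_due(self, now):
        """True if no snapshot exists yet or the interval has elapsed."""
        return not self.snapshots or now - self.snapshots[-1] >= self.interval

    def record(self, now):
        """Record a snapshot at `now` if one is due; return whether it was taken."""
        if not self.is_due(now):
            return False
        self.snapshots.append(now)
        return True

    def latest_for_backup(self):
        """Return the newest retained snapshot time, or None if there is none."""
        return self.snapshots[-1] if self.snapshots else None


# Simulate one hour of one-minute scheduler ticks.
start = datetime(2011, 1, 1, 0, 0)
rotation = SnapshotRotation()
for minute in range(61):
    rotation.record(start + timedelta(minutes=minute))
```

After the simulated hour, 13 snapshots have been taken but only the 10 most recent are retained, and a backup job would read from whatever `latest_for_backup()` returns.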
To make a long story short, Replication Manager is useless for LUNs with ZFS. According to EMC, this won't change in the near future. PowerSnap also has no support for taking snapshots of LUNs with ZFS on them, so basically EMC has no server-less backup offering for Solaris with ZFS.