My Friday Night With AWS

inopinatus · on July 1, 2012

In most contexts, Disaster Recovery is not the same as High Availability is not the same as Fault Tolerance.

So, in this context, if your devops crew is on the ball, then the first warning in this article:

The only way to ensure close to 100% up time is replicating your entire infrastructure. Infrastructure costs will more than double ...

is mercifully untrue in the majority of cases.

Why? Because unless the major component of your infrastructure cost is storage, or your Recovery Point Objective (RPO) is zero, database log shipping and bulk data sync to another region aren't all that expensive.

The author may be assuming that you'd need to have the VMs ready to go at the standby region. This isn't true, not when you can boot a large application cluster and promote/upgrade a database replica in minutes. For the majority of businesses, a realistic Recovery Time Objective (RTO) is on the order of minutes to hours, so this is fine.

I built this recently. A booking system for an airline. Works as intended. Failover time is under five minutes. What enables this is repeatable deployment, which is an outcome of careful tooling. The application itself was developed by an agile, TDD-centric team, which made for an easily transplanted app.

mechanical_fish · on July 1, 2012

This isn't true, not when you can boot a large application cluster and promote/upgrade a database replica in minutes.

This is correct. Unfortunately, in the AWS context those of us who confidently planned to react to trouble in zone US-EAST-X by launching a new cluster in US-EAST-Y have often been frustrated by failures in Amazon's control plane. When one zone goes down hard, instances in other zones generally stay running - though the anecdotal evidence is not perfectly clear - but the ability to spin up new instances or drives in other zones often breaks.

What I don't remember seeing is a case where a failure spanned regions. Which does not mean that can't happen; perhaps a catastrophic latent bug in Amazon's control software could kick it off. But it is far less likely. So planning to spin up in a distant region is a workable plan, and then your point is back to being correct. But spanning regions takes a bit more work and planning.

And, needless to say, everyone's RTO is different. When people talk casually about "100% uptime" I tend to think "30 seconds", in which case redundant running instances are the solution, but obviously you do have to specify it. Because if you can live with 5 minutes, or even 30 minutes, as most apps probably can, your life will be much easier.
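To make the RTO/RPO trade-off discussed so far concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the names `DrPlan` and `choose_dr_plan` are hypothetical, and the default of 300 seconds simply mirrors the "under five minutes" failover figure quoted upthread. Measure your own failover time rather than trusting that number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrPlan:
    compute: str  # "hot-standby" or "boot-on-demand"
    data: str     # "synchronous-replication" or "log-shipping"


def choose_dr_plan(
    rto_seconds: float,
    rpo_seconds: float,
    measured_failover_seconds: float = 300.0,  # assumption: ~5 min cold boot + replica promotion
) -> DrPlan:
    """Pick a standby strategy from recovery objectives (illustrative only)."""
    # If the business can wait at least as long as a cold failover takes,
    # there is no need to keep VMs running in the standby region.
    if rto_seconds >= measured_failover_seconds:
        compute = "boot-on-demand"
    else:
        compute = "hot-standby"

    # An RPO of zero means no committed data may be lost, which rules out
    # asynchronous log shipping.
    if rpo_seconds <= 0:
        data = "synchronous-replication"
    else:
        data = "log-shipping"

    return DrPlan(compute=compute, data=data)
```

With these assumed defaults, an RTO of 30 seconds selects redundant running instances, while an RTO of 30 minutes with a nonzero RPO selects boot-on-demand plus log shipping, which is the cheap case described above.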

inopinatus · on July 1, 2012

Well, er, good, fortunately my point was always in favour of spanning regions, not zones, so I refer you back to my opening note about not conflating disaster recovery (i.e. "we lost the DC, now what?") with high availability ("something within the DC is down, shouldn't affect the application").

Because in that analysis, the AZs have never represented an isolated unit of availability. They are clearly locally interdependent and/or sharing infrastructure. Heck, not sharing a continental plate is pretty much my #1 criterion for a DR replica.

NB: choosing to deploy in us-east just says to me "I want to save bucks, I don't care that it's got by far the worst availability track record".

alanh · on July 1, 2012

While I appreciate anyone taking the time to share their thoughts, I also find it very distracting that nearly every sentence contains some sort of grammatical, orthographic, or structural error.

Does this make me the grammar police, or do I have a valid complaint?

Update. Putting my time where my mouth is: Next time someone has a real time crunch (as Coe notes at the end) but wants to publish a helpful post in a timely manner, contact me with a draft or CMS credentials and I’ll take at least a quick look. Expect no miracles, but I will catch obvious errors.

I also keep wishing I could send pull requests to bloggers with suggested edits.

nothacker · on July 1, 2012

Redundancy wasn't the problem I saw last night. What I saw, at least with Heroku, is that when I checked, the main Heroku site was down and displaying things like nginx errors. That to me is unacceptable for an operation such as theirs. Even if all hell is breaking loose, you don't just keep your status page up for all to see, you have a pretty damn good message up that the main page resolves to. I'm not saying they screwed the pooch entirely as I'm sure they were busy, but, damn it, even Amazon is going to go down sometimes. Screw redundancy if you can't even serve a webpage to inspire confidence that you are working on it. I'm sorry I'm picking on Heroku specifically, because I'd be really f'n surprised if a lot of you weren't in the same boat. You need to have the main page served when that happens, even if it's just a static page that inspires confidence or redirects to the blog, with updates provided there.
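As a rough illustration of the "just a static page" idea, here is a minimal sketch using only the Python standard library. It is not how Heroku or anyone else does it. The status URL, the page wording, the port, and the choice of a 503 response with a `Retry-After` header are all assumptions made for the example. The point is only that such a fallback can be tiny and can run somewhere independent of the affected infrastructure.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical status page URL; replace with your own.
STATUS_URL = "https://status.example.com"

PAGE = (
    "<!doctype html><html><head><title>We're on it</title></head><body>"
    "<h1>We are experiencing an outage</h1>"
    "<p>Our team is working on it right now.</p>"
    f'<p>Live updates: <a href="{STATUS_URL}">{STATUS_URL}</a></p>'
    "</body></html>"
).encode("utf-8")


class MaintenanceHandler(BaseHTTPRequestHandler):
    """Answer every GET with the same static outage page."""

    def do_GET(self):
        # 503 tells browsers and crawlers the outage is temporary.
        self.send_response(503)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.send_header("Retry-After", "300")  # assumption: suggest retrying in 5 minutes
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        # Silence per-request logging for this sketch.
        pass


def make_server(port: int = 8080) -> HTTPServer:
    """Build the fallback server; call .serve_forever() on the result to run it."""
    return HTTPServer(("", port), MaintenanceHandler)
```

To run it, call `make_server(8080).serve_forever()` on a host outside the failed region and point the main hostname at it while the outage lasts.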

BenjaminCoe · on June 30, 2012

The first indicator that it was going to be a long Friday night was our EC2 hosted Minecraft server tipping over. Nagios alerts followed. This is a brain dump of some of my thoughts about AWS, and a discussion of how we got back on-line quickly.

benatkin · on July 1, 2012

Great post. I'm considering backing up the latest to another blobstore like Rackspace Cloud Files as well, so I can tell people that I don't depend on a single vendor.

talonx · on July 2, 2012

A sober post with good advice compared to most of the rants about the outage that are now on HN.

It's also telling about our need for sensationalism that those rants have more comments than this article!