Posts Tagged Design for Failure
I barely noticed the SimpleDB outage that occurred on 13 June and lasted for a couple of hours. The lack of outcry on the interwebs is an indication that not many people use SimpleDB – unlike when EBS fell over in April 2011 and knocked over most of the startup world.
When architecting cloud solutions, you build things up based on assumptions about the availability and performance of certain components. Even though we have to design for failure, we also have to factor in the likelihood of failure when coming up with solutions. We are quite happy, for example, to assume that our web servers will fall over (or terminate when scaled down), but we assume the load balancer is far less likely to fail, and we build accordingly. Can you imagine how much more difficult it would be to build an application where fundamental components, such as the DNS server or load balancer, were unreliable?
SimpleDB is one of those components that should be more reliable than anything you can build yourself on EC2. On a recent project we placed logging and error information in SimpleDB (and S3), despite having both RDS MySQL and MongoDB available. The argument was that logging to MongoDB or MySQL doesn’t help if the problem is a database connection; besides, SimpleDB has a RESTful interface, so even if the entire set of application components were down, SimpleDB could still be queried. This decision also rested on the assumption that SimpleDB would be highly available with no design or operational overhead on our part. Does that assumption need to be re-assessed?
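To show why the RESTful interface matters for logging, here is a minimal sketch of building the query parameters for a SimpleDB PutAttributes call. The domain name `app-logs` and the item values are hypothetical, and a real request must also carry `AWSAccessKeyId`, `Timestamp` and a `Signature` (the signing step is omitted here); the point is that the logger needs nothing but an HTTP client, no database driver or connection pool that could itself be down.

```python
from urllib.parse import urlencode

def build_put_attributes(domain, item_name, attributes):
    """Build the query parameters for a SimpleDB PutAttributes request.

    SimpleDB is driven entirely by (signed) HTTP requests, so even when
    the rest of the application stack is down, anything that can make an
    HTTP call can still write to or read from it.
    """
    params = {
        "Action": "PutAttributes",
        "Version": "2009-04-15",
        "DomainName": domain,
        "ItemName": item_name,
    }
    # SimpleDB takes attributes as numbered Attribute.N.Name/Value pairs.
    for i, (name, value) in enumerate(sorted(attributes.items()), start=1):
        params[f"Attribute.{i}.Name"] = name
        params[f"Attribute.{i}.Value"] = value
    return params

# 'app-logs' and the error item below are made up for this sketch.
query = build_put_attributes(
    "app-logs", "error-20110613-0001",
    {"level": "ERROR", "message": "db connection refused"},
)
print(urlencode(query))
```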
The AWS outages are becoming worryingly frequent. Without getting into the full statistics of the Mean Time Between Failures of components in series, the frequent failure of individual components (EBS in April, SimpleDB in June) means that the more components you depend on, the more likely a failure becomes. So while EBS or SimpleDB may not fail again for a while, if you have a dependency on the next component to fail, whatever that may be, the mean time between failures for your application as a whole starts to look rather low.
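The series argument can be made concrete with a back-of-the-envelope calculation. The 99.95% figure below is an assumed illustration, not a published AWS number: components in series all have to be up, so their availabilities multiply, and a handful of individually solid dependencies compounds into a surprising amount of expected downtime.

```python
def series_availability(availabilities):
    """Overall availability of components that must all be up (in series)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

# Five dependencies at an assumed 99.95% each gives roughly 99.75%
# overall -- about 22 hours of expected downtime a year, even though
# each component alone looks rock solid.
overall = series_availability([0.9995] * 5)
print(f"{overall:.4%} available")
print(f"{(1 - overall) * 365 * 24:.1f} hours/year of expected downtime")
```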
I hope that AWS continues their openness in (eventually) communicating their outages, but the frequency is worrying. Is AWS succumbing to their own success and unable to handle the demand? Is there a fundamental flaw with their architecture that is showing cracks under load? Are their processes unable to scale, so they aren’t able to monitor problems in order to put remedial actions in place before failure occurs?
Let’s hope that they can sort things out. AWS is not necessarily any more or less reliable than their competitors and is definitely more reliable than your average enterprise data centre. But bad news is bad news: we don’t want to design for too much failure, nor should we be feeding the private cloud trolls.
The web is alive with stories and tweets about the AWS failure that occurred today, and for those websites that run in the US-East region the downtime is, to some degree, unacceptable. What makes this failure particularly bad is that it spanned multiple Availability Zones (AZs); the whole idea of building a solution that spans multiple AZs is that a (complete) failure in one AZ should not affect your application, as it can fail over to another.
There is no definitive manual for building highly available solutions, but if there were, ‘Design for Failure’ would definitely be one of its cardinal rules. So, if you have a business that is solely reliant on infrastructure (machines, network, telecoms) being available in a specific region, then I suggest you RTFM (the high availability manual, that is).
While the AWS evangelists and fanbois are keeping quiet, probably so as not to attract attention, or because they are frantically trying to keep customers happy, now is the time to step up and help people understand what high availability means. If you have done your threat and cost modelling, you may have found that spanning multiple regions to mitigate the risk of an occasional failure is not worth the cost or effort. Perhaps you trust that Ireland (for the EU endpoint) is not going to fall into the sea or have its cable cut – or at least the likelihood is so low that you don’t care. While the AWS outage is probably due to someone doing something stupid, and is not nearly as sinister as a complete East Coast telecoms failure, you have to factor the failure into your plans.
There is a case for interoperability and the ability to magically move to the best provider as part of designing for failure (as the #cloudfoundry tweets are saying), but it comes at a cost. Even once you have a highly interoperable architecture, there is still the cost of running multiple nodes around the world across multiple suppliers, and the huge cost of sharing data between them – something that may blow your business case out of the water. Besides, once you get down to the networking level there are lurking points of failure (the endpoint for the IP address) that are very expensive to make highly available – as DoS (Denial of Service) attacks frequently prove.
As Quora pointed out today:
“We’d point fingers, but we wouldn’t be where we are today without EC2.”
AWS has enabled Quora, as a startup, to get going in a manner that suits their business case: a business case that can probably handle some outage, and probably cannot handle the cost of hosting themselves or even across multiple AWS Zones. In fact, Quora has probably received more positive publicity from this failure than it has lost to disgruntled users, so there was no point in building something designed for the failure of an entire region.
So thanks AWS, for reminding us that when you urge ‘Design for Failure’ you mean any kind of failure. What we do with that advice is up to us.