I barely noticed the SimpleDB outage that occurred on 13 June and lasted for a couple of hours. The lack of outcry on the interwebs is an indication that not many people use SimpleDB – unlike when EBS fell over in April 2011 and knocked over most of the startup world.
When architecting cloud solutions, you build things up based on assumptions about the availability and performance of certain components. Even though we have to design for failure, we have to factor in the likelihood of failure when coming up with solutions. We are quite happy, for example, to assume that our web servers will fall over (or terminate if they are being scaled down), but we are far less willing to assume the same of the load balancer, and we build accordingly. Can you imagine how much more difficult it would be to build an application where fundamental components, such as the DNS server or load balancer, were unreliable?
SimpleDB is one of those components that should be more reliable than anything you can build yourself on EC2. On a recent project we placed logging and error information in SimpleDB (and S3), despite having both RDS MySQL and MongoDB available. The argument was that logging to MongoDB or MySQL doesn’t help if the problem is a database connection and, besides, because SimpleDB has a RESTful interface, the entire set of application components could be down and SimpleDB could still be queried. This decision was also based on the assumption that SimpleDB would be available without any design or operational overhead on our part. Does this assumption need to be re-assessed?
The AWS outages are becoming worryingly frequent. Without getting into all the complex statistics of the Mean Time Between Failures of components in series, the frequent failure of individual components (EBS in April and SimpleDB in June) means that the more components you use, the higher the likelihood of a failure occurring. So while EBS or SimpleDB may not fail again for a while, if you have a dependency on the next component to fail, whatever that may be, the average time before failure of your application looks a bit on the low side.
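To put rough numbers on that argument, here is a minimal sketch of the arithmetic for components in series. It assumes failures are independent and failure rates are constant, which real outages often violate, and every figure in it is hypothetical and chosen for illustration only; none of them are published AWS numbers. The function names are my own.

```python
from math import prod


def series_availability(availabilities):
    """Availability of a system that needs every component to be up.

    Assumes component failures are independent of each other.
    """
    return prod(availabilities)


def series_mtbf(mtbfs_hours):
    """MTBF of components in series, assuming constant failure rates.

    Failure rates (1 / MTBF) add up, so the combined MTBF is the
    reciprocal of their sum.
    """
    return 1.0 / sum(1.0 / m for m in mtbfs_hours)


# Hypothetical availability figures, for illustration only.
components = {
    "load balancer": 0.9999,
    "EC2": 0.999,
    "EBS": 0.999,
    "SimpleDB": 0.999,
}

combined = series_availability(components.values())
print(f"combined availability: {combined:.4%}")

# Three dependencies, each with a hypothetical MTBF of 4000 hours.
print(f"series MTBF: {series_mtbf([4000, 4000, 4000]):.0f} hours")
```

Under these assumptions the combined availability is always lower than that of the weakest component, and three dependencies that each fail about once every 4000 hours give an application that fails about once every 1333 hours.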
I hope that AWS continues their openness in (eventually) communicating their outages, but the frequency is worrying. Is AWS succumbing to their own success and unable to handle the demand? Is there a fundamental flaw with their architecture that is showing cracks under load? Are their processes unable to scale, so they aren’t able to monitor problems in order to put remedial actions in place before failure occurs?
Let’s hope that they can sort things out. AWS is not necessarily any more or less reliable than their competitors and is definitely more reliable than your average enterprise data centre. But bad news is bad news, and we don’t want to design for too much failure, nor should we be feeding the private cloud trolls.