Archive for category Availability
The recent outage suffered at Amazon Web Services due to the failure of something-or-other caused by storms in Virginia has created yet another round of discussions about availability in the public cloud.
Update: The report from AWS on the cause and ramifications of the outage is here.
While there has been some of the usual commentary about how this outage reminds us of the risks of public cloud computing, there have been many articles and posts on how AWS customers are simply doing it wrong. The general consensus is that those applications that were down were architected incorrectly and should have been built with geographic redundancy in mind. I fully agree with that as a principle of cloud based architectures and posted as much last year when there was another outage (and also when it was less fashionable to blame the customers).
Yes, you should build for better geographic redundancy if you need higher availability, but the availability of AWS is, quite frankly, not acceptable. The AWS SLA promises 99.95% uptime on EC2 and although they may technically be reaching that or giving measly 10% credits, anecdotally I don’t believe that AWS is getting near that in US-East. 99.95% translates to 4.38 hours of downtime a year or 22 minutes a month and I don’t believe that they are matching those targets. (If someone from AWS can provide a link with actual figures, I’ll gladly update this post to reflect as much). Using the x-nines measure of availability is all that we have, even if it is a bit meaningless, and by business measures of availability (application must be available when needed) AWS availability falls far short of expectations.
I am all for using geographic replication/redundancy/resilience when you want to build an architecture that pushes 100% on lower availability infrastructure, but it should not be required to overcome infrastructure that has outages for a couple of hours every few weeks or months. While it is inevitable that individual AWS fans will defend AWS and point fingers at architectures that are not geographically distributed, an article on ZDNet calling AWS customers ‘cheapskates’ is a bit unfair to customers. If AWS can’t keep a data centre running when there is a power failure in an area, and can’t remember to keep the generator filled with diesel (or whatever), blaming customers for single-building, single-zone architectures isn’t the answer.
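The downtime figures quoted above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function name is made up for illustration, and it assumes a 365-day year and twelve equal months.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_budget_hours(sla_percent: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Maximum downtime (in hours) that an uptime SLA allows over a period."""
    return period_hours * (1 - sla_percent / 100)


yearly_hours = downtime_budget_hours(99.95)
monthly_minutes = yearly_hours / 12 * 60

print(f"{yearly_hours:.2f} hours a year, about {monthly_minutes:.0f} minutes a month")
```

A single outage of a couple of hours therefore uses up roughly half of the yearly budget.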
Yes, I know that there are availability zones and applications that spanned availability zones may not have been affected, but building an application where data is distributed across multiple AZs is not trivial either. Also, it seems that quite frequently an outage in one AZ has an impact on the other (overloads the EBS control plane, insufficient capacity on healthy AZ etc), so the multiple AZ approach is a little bit risky too.
Us application developers and architects get that running a highly available data centre is hard, but so is building a geographically distributed application. So we are expected to build these complicated architectures because the infrastructure is less stable than expected? Why should we (and the customers paying us) take on the extra effort and cost just because AWS is unreliable? How about this for an idea — fix AWS? Tear down US East and start again… or something. How is AWS making it easier to build geographically distributed applications? No, white papers aren’t good enough. If you want your customers to wallpaper over AWS cracks, make services available that make geographic distribution easier (data synchronisation services, cross-region health monitoring and autoscaling, pub-sub messaging services, lower data egress costs to AWS data centres).
Regardless of how customers may feel, if you Google ‘AWS outage’ you get way, way too many results in the search. This isn’t good for anybody. It isn’t good for people like me who are fans of the public cloud, it isn’t good for AWS obviously, and it isn’t even good for AWS competitors (who are seen as inferior to AWS). If I see another AWS outage in the next few months, in any region, for any reason, I will be seriously fucking pissed off.
Getting Real About Distributed System Reliability by Jay Kreps is an interesting post about the perception that distributed systems (and distributed databases) increase reliability because they are horizontally scalable. The reasoning flaw, he points out, ‘is the assumption that failures are independent’.
Failures tend to occur, as is his observation, because of bugs in the software (or in the homogeneous infrastructure) and the addition of redundant nodes does not decrease the likelihood of failure much. We see this continuously with cloud outages – the recent leap day bug that crashed Windows Azure is a good example.
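To make the point about independence concrete, here is a rough sketch of a toy model. It is my own illustration, not something taken from Jay's post: the function, the two-part failure model and the probabilities are all assumptions chosen to show the shape of the effect.

```python
def outage_probability(n_nodes: int, p_node: float, p_common: float) -> float:
    """Probability that every node is down at the same time.

    p_node   -- chance a single node is down for its own, independent reasons
    p_common -- chance of a shared fault (e.g. a software bug) that takes out all nodes
    """
    # Either the shared fault hits, or it does not and all nodes fail independently.
    return p_common + (1 - p_common) * p_node ** n_nodes


for n in (1, 2, 3, 5):
    independent = outage_probability(n, p_node=0.01, p_common=0.0)
    correlated = outage_probability(n, p_node=0.01, p_common=0.001)
    print(f"{n} node(s): independent only {independent:.2e}, with shared bug {correlated:.2e}")
```

With purely independent failures every extra node cuts the outage probability by a factor of a hundred. Once a shared fault is in the picture, the probability never drops below the chance of that shared fault, no matter how many nodes are added.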
I have been doing some work on availability recently and my first availability influencer is quality, followed by fault tolerance (resilience). Redundancy is relevant at the hardware level and is more relevant for scalability than availability. So yes, to achieve availability: quality first, then resilience, and redundancy near the bottom of the list.
I have also been doing work on cloud operations and was intrigued to see that in his post he highlights that the core difficulty is not architecture or design, but operations. I think that he is downplaying architecture, but the ability to operate a complex (distributed) system is a big part of keeping it running. He singles out AWS’s DynamoDB:
This is why people should be excited about things like Amazon’s DynamoDB. When DynamoDB was released, the company DataStax that supports and leads development on Cassandra released a feature comparison checklist. The checklist was unfair in many ways (as these kinds of vendor comparisons usually are), but the biggest thing missing in the comparison is that you don’t run DynamoDB, Amazon does. That is a huge, huge difference. Amazon is good at this stuff, and has shown that they can (usually) support massively multi-tenant operations with reasonable SLAs, in practice.
I tend to agree with that. Rolling your own available platform is going to be hard, and providers of cloud services, such as Amazon or Microsoft, have more mature operational processes to keep things available. It also casts a shadow over self operated cloud platforms (such as CloudFoundry) which have all of the bugs and none of the operational chops to ensure that availability is high.
Go and read Jay’s post. It is required reading for people building cloud applications.
Often, in cloud computing, we talk about availability. After all, we use the cloud to build high availability applications, right? When pressed to explain exactly what is meant by availability, people seem to be stuck for an answer. “A system that is not ‘down’, er, I mean ‘up’” is not good enough, but very common. So I had a crack at my own definition, starting off by describing availability outcomes and influencers. Have a look at Defining application availability and let me know what you think.
One of the key concepts in scalability is the ability to allow for service degradation when an application is under load. But service degradation can be difficult to explain (and relate back to the term) and ‘degrade’ has negative connotations.
The networking people overcame the bad press of degradation by calling it ‘traffic shaping’ or ‘packet shaping’. Traffic shaping, as we see it on the edge of the network on our home broadband connections, allows some data packets to be of a lower priority (such as online gaming) than others (such as web browsing). The idea is that a saturated network can handle the load by changing the profile or shape of priority traffic. Key to traffic shaping is that most users don’t notice that it is happening.
So in a similar vein I am starting to talk about feature shaping, which is the ability for an application, when under load, to shape the profile of features that get priority, or to shape the result to be one that is less costly (in terms of resources) to produce. This is best explained by examples.
- A popular post on High Scalability talked about how Farmville degraded services when under load by dropping some of the in game features that required a lot of back end processing — shaping the richness of in-game functionality.
- Email confirmations can be delayed to reduce load. The deferred load can either be the generation of the email itself, or the result of sending the email.
- Encoding of videos on Facebook is not immediate and is shaped by the capacity that is available for encoding. During peak usage, the feature will take longer.
- A different search index that produces less accurate results, but for a lower cost, may be used during heavy load — shaping the search result.
- Real-time analytics for personalised in-page advertising can be switched off when under load — shaping the adverts to those that are more general.
So my quick definition of feature shaping is
- Feature shaping allows some parts of an application to degrade their normal performance or accuracy service levels in response to load.
- Feature shaping is not fault tolerance — it is not a mechanism to cope when all hell breaks loose.
- Feature shaping is for exceptional behaviour and features should not be shaped under normal conditions.
- Shaped features will be generally unnoticeable to most users. The application seems to behave as expected.
- Feature shaping can be automated or manual.
- Feature shaping can be applied differently to different sets of users at the same time (e.g. registered users don’t get features shaped).
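The definition above can be sketched in code. This is a minimal illustration only, assuming a single load figure between 0.0 and 1.0 is available from monitoring. The `Feature` class, the `is_shaped` and `search` functions and the 0.8 threshold are all hypothetical names and values, not taken from any real framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    name: str
    shape_above_load: float  # load (0.0 to 1.0) beyond which the cheaper variant is served


# Hypothetical threshold: shape search once the system is more than 80% loaded.
SEARCH = Feature("search", shape_above_load=0.8)


def is_shaped(feature: Feature, load: float, registered_user: bool = False) -> bool:
    """Return True when the cheaper, shaped variant of the feature should be served."""
    if registered_user:
        return False  # example policy from the list: registered users keep full features
    return load > feature.shape_above_load


def search(query: str, load: float, registered_user: bool = False) -> str:
    """Serve a cheaper, less accurate result when the feature is being shaped."""
    if is_shaped(SEARCH, load, registered_user):
        return f"approximate results for {query!r}"
    return f"full results for {query!r}"


print(search("holiday", load=0.5))                        # normal conditions
print(search("holiday", load=0.9))                        # under load, anonymous user
print(search("holiday", load=0.9, registered_user=True))  # under load, registered user
```

The decision is kept in one small function so that it could be driven either automatically from a load metric or manually by an operator flipping the threshold.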
So, does the terminology of feature shaping make sense to you?
On Sunday 7 August an AWS datacentre in Dublin was taken out by the ultimate act of God – a lightning strike that wreaked havoc on the power supply. There is something satisfying about lightning being the cause – it is far easier to picture explosions and blue arcs of electricity as a cause for an outage than a badly upgraded router.
News spread faster than it should and the enterprise vendor sponsored media spread their usual FUD about the reliability of cloud services, pointing knowingly to the outage in April as an example of the risks of the cloud.
The key aspect of this outage is that it only affected one availability zone in the EU region. The other two remained operational. We should be happy that not only doesn’t lightning strike in the same place twice, but, more importantly, that AWS has separate power supplies (and lightning conductors) for each AZ.
Those sites that have been affected shouldn’t tweet about the impact that the outage has had on the systems. They should instead keep quiet and hang their heads in shame. The first principle of building available services in AWS is to make use of multiple AZs and a failure in one should not impact the availability of the application.
There is no news here. A datacentre was struck by lightning and applications that were designed for failure happily continued to run in the available AZs. There is no massive cloud failure here, just a reminder of how things should work.
Update: The report of the outage from AWS is here
I barely noticed the SimpleDB outage that occurred on 13 June and lasted for a couple of hours. The lack of outcry on the interwebs is an indication that not many people use SimpleDB – unlike when EBS fell over in April 2011 and knocked over most of the startup world.
When architecting cloud solutions, you build things up based on assumptions about availability and performance of certain components. Even though we have to design for failure, we have to factor in the likelihood of failure when coming up with solutions. We are quite happy, for example, to assume that our web servers will fall over (or terminate if they are being scaled down) but the load balancer less so and we build accordingly. Can you imagine how much more difficult it would be to build an application where fundamental components, such as the DNS server or load balancer were unreliable?
SimpleDB is one of those components that should be more reliable than anything you can build yourself on EC2. On a recent project we placed logging and error information in SimpleDB (and S3), despite having both RDS MySQL and MongoDB available. The argument was that logging to MongoDB or MySQL doesn’t help if the problem is a database connection and besides, because SimpleDB has a RESTful interface, the entire set of application components could be down and SimpleDB could still be queried. This decision was also made on assumptions about the availability, with no design or operational overhead, of SimpleDB. Does this assumption need to be re-assessed?
The AWS outages are becoming worryingly frequent. Without getting into all the complex statistics of the Mean Time Between Failures of components in series, the frequent failure of individual components (EBS in April and SimpleDB in June) means that the more components you use, the greater the likelihood of a failure occurring. So while EBS or SimpleDB may not fail again for a while, if you have a dependency on the next one, whatever that may be, the average time before failure of your application looks a bit on the low side.
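The effect of components in series can be shown without the complex statistics. In this sketch the application is assumed to be up only when every component it depends on is up, and the components are assumed to fail independently. The 99.95% figure per component is an illustrative assumption, not a measured value for EBS or SimpleDB.

```python
from math import prod


def series_availability(availabilities: list[float]) -> float:
    """Availability of an application that needs every listed component to be up."""
    return prod(availabilities)


for count in (1, 3, 5, 10):
    combined = series_availability([0.9995] * count)
    print(f"{count} component(s) at 99.95% each: {combined:.4%} combined")
```

Five such dependencies already bring the combined figure down to roughly 99.75%, which is several times the downtime of any single component.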
I hope that AWS continues their openness in (eventually) communicating their outages, but the frequency is worrying. Is AWS succumbing to their own success and unable to handle the demand? Is there a fundamental flaw with their architecture that is showing cracks under load? Are their processes unable to scale, so they aren’t able to monitor problems in order to put remedial actions in place before failure occurs?
Let’s hope that they can sort things out. AWS is not necessarily any more or less reliable than their competitors and is definitely more reliable than your average enterprise data centre. But bad news is bad news and we don’t want to design for too much failure, nor should we be feeding the private cloud trolls.
I have just read the detailed and very long post mortem from Amazon on their outage in the US-East region. It turns out that it was a combination of operator error and software configuration issues that caused the problem. The post mortem details what happened, what they will do to prevent the issue happening again, a couple of promised upgrades (VPC across multiple AZs for starters), and a new architecture centre to help their customers.
The winners here are the AWS customers, who are promised improvements in Availability Zones, better communication and a whole slew of new webinars to help with architecting for the AWS cloud. All of which, you know, will be delivered in a timely manner and will yet again put AWS even further out in front of the chasing pack.
In addition, Amazon took the opportunity with this post mortem to respond to their critics with a description of how EBS actually works, a promise to improve communication and apologies, and also explained that everyone right up the chain was involved (hence the near silence during the outage).
Humbled but not down, more chasing for the pack …..