
Database Latency is the Achilles Heel of Cloud Computing

Update: This post was written back in March 2011, and cloud support for low-latency database storage has moved on since then. See What DynamoDB tells us about the future of cloud computing and AWS and high performance commodity for updates.

A general theme in the thousands of comments about the Reddit outage is blame laid firmly at the door of Amazon Web Services’ EBS (Elastic Block Store), as put so eloquently by ex-Reddit employee ketralnis:

Amazon’s EBSs are a barrel of laughs in terms of performance and reliability and are a constant (and the single largest) source of failure across reddit.

Without trying to understand the real reasons behind the Reddit outage, or the congregation of pro-Reddit commenters on their blog, I would tend to agree that the failure of AWS’s EBS was a tertiary reason for the outage. The primary reason is having such a high dependency on EBS in the first place.

Databases are known not to scale up cheaply, and the expensive part of scaling up a database is disk I/O performance, be that throughput, latency, availability or any other measure. Scaling out instead comes at the expense of a lot of the goodness of an RDBMS, which is where some of the NoSQL movement has its origins. Postgres, as used by Reddit, is essentially a plain ol’ relational database that depends heavily on disk I/O in order to perform and is not, architecturally, the best candidate data store for hosting on a cheap commodity storage medium such as EBS.

A discussion of whether or not RDS (Amazon Relational Database Service, MySQL on Amazon) is better than Postgres is meaningless, as RDS probably wasn’t available when Reddit made their data store decisions. However, Amazon S3 and Amazon SimpleDB were, and the former has 11 nines of durability. The architecturally prudent approach would be to make best use of specialised AWS services and not rely on EBS for such a key part of the architecture. Does anyone really think that they are going to get enough database performance using a traditional database product on top of a storage service instead of expensive, hand-tuned dedicated storage?
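To put the "11 nines" figure in perspective, here is a minimal Python sketch of the arithmetic. It assumes, purely for illustration, that the durability figure can be read as an independent annual loss probability per object; the function names are hypothetical and not part of any AWS API.

```python
from fractions import Fraction


def annual_loss_probability(nines: int) -> Fraction:
    """Per-object annual loss probability implied by N nines of durability.

    Assumption: 11 nines means 99.999999999% durability per object per
    year, which leaves a loss probability of 10**-11.
    """
    return Fraction(1, 10 ** nines)


def expected_objects_lost_per_year(object_count: int, nines: int) -> Fraction:
    """Expected number of objects lost per year across the whole store."""
    return object_count * annual_loss_probability(nines)


def years_per_lost_object(object_count: int, nines: int) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / expected_objects_lost_per_year(object_count, nines)


if __name__ == "__main__":
    # Ten million stored objects at 11 nines of durability.
    print(years_per_lost_object(10_000_000, 11))
```

Under that assumption, ten million objects works out to roughly one lost object every 10,000 years. Note that durability says nothing about latency or availability, which is the measure a disk-bound relational database actually cares about.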

There may be demand in future, particularly as the enterprise makes more use of AWS, for higher-performance dedicated disk storage inside an AWS data centre, but until then systems should be architected to store data responsibly and be tuned for the particular platform. As architects we do make compromises, but many of the NoSQL databases are resilient to failure or latency, by running multiple instances, making a lot of use of memory and so on. We make those compromises because of fear of vendor lock-in or because we think that the alternative data stores are cool or interesting.

When building sites on top of AWS, or Google, or Azure, use the data store that is optimised for and built into the platform as your first choice in order to get the most responsive and reliable service. A lot of the blame for the Reddit failure rests with the original architects who selected a technology (Postgres) that has such a high reliance on an AWS technology (EBS) that will always underperform. At the time, they may not have foreseen the load and may have expected EBS to get faster, but their architecture, in the AWS context, is not very scalable.

This problem will not go away. Highly available and low latency storage will always be expensive, going against the commodity cloud computing ethos. Architects have tough decisions to make and the consequences of those decisions are far-reaching.


