Archive for category NoSQL
One of the primary influences on cloud application architectures is the lack of high performance infrastructure — particularly infrastructure that satisfies the I/O demands of databases. Databases running on public cloud infrastructure have never had access to the custom-built high I/O infrastructure of their on-premise counterparts. This has led to the well-known idea that “SQL doesn’t scale”, and the rise of distributed databases has been on the back of the performance bottleneck of SQL. Ask any Oracle sales rep and they will tell you that SQL scales very well, pointing to an impressive list of references. The truth about SQL scalability is that it should rather be worded as ‘SQL doesn’t scale on commodity infrastructure’. There are enough stories of poor and unreliable performance of EBS-backed EC2 instances to lend credibility to that statement.
Given high performance infrastructure, dedicated network backbones, Fusion-IO cards on the bus, silly amounts of RAM, and other tweaks, SQL databases will run very well for most needs. The desire for running databases on commodity hardware comes down largely to cost (with availability a secondary influence). Why run your database on hardware that costs a million dollars, licences that cost about the same and support agreements that cost even more, when you can run it on commodity hardware, with open-source software, for a fraction of the cost?
That’s all well and good until high performance becomes commodity. When high performance becomes commodity, cloud architectures can, and should, adapt. High performance services such as DynamoDB do change things, but such proprietary APIs won’t be universally accepted. The AWS announcement of the new High I/O EC2 Instance Type, which deals specifically with I/O performance by having 10 Gb Ethernet and SSD-backed storage, makes high(er) performance I/O commodity.
How this impacts cloud application architectures will depend on the markets that use it. AWS talks specifically about the instances being ‘an exceptionally good host for NoSQL databases such as Cassandra and MongoDB’. That may be true, but there are not many applications that need that kind of performance from their distributed NoSQL databases — most run fine (for now) on the existing definition of commodity. I’m more interested to see how this matches up with AWS’s enterprise play. When migrating to the cloud, enterprises need good I/O to run their SQL databases (and other legacy software), and these instances at least make it possible to get closer to what is possible in on-premise data centres for commodity prices. That, in turn, makes them ripe for accepting more of the cloud into their architectures.
The immediate architectural significance is small; after all, good cloud architects have assumed that better stuff would become commodity (@swardley’s kittens keep shouting that out), so the idea of being able to do more with less is built into existing approaches. The medium-term market impact will be higher. IaaS competitors will be forced to bring their own high performance I/O plans forward as people start running benchmarks. Existing co-lo hosters are going to see one of their last competitive bastions (offering hand-assembled high performance infrastructure) broken and will struggle to differentiate themselves from the competition.
Down with latency! Up with IOPS! Bring on commodity performance!
Beyond the real world behaviour of DynamoDB and its technical comparison to Riak/Cassandra/others, there is something below the technical documentation that gives a clue to the future of cloud computing. AWS is the market leader and their actions indicate what customers are asking for, what is technically possible, and what is a good model for cloud computing. So, here are some of my thoughts on what DynamoDB means in the broader cloud computing market.
Scalability is a service
The most interesting part of DynamoDB has to be the pricing, which allows you to pay for the capacity you need (not what you consume). If you want things to run faster (higher throughput), you buy extra units of capacity. This means that the scalability is wrapped up in the service, rather than the infrastructure, i.e. it is not as fast/slow for everyone; it is faster for those who pay more.
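The ‘units of capacity’ model reduces to simple arithmetic. The sketch below assumes the launch-era sizing rules: one read capacity unit covers one strongly consistent read per second of an item up to 1 KB (an eventually consistent read costs half), and one write capacity unit covers one write per second of an item up to 1 KB — check the current documentation, as these sizes have changed over time.

```python
import math

def read_capacity_units(item_size_kb, reads_per_sec, strongly_consistent=True):
    units_per_read = math.ceil(item_size_kb)      # round up to whole 1 KB chunks
    total = units_per_read * reads_per_sec
    # Eventually consistent reads are billed at half the rate.
    return total if strongly_consistent else math.ceil(total / 2)

def write_capacity_units(item_size_kb, writes_per_sec):
    return math.ceil(item_size_kb) * writes_per_sec

# 500 strongly consistent reads/sec of 3 KB items:
print(read_capacity_units(3, 500))                                  # 1500 units
# The same load with eventually consistent reads costs half:
print(read_capacity_units(3, 500, strongly_consistent=False))       # 750 units
```

Paying for more of these units is literally how you make the service faster, which is what makes the scalability a property of the service rather than of any underlying machine.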
High performance is commodity
One of the fundamental principles of cloud computing is that the base compute units are commodity devices. In cloud computing there is no option for custom high performance infrastructure that solves particular problems, as we often see with on-premise SQL databases. But it is inevitable that these simple commodity units will become higher performing over time, and the SSD basis of DynamoDB illustrates this trend. A service with single digit millisecond response times reframes ‘commodity’.
IaaS is dead
I have pointed out before that AWS is not IaaS and every service that they add seems to push them further up the abstraction stack (towards PaaS). If you want low latency in your stack, pay for a (platform) service, not infrastructure. Which leads to mentioning the EBS infrastructure…
Virtualised storage is too generic and slow
It is interesting that the SSD requirement for the barrel-of-laughs that is EBS has been skipped and those SSDs have been allocated to a less infrastructure-oriented storage mechanism. A lot of the ‘AWS sucks’ rhetoric has been because people have a database of sorts on EBS backed storage, suffering the inevitable performance knock – DynamoDB starts pointing clearly towards the architecturally meaningful and technically viable alternative.
Pricing is complex
While consumption-based cost generally works out better in the long run, it makes estimating and optimising costs really difficult. Unfortunately this makes the cloud computing benefit difficult to understand and articulate, as risk-averse buyers stick with the ‘cost of hosted machine’ model that they are familiar with. There are now so many dimensions to optimising costs (including the problem, presented by DynamoDB, of having to change your capacity requirements based on demand) and stable, complete cost models don’t exist – so working out how much things are going to cost over the lifecycle of an application is really hard.
Competitors are stuck
AWS continues to beat the drum on cloud computing innovation and competitors are left languishing. At some point you almost have to stop counting the persistence options available on AWS (EBS, S3, RDS, SimpleDB, DynamoDB, etc.), where competitors have less to offer. Windows Azure Table Storage, Microsoft’s equivalent key-value store, has barely changed in two years despite desperate pleas for product advancement (secondary indexes, order by).
Vendor lock-in is compelling
As much as there may be a fear of being locked in to the AWS platform, in many cases using DynamoDB is a lot easier than the alternatives. Trying to get Riak set up on AWS (or on-premise) to offer the same functionality, performance and ease of use may be so much hassle, and require such specialised skills, that you may be happy to be locked in.
NoSQL gains ground
DynamoDB seems to offer a credible shared-state solution that also allows for high write throughput; shared, consistent state is something SQL is traditionally good at. The option to set a parameter for strongly or eventually consistent reads is a cheeky acknowledgement that CAP theorem bias is your runtime choice. I don’t see DynamoDB replacing RDS, but it does add more credibility to, and acceptance of, NoSQL/SQL hybrid models within applications.
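That runtime choice shows up as a single flag on the read. The sketch below builds a low-level GetItem request payload by hand to make the point; the table and key names are made up for illustration, and real applications would go through an SDK rather than constructing the payload themselves.

```python
# The ConsistentRead flag is a per-request choice, so the same table can
# serve both strongly and eventually consistent reads side by side.
def get_item_request(table, key_attr, key_value, strongly_consistent=False):
    return {
        "TableName": table,
        "Key": {key_attr: {"S": key_value}},
        # Eventually consistent by default; opt in (and pay more) for strong reads.
        "ConsistentRead": strongly_consistent,
    }

print(get_item_request("sessions", "session_id", "abc123", strongly_consistent=True))
```

In effect, the CAP trade-off stops being a schema-time or deployment-time decision and becomes an argument you pass on each call.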
The Hacker News community that contributed to the adoption of MongoDB is showing dissent, dismay and desertion of the quintessential rainbows-and-unicorns NoSQL database. The fire was set off last week by an anonymous post ‘Don’t use MongoDB’ and, during the same period, the ‘Failing with MongoDB’ post. These posts triggered all sorts of interesting discussions on the Hacker News threads – some trolling, but mostly from people experienced in MongoDB, scalability, databases, open source and startups. It is a good sampling of opinion, technical and otherwise, from people who collectively have an informed and valid view of MongoDB.
The basis of the trouble is that MongoDB, under certain load conditions, has a tendency to fall over and, crucially, lose data. That raises questions about the quality of the code, the involvement of 10gen, and whether or not things will improve over time, or at all. Added to the specific MongoDB concerns, this seems to have cast a broad shadow over NoSQL databases in general.
Below are the links to the relevant posts, with the Hacker News comment thread (in brackets). I urge you to scan through the comment threads as there are some useful nuggets in there.
I have little to add to the overall discussion (there are some detailed and insightful comments in the threads), but would make the following brief observations.
- MongoDB 1.8 and earlier was unashamedly fast, and the compromise was that the performance gain was obtained by being memory-based (commits happen in memory rather than on disk). It was more like a persistent cache than a database for primary data.
- If you absolutely have to have solid, sure, consistent, reliable, error-free, recoverable, transactional and similar attributes on your data, then MongoDB is probably not a good choice and it would be safer to go with one of the incumbent SQL RDBMSs.
- Not all data has to be so safe and MongoDB has clear and definite use cases.
- However, unexpected and unexplained data loss is a big deal for any data store, even if it is text files sitting in a directory. MongoDB could ‘throw away’ data in a managed fashion and get away with it (say, giving up on replication deadlocks), but for it to happen mysteriously is a big problem.
- Architects assessing and implementing MongoDB should be responsible. Test it to make sure that it works and manage any of the (by now well known) issues around MongoDB.
- Discussions about NoSQL in general should not be thrown in with MongoDB at all. Amazon SimpleDB is also NoSQL, but doesn’t suffer from data loss issues. (It has others, such as latency, but there is no compromise on data loss)
- The big problem that I have with using MongoDB properly is that it is beginning to require knowledge about ‘data modelling’ (whatever that means in a document database context), detailed configuration and an understanding of the internals of how it works. NoSQL is supposed to take a lot of that away from you, and if you need to worry about that detail, then going old school may be better. In other words, the benefits of using MongoDB over, say, MySQL have to significantly outweigh the risks of just using MySQL from the beginning.
- Arguably creating databases is hard work and MongoDB is going to run up against problems that Oracle did, and solved, thirty years ago. It will be interesting to see where this lands up – a better positioned lean database (in terms of its use case) or a bloated, high quality one. MongoDB is where MySQL was ten years ago, and I’m keen to see what the community does with it.
I was asked via email to confirm my thoughts on running MongoDB on Windows Azure, specifically the implication that it is not good practice. Things have moved along and my thoughts have evolved, so I thought it may be necessary to update and publish my thoughts.
Firstly, I am a big fan of SQL Azure, and think that the big decision to remove backwards compatibility with SQL Server was a good one that enabled SQL Azure to rid itself of some of the problems with RDBMSs in the cloud. But, as I discussed in Windows Azure has little to offer NoSQL, Microsoft is so big on SQL Azure (for many good reasons) that NoSQL is a second class citizen on Windows Azure. Even Azure Table Storage is lacking in features that have been asked for for years, and if it moves forward, it will do so grudgingly and slowly. That means that an Azure architecture that needs the goodness offered by NoSQL products needs to roll an alternative product into some Azure role of sorts (worker or VM role). (VM Roles don’t fit in well with the Azure PaaS model, but for purposes of this discussion the differences between a worker role and VM role are irrelevant.)
Azure roles are not instances. They are application containers (that happen to have some sort of VM basis) that are suited to stateless application processing – Microsoft refers to them as Windows Azure Compute, which gives a clue that they are primarily to be used for computing, not persistence. In the context of an application container, Azure roles are far more unstable than an AWS EC2 instance. This is both by design and a good thing (if what you want is compute resources). All of the good features of Windows Azure, such as automatic patching, failover, etc. are only possible if the fabric controller can terminate roles whenever it feels like it. (I’m not sure how this termination works, but I imagine that, at least with web roles, there is a process to gracefully terminate the application by stopping the handling of incoming requests and letting the running ones come to an end.) There is no SLA for a single Windows Azure compute instance as there is with an EC2 instance. The SLA clearly states that you need two or more roles to get the 99.95% uptime.
For compute, we guarantee that when you deploy two or more role instances in different fault and upgrade domains your Internet facing roles will have external connectivity at least 99.95% of the time.
On 4 February 2011, Steve Marx from Microsoft asked Roger Jennings to stop publishing his Windows Azure Uptime Report:
Please stop posting these. They’re irrelevant and misleading.
To others who read this, in a scale-out platform like Windows Azure, the uptime of any given instance is meaningless. It’s like measuring the availability of a bank by watching one teller and when he takes his breaks.
Think, for a moment, what this means when you run MongoDB in Windows Azure – your MongoDB role is going to be running where the “uptime of any given instance is meaningless”. That makes using a role for persistence really hard. The only way then is to run multiple instances and make sure that the data is on both instances.
Before getting into how this would work on Windows Azure, consider for a moment that MongoDB is unashamedly fast, and that speed is gained by committing data to memory instead of disk as the default option. So committing to disk (using ‘safe mode’) or to a number of instances (and their disks) goes against some of what MongoDB stands for. The MongoDB API allows you to specify the ‘safe’ option (or “majority” in 2.0, but more about that later) for individual commands. This means that you can fine-tune when you are concerned about ensuring that data is saved. So, for important data you can be safe, and in other cases you may be able to put up with occasional data loss.
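Under the hood, drivers of this era implemented ‘safe mode’ by following each (otherwise fire-and-forget) write with a getLastError command, whose options control how much durability you wait for. The sketch below builds that command document by hand to show the knobs; it is an illustration of the wire-level mechanism, not the pymongo API, and the option names should be checked against the MongoDB documentation for your version.

```python
# Sketch of the getLastError command a driver sends after a write when
# 'safe mode' is requested. The options trade speed for durability.
def get_last_error_command(w=1, journal=False, wtimeout_ms=None):
    cmd = {"getlasterror": 1, "w": w}   # w is a replica count, or "majority" in 2.0+
    if journal:
        cmd["j"] = True                 # wait for the on-disk journal commit
    if wtimeout_ms is not None:
        cmd["wtimeout"] = wtimeout_ms   # don't block forever on slow replicas
    return cmd

# Important data: wait for a majority of the replica set, journaled.
print(get_last_error_command(w="majority", journal=True, wtimeout_ms=5000))
# Throwaway data: skip getLastError entirely and accept possible loss.
```

The per-command nature of this is exactly what makes the fine-tuning possible: the same connection can do unsafe bulk writes and safe critical writes side by side.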
(Semi) Officially MongoDB supports Windows Azure with the MongoDB Wrapper, which is currently an alpha release. In summary, per the documentation, it works as follows:
- It allows running a single MongoDB process (mongod) on an Azure worker role with journaling. It also optionally allows for having a second worker role instance as a warm standby for the case where the current instance either crashes or is being recycled for a normal upgrade.
- MongoDB data and log files are stored in an Azure Blob mounted as a cloud drive.
- MongoDB on Azure is delivered as a Visual Studio 2010 solution with associated source files.
There are also some additional screen shots and instructions in the Azure Configuration docs.
What is interesting about this solution is the idea of a ‘warm standby’. I’m not quite sure what that is and how it works, but since ‘warm standby’ generally refers to some sort of log shipping and the role has journaling turned on, I assume that the journals are written from the primary to the secondary instances. How this works with safe mode (and ‘unsafe’ mode) will need to be looked at and I invite anyone who has experience to comment. Also, I am sure that all of this journaling and warm standby has some performance penalty.
It is unfortunate that there is only support for a standalone mode as MongoDB really comes into its own when using replica sets (which is the recommended way of deploying it on AWS). One of the comments on the page suggests that they will have more time to work on supporting replica sets in Windows Azure sometime after the 2.0 release, which was today.
MongoDB 2.0 has some features that would be useful when trying to get it to work on Windows Azure, particularly the Data Centre Awareness “majority” tagging. This means that a write can be tagged to write across the majority of the servers in the replica set. You should be able to, with MongoDB 2.0, run it in multiple worker roles as replicas (not just warm standby) and ensure that if any of those roles were recycled that data would not be lost. There will still be issues of a recycled instance rejoining the replica set that need to be resolved however – and this isn’t easy on AWS either.
I don’t think that any Windows Azure application can get by with SQL Azure alone – there are a lot of scenarios where SQL Azure is not suitable. That leaves Azure Table Storage or some other database engine. Azure Table Storage, while well integrated into the Windows Azure platform, is short on features and could be more trouble than it is worth. In terms of other database engines, I am a fan of MongoDB but there are other options (RavenDB, CouchDB) – although they all suffer from the same problem of recycling instances. I imagine that 10gen will continue to develop their Windows Azure Wrapper and expect that a 2.0 replica-set-enabled wrapper would be a fairly good option. So at this stage MongoDB should be a safe enough technology bet, but make sure that you use ‘safe mode’ or “majority” sparingly in order to take advantage of the benefits of MongoDB.
Following my post yesterday about the SimpleDB outage that nobody seemed to notice, I thought a bit about why people aren’t using SimpleDB as much as other AWS services. DISCLAIMER: These are personal observations and not based on extensive research or speaking directly to AWS customers. This list also assumes the people in question have kicked the SQL ACID habit. Why people are still using SQL instead of SimpleDB is a completely separate list.
1. The 1K limit for values is too small for many scenarios. While it makes sense in terms of availability to have limited field value lengths, practically it means that SimpleDB won’t work in cases where there is a piece of data (say a product description even) that is of indeterminate length. And the option of putting that little piece of data in to S3 just makes things harder than they should be.
2. You don’t know how much it is going to cost. In time people will become more familiar with an arbitrary measurement unit (such as SimpleDB Machine Hour) for the cost of utility compute resources, but at the moment it is a bit of a wild guess. It gets even more difficult when you think that the ‘Machine Hours’ that are consumed depend on the particular query and the amount of data – meaning that you need to get all of your data in before you can get an idea of the costs. On the other hand you can quickly get an idea of how much data you can process using something like MongoDB on an EC2 instance – which is an easier cost to figure out.
3. Backups are not built in. All data has to be backed up and even if SimpleDB data is redundant, you still need backups for updates or deletes that may be made by mistake. Currently you have to roll your own backup or use a third party tool. SimpleDB needs a snapshot incremental backup mechanism that backs up to AWS infrastructure (S3 or EBS). This should be simple, quick, low latency (within the AWS data centre) and offered as part of the AWS toolset.
4. Pain-free data dumps are needed. Similar to backup, getting data out of SimpleDB needs to be simple and part of the AWS products. A simple API call or web console click to dump a domain to S3, compressed and in a readable format (say JSON or XML), would go a long way toward people feeling less nervous about putting all their data in SimpleDB.
5. The API libraries are simplistic. MongoDB has really good libraries that handle serialisation from native classes to the MongoDB format (BSON), making it feel like a natural extension to the language. SimpleDB libraries still require that you write your own serializer or code to ‘put attributes’. This makes the friction higher for developers to adopt as there is a whole lot of hand rolled mapping that needs to be done.
6. SimpleDB documentation is anaemic. The SimpleDB query language is SQL-ish, but not quite SQL. I have had to look up and scratch around to try and translate from a SQL statement in my head into the SimpleDB variant – without examples built by AWS themselves. Just describing the API is not good enough; AWS needs a lot more documentation to help people build stuff.
7. There are no big reference sites. Various NoSQL databases are popular because of the sites that are powered by them. You can find good examples for MongoDB, CouchDB, Redis or any other NoSQL platform out there. Where is the SimpleDB flagship? And don’t tell me that Amazon.com is one without saying much about how it is being used.
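As one concrete example of the friction in points 5 and 6: SimpleDB stores every value as a string and compares lexicographically, so a numeric range query in its SQL-ish Select language only works if you zero-pad numbers yourself. The sketch below shows the kind of hand-rolled mapping involved; the domain and attribute names are made up for illustration.

```python
# SimpleDB compares strings, not numbers: '9' > '100' lexicographically.
# Zero-padding on write makes string order match numeric order on query.
def pad_number(n, width=8):
    """Encode a non-negative integer so string comparison matches numeric order."""
    return str(n).zfill(width)

def price_range_select(domain, low, high):
    # A Select expression in SimpleDB's SQL-ish query language.
    return ("select * from `%s` where price >= '%s' and price <= '%s'"
            % (domain, pad_number(low), pad_number(high)))

assert pad_number(9) < pad_number(100)   # padding restores numeric ordering
print(price_range_select("products", 50, 150))
```

None of this is hard, but it is exactly the sort of detail that richer client libraries and better documentation would take off the developer’s plate.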
This says nothing about how good SimpleDB is at its primary functions – availability, durability, scalability, consistency and all those other important things that you need in data storage. It is simply a list of things that you would expect from a database product that are obviously lacking in SimpleDB, and these things that are lacking hinder adoption. It may also not be that obvious to a seasoned user that things are missing, as they are long over the architectural consideration and learning curve.
Fortunately there are other storage options on AWS. S3 and RDS are great products and very popular. There are also other databases that can run on EC2 and may migrate into RDS one day. So very few applications need SimpleDB, but they may be missing out on something. I wish AWS would put some finishing touches on their product.
A white paper on “NoSQL and the Windows Azure platform” written by Andrew Brust for the SQL Azure team offers a brief introduction to NoSQL for Microsoft developers. While listing various NoSQL approaches it suggests the following NoSQL technologies for Windows Azure:
- Azure Table Storage
- SQL Azure XML Columns
- SQL Azure Federation
- VM Role
Of those, Azure Table Storage is the only true NoSQL product. Putting documents in XML columns in SQL Azure (SQL Server) is still SQL. SQL Azure Federation (automatic sharding across SQL Azure databases) may be a killer technology for scaling SQL, but it is not yet available and is still very SQLish. OData is simply a protocol. Running a NoSQL database, such as MongoDB, in a VM Role is just plain silly.
The long-winded conclusion of the white paper highlights,
We have seen how the Azure platform supports a full-on NoSQL approach as well as the ability to implement various NoSQL features on an “a la carte” basis
The Azure cloud provides for a spectrum of choice, rather than a single, compulsory methodology. This provides flexibility and protection in a cost-effective, elastic computing environment. And that’s really what “Web scale” should be all about.
If this white paper were published on an obscure developer blog I would barely read it, but as a white paper published and approved by Microsoft, it illustrates the deep problem that Windows Azure and Microsoft have with NoSQL. Windows Azure only has one NoSQL offering in Azure Table Storage, which is good but severely limited (no secondary indexes). It also seems that Microsoft has very little appreciation of the need and demand for better and broader NoSQL technologies.
Microsoft continues to insist that SQL Azure is the best solution for data in the Windows Azure cloud, which is hardly surprising since SQL Server is a big product for them and their developers (just try and imagine what an Oracle cloud offering would look like). SQL Azure itself is an awesome product, but as a SQL RDBMS, not for anything/everything else.
In order to take advantage of all that cloud computing offers, support for various alternative data stores is absolutely necessary. People are building big and scalable systems using a combination of data technologies in order to deliver the optimal solutions in terms of cost, scalability, availability, security etc. In my current project we use MySQL (RDS on AWS), SOLR, MongoDB, S3 and SimpleDB – not for buzzword bingo, but because each one fulfils a specific and important role in the architecture. Only having SQL Azure, Azure Table Storage and Azure Blob Storage is simply not enough.
Microsoft does have interesting products in research. Dryad is their MapReduce implementation (which has been on the fringes for years) and Trinity is their graph database. You would also think that the people who own Bing would be able to rustle up some sort of search engine, and that AppFabric Cache could be tweaked with a ‘DB’ (in the spirit of memcachedb) to give it some persistence.
Update: Dryad seems to be out of research now and in beta as part of Microsoft HPC Pack 2008 R2
Microsoft has smart people and good technologies but needs to get these out onto Windows Azure. Since it is PaaS, it can’t be left up to customers and has to be baked into the platform. Unfortunately, Microsoft has their SQL Server market, customers and developers to protect, so I imagine that it will be a while, perhaps too late, before we see a breadth of NoSQL technologies on Windows Azure.