Archive for category IaaS
One of the primary influencers on cloud application architectures is the lack of high performance infrastructure — particularly infrastructure that satisfies the I/O demands of databases. Databases running on public cloud infrastructure have never had access to the custom-build high I/O infrastructure of their on-premise counterparts. This had led to the well known idea that “SQL doesn’t scale” and the rise of distributed databases has been on the back of the performance bottleneck of SQL. Ask any Oracle sales rep and they will tell you that SQL scales very well and will point to an impressive list of references. The truth about SQL scalability is that it should rather be worded as ‘SQL doesn’t scale on commodity infrastructure’. There are enough stories on poor and unreliable performance of EBS backed EC2 instances to lend credibility to that statement.
Given high performance infrastructure, dedicated network backbones, Fusion-IO cards on the bus, silly amounts of RAM, and other tweaks, SQL databases will run very well for most needs. The desire for running databases on commodity hardware comes largely down to cost (with influence of availability). Why run your database on hardware that costs a million dollars, licences that cost about the same and support agreements that cost even more, when you can run it on commodity hardware, with open-source software for a fraction of the cost?
That’s all very fine and well until high performance becomes commodity. When high performance becomes commodity then cloud architectures can, and should, adapt. High performance services such as DynamoDB do change things, but such proprietary APIs won’t be universally accepted. The AWS announcement of the new High I/O EC2 Instance Type, which deals specifically with I/O performance by having 10Gb ethernet and SSD backed storage, makes high(er) performance I/O commodity.
How this impacts cloud application architectures will depend on the markets that use it. AWS talks specifically about the instances being ‘an exceptionally good host for NoSQL databases such as Cassandra and MongoDB’. That may be true, but there are not many applications that need that kind of performance on their distributed NoSQL databases — most run fine (for now) on the existing definition of commodity. I’m more interested to see how this matches up with AWSs enterprise play. When migrating to the cloud, enterprises need good I/O to run their SQL databases (and other legacy software) and these instances at least make it possible to get closer to what is possible in on-premise data centres for commodity prices. That, in turn, makes them ripe for accepting more of the cloud into their architectures.
The immediate architectural significance is small, after all, good cloud architects have assumed that better stuff would become commodity (@swardley’s kittens keep shouting that out), so the idea of being able to do more with less is built in to existing approaches. The medium term market impact will be higher. IaaS competitors will be forced to bring their own high performance I/O plans forward as people start running benchmarks. Existing co-lo hosters are going to see one of their last competitive bastions (offering hand assembled high performance infrastructure) broken and will struggle to differentiate themselves from the competition.
Down with latency! Up with IOPS! Bring on commodity performance!
The question: should you use S3 to host a high volume static web site or should you configure and operate a load balanced and auto-scaling EC2 Multi A-Z cluster?
- Ignore cost of storage. You get loads bundled on EC2 and a couple of GB is only a few cents on S3. This is not true for big media streaming of course.
- Bandwidth costs are the same for S3 and EC2 and can be excluded from the comparison.
- Management costs are 1:1 with VM costs. This in turn assumes existing management infrastructure and people are in place and this website is an incremental requirement.
- EC2 will require two instances running. Ideally one in each Zone to achieve a vaguely similar sort of availability target as S3.
- A home page could require circa 100 GET requests (perhaps overdoing it a little bit)
- A UK only web site may only be truly busy for 12 hours per day.
- The cost of a “Heavy Utilisation 1 Year Reserved Instance Small Linux Instance in EU” is $45.65 per month. Two instances: $91.30 per month. Total managed cost: $200 per month.
You would have to make 200,000,000 GET requests per month to reach $200. That is 2,000,000 Page Views per month. Considering 12 hours per day: 91 pages per second. This is a small load shared between two web servers only serving static content. Surely within the reach of two small Linux instances – in fact shouldn’t they serve 10x that volume for the same price?
Because S3 sites are so incredibly simple to setup and have high availability, scalability and performance baked in you can’t possibly justify building up EC2 based web servers at low page volumes. However, the primary cost of S3 is down to GET requests and there are no price/volume breaks in the pricing of GET requests. The costs scale with the volume of requests in a linear way and much faster than they do if you were to build your own EC2 fleet.
If you don’t already have a management function in place for EC2 then the level of scale needed to justify this expense would be considerably higher. The big benefit of S3 sites is in the “static web site as a service” element – i.e. a highly scalable, available, high performance, simple and managed environment.
The linear relationship between scale and costs whilst a disadvantage in one way could be seen as an advantage. The S3 site scales from 0 to Massive and back again instantly. It can remain dormant for days and then fend off massive media focus.
However, I was surprised to see that this wasn’t as cut and dried in favour of S3 as I’d assumed or hoped.
Clouds need Self Service Portals (SSP). I often wonder to whom “self” refers to and I think it would help a lot if people clarified that when describing their products. Is it a systems administrator, a software developer, a sales manager?
I have just read the Forrester report “Market Overview: Private Cloud Solutions, Q2 2011 by James Staten and Lauren E Nelson” which is actually pretty good. They cover IaaS private cloud solutions from the likes of Eucalyptus Systems, Dell, HP, IBM, Microsoft, VMWare etc. What I particularly liked is the way they interviewed and asked the vendors to demonstrate their cloud from a user centred perspective: “as a cloud administrator do…” , “as an engineering user perform…” , “logged in as a marketing user show…”. This moves the conversation away from rhetoric and techy details about hypervisors to the benefits realised by the consumers.
If it doesn’t look like a duck or quack like a duck it probably isn’t a duck.
Forrester have also tried to be quite strict in narrowing down the vendors they included in this report because, frankly, things weren’t passing the duck test. They also asked them to supply solid enterprise customer references where their solution was being used as a private cloud and they found: “Sadly, even some of those that stepped up to our requests failed in this last category”.
Good. Let’s get tough on cloud-washing.
This week the Interweb has been a flutter with the excitement of the release of CloudFoundry . It does sound nice but I feel that it being touted as portable PaaS is not quite how it should be sold.
Never mind the fact that PaaS and IaaS are pretty much terms that are now long past their sell by date so to continue to use them when you want to be ahead of the pack is a bit strange as a marketing ploy anyway.
James (@jamessaull) made a succinct observation when he said that CloudFoundry sounds like PaaS for service providers I have to agree. Forget the technology it’s how much actual work is needed to feed & water the solution . So if I find myself having to worry about the underlying plumbing it’s defiantly not PaaS as per the current dictionary definition. But if someone else was to host the plumbing for me and then all I have to worry about is my application then that’s a whole different scenario with a lot less support & maintenance.
In all this noise it seems some have forgotten that Microsoft have tried to meet this aim of you not having to worry about the plumbing ( ignoring the VM Role) with Windows Azure and yes they have a ‘cloud’ that can sit on your desktop for development purposes too.
I welcome another player to the park especially one touting openness but there are already players out there who shouldn’t be forgotten especially as they have a head start in winning the minds of large corporates whose requirements are quite different from those of the developer community.
To quote @swardley on twitter: “CloudFoundry on OpenStack built on an environment designed using OpenCompute – you can just feel the goodness.”
You know what? I feel just as energised about the announcements in the last week. However I feel the need to comment…
One. Recently OpenStack (and OpenCompute) have been really driving the game forward. I note that CloudFoundry is an open source project and not a Spring project. Cynically this would look like a cheap way to buy industry kudos by spinning off old IP going to waste.
Two. PaaS, to me, implies the following (using AWS Simple Queue Service as an example):
- Increased value over IaaS usually through higher levels of abstraction. I have an API to create/remove queues and to add/remove messages. As a consumer I am not aware of how virtual machines, networks and storage are all orchestrated, scaled, patched etc.
- Described by its service and not its components, and therefore billed for the utility of the service and not the components. I am billed for the messages sent/received and the bandwidth consumed.
- An SLA. Of the Service not the components. If it is made up of VMs don’t let that leak through the abstraction.
- A suite of management and monitoring capabilities that relate to the service and not the components. E.g. messages per second, queue length, latency etc.
SQS is just one example of many. But to build, multi-tenantise (yikes), operate, document, monitor, backup, recover, replicate across data centres in a high redundancy high availability, you-name-it-way is a very big distance from the original queuing technology it might be based on.
So, whilst I am thrilled to see the likes of MongoDB being part of CloudFoundry it is a very long way from a PaaS announcement surely? Let me qualify that. A very long way from PaaS from a consumer perspective. As a Service Provider it might help to have a set of standards and a broad cooperative ecosystem on which to help me construct a PaaS on top of of my IaaS. But as a consumer this does nothing as it leaves me with having to take the underlying technology and using all my skills in IaaS to construct the PaaS and a commercial model around it.
I look forward to blueprinted best practice telling me how to deliver CloudFoundry application platform services atop commodity compute utilities such as AWS / OpenStack. For example, how will I deploy, operate and commercialise Database as a Service using MongoDB as my kernel in a high availability, multi-tenant (you get the idea) fashion? How will I wire it up to the monitoring and billing engine and ensure true elasticity? How will I guarantee that it is isolated from other tenants?
This is the hard part and of course where the value in the “P” is.
Clearly this is a journey, and I want to be excited but I am struggling to see how we really got much further than IaaS with these announcements.
At best it sounds like vCloud for PaaS – i.e. an API spec for queuing as a service, database as a service etc.
Update: A more recent follow-up to this post – AWS leads in PaaS v.Next
It is commonly accepted, when using the IaaS/PaaS/SaaS taxonomy, that AWS is clearly IaaS. After all, if you can run whatever you like on your favourite flavour of OS, then surely AWS is simply offering infrastructure? The common knowledge doesn’t seem to come from AWS themselves (although they don’t overtly deny it) and I have tried to find a document available on their website that classifies them as such. If you can find a document by AWS referring to their services as IaaS, then please provide a link for us in the comments.
This assumption results in interesting behaviour by customers and the rest of the market. Patrick Baillie from CloudSigma is running a series of posts ominously titled Death of the Pure IaaS Cloud where he takes a swipe or two directly at AWS and, using Amazon in his examples, concludes;
So, what might seem like a pretty innocuous decision for a cloud vendor to offer some PaaS alongside its core offering can actually mean less choice and innovation for customers using that cloud vendor.
While it is true that AWSs could, virtually without warning, release a new services that undermines one of the customers’ business models, that is only ‘unfair’ if the business committed to the platform making the incorrect assumption that AWS is pure IaaS.
Maybe there was a lot of IaaS in AWSs distant past, but since it has become mainstream it hasn’t been IaaS. As a user of AWS, I don’t see the bulk of their services as infrastructure. I see them as some sort of platform, if you will.
Take S3 (Simple Storage Service) for example. I interact with it using a proprietary API (from AWS) rolled up in my framework of choice and I don’t simply receive storage. I receive a good model of handling security tokens, object accessible via the web (via my own domain name), logging, 11 nines of durability, automatic multithreaded chunking of large files and, with the click of a button, a CDN thrown in. That is so far from storage infrastructure (logical disk or SAN) that it cannot be called infrastructure. EC2 may be considered the most infrastructure-y part of AWS, but there is a whole lot bundled with EC2 such as load balancing and autoscaling which makes EC2 less ‘virtual machine infrastructure’ than you would think.
The fear of AWS as the gorilla that locks customers into it’s platform and, by virtually owning the definition of cloud computing, be able to have huge growth and market penetration must be of concern to it’s competitors. Perhaps, as Patrick points out, it’s market dominance and business practices may negatively affect innovation and influence consumer choice. This is something that we are used to in IT with market gorillas such as IBM, Microsoft, Oracle, Google, Apple and even Facebook. We, as IT consumers, have learned to deal with them (maybe not Facebook yet) and we will learn to deal with AWS.
Competing with AWS by attacking their IaaS credentials will fail, they are simply not an IaaS vendor. These aren’t the droids you’re looking for, move along.
Focusing on the economics of SaaS depends on a couple of things: efficiency and high resource utilisation. In other words the SaaS supplier wants every one of their assets to be turned to the greatest amount useful work whilst being able to endlessly scale to accommodate more and more customers; ideally the increasing volume should reflect a decrease in cost to serve each.
Multi-tenancy is one way to statistically multiplex a high volume of non-correlated, efficiently executed workloads to deliver very high customer:asset density – thus leading to high utilisation (only of course where economic well being is known to be a function of utilisation). SaaS usually has another property – self service, especially at sufficient scale. Scaling costs linearly (or worse) with consumer volumes is not a good business. Therefore customer provisioning has to be highly automated, swift and efficient.
This introduces customer-to-asset density. Imagine you chose to serve 10 customers with your enterprise web application. You might follow this evolution starting with the ludicrous-to-make-a-point:
- Procure 10 separate data centres, populate with servers, storage, networking, staff and so forth. Connect each customer to their data centre.
- Procure 1 data centre and 10 racks…
- Rent space for 10 racks in 1 data centre…
- Procure 1 rack with 10 servers…
- Procure 1 server with 10 Virtual Machines…
- Procure 1 server and install the application 10 times (e.g. 10 web sites)…
- Procure 1 server and install the application once and create 10 accounts in the application…
As we go down the list we are increasing the density and utilisation (assuming application efficiency is constant) and we meet different challenges on the way. At the start we have massive over-capacity, huge lead times and the expensive service is dominated by overheads. By stage 4 we have some real multi-tenancy happening via shared networking, storage and physical location etc.
At Stage 5 we notice that, when idle, each Virtual Machine is consuming plenty of RAM and a small amount of CPU. Even when busy serving requests each application is caching much the same data, loading the same libraries into memory etc. The overheads are still dominant. Might not sound too bad for 10 customers, but 10,000 customers each given 1GB of RAM for the operating system means 10TB of RAM. Those customers could be idle most of the time but you are still running 10,000 VMs just in case they make one little HTTP request…
So we are motivated to keep on pushing for more and more density to reach higher levels of utilisation by removing common overheads. Soon we are squarely in the domain of over-provisioning. This is where we know we can’t possibly service all our customer’s needs all at once and we get into the tricky business of statistical multiplexing: one man’s peak is another man’s trough. When one customer is idle another is busy. The net result: sustained high utilisation of resources. This takes a lot of management and some measure of over-capacity to ensure that statistical anomalies can be met without breaching SLA. Just how much over-capacity though?
All of a sudden, as a software services provider, you became embroiled in the complexity of running a utility. Building commercial models that take into account usage, usage patterns, load-factors and trying to incentivise customers to move their work away from peak times, trying to sell excess capacity by setting spot prices to keep utilisation high and all assets earning some revenue.
In reality this is the realm of utility compute providers and by running thousands of different workloads across many industries and time zones they stand a far better chance of doing it. Electricity providers provide power to railways and hospitals alike, but they don’t try to sell you a ticket or give you an x-ray. By reverse, railways and hospitals don’t try to generate their electricity. Not a new point, and not a new analogy.
Above – a quick sketch showing how today it is very hard to achieve very high utilisation, but the advent of increasingly complete PaaS offerings and “web-scale” technologies this is rapidly changing.
SaaS Maturity Model Level 1
Back to pushing along the density and utilisation curve. Public IaaS has solved the “utility problem” element of the task: without any capital expense I can now near-instantly, 100% automate the provisioning of new customers. Using AWS (as an example) and CloudFormation / Elastic Beanstalk, and maybe a touch of Chef, each customer gets their own isolated deployment that can independently scale up and down. Almost any web application can fit into this model without much re-engineering and by normal standards this is quite far along the utilisation curve. Every month the AWS bill for each customer’s deployment is sent to them with some margin added for any additional services provided such as new features, application monitoring, support desk etc.
Common technologies such as Chef, Capistrano etc. allow us to efficiently keep this elastic fleet of servers running the latest configurations, patches and so forth. Each customer can customise their system and move to newer versions on their own schedule – inevitably the service becomes customised or integrated to other software services and not all meaningful changes are trivial and non-breaking!
This simple model uses traditional technologies and is little more than efficient application hosting. The upside is that it isn’t much of a mental leap and requires off the shelf components and skills.
But is that enough? Are there cost benefits to going deeper?
SaaS Maturity Model Level 2
What if most customers are not BIG customers whose demand is always consuming a bare minimum of 2 Virtual Machines? What if a huge number of customers can barely justify 5% of a single VM? The task is now to multi-tenant each VM. A simple model would be to install the application 32 times on one VM and 32 times on another and then load balance the two machines. Each customer application instance would consume 2.5% on each machine (making 5%) leading to a total of 80% utilisation of each VM. An Auto Scale Elastic Load Balancer could always provision some more machines if needs be for burst / failure occasions.
This is still a simple and easy-to-achieve model and feels a bit like that story about filling a bucket with stones of different sizes. You might only get one big stone in a bucket, but you could get 10 smaller stones and 20 pebbles instead. But what about sand that represents the long tail of customers. Would you install the application 1000 times on a VM? How do you know the difference between a stone, a pebble and a grain of sand in a self service world? Are you going to run a really sophisticated monitoring and analytics system to work out the best way to fill your buckets, adjusting it frequently?
Still a simple model, and probably fit for purpose for a whole category of applications, workloads and customer segment. Importantly it relied mainly on IaaS automation and standard applications. However it has limitations and begins to lead back to many of the problems that face a utility provider – just at different layer.
SaaS Maturity Model Level 3
An improvement would be to deploy the application once into a single VM image sat behind an Auto Scaling Elastic Load Balancer (ASELB). The single application would use a discriminator (such as a secure login token or URL) and this discriminator would flow all the way through the application. Even database access would, for example, be only achieved via database views. These views would be supplied with the discriminator to ensure that each application only ever worked with its own data set.
A huge number of customers of all sizes would be simply balanced across a huge fleet of identical machines establishing very high levels of utilisation. It probably required some hefty application re-engineering to ensure tenant isolation and maybe even cryptography in some circumstances to ensure trespassing neighbours would only find encrypted data and only the customer supplies the key at login. It has to be said there aren’t many obvious pieces in the common application frameworks to help either – e.g. the discriminator concept being strictly enforced.
Whilst multi-tenant web applications can scale wonderfully the database comes into sharp focus. Elastic scaling up and down on-demand from a CPU, IO and storage perspective just isn’t a strong suit. Clearly some further re-engineering can shift the workload towards the application tier via techniques such as caching or data-fabrics that attempt to behave as in memory replicated databases. Perhaps the way the application works can be broken apart to employ asynchronous queuing in places to better level the load against the database. Either way, the classic database could continue to be barrier.
SaaS Maturity Model Level 4
At level 3 we managed to have one super large database containing all data for all customers, using a discriminator to isolate the tenants. It is very likely that unrelated data entities were placed into separate databases. Instead of having one server scaling we could have several. It is also possible that analytic workloads were also hived off on to separate servers. Maybe we even used database read-replicas to take advantage of the applications heavy read:write ratio. We probably also dug deep into modify the database transaction isolation levels and indexing. We maybe even separated out “search” into another subsystem of specialised servers (e.g. SOLR).
Sharding came next – attempting to split the database out even further. Maybe each customer got their own collection of databases, and each database server ran several databases. This is a bit like the Bucket analogy again – how do you balance the databases across servers? Some customers may have large amounts of data, but very few queries. Some may have little data but hundreds of users bashing away and running reports. Whilst this is a tough balancing act, it has a huge impact on your commercial model. What is expensive for you to serve? Data volumes? Reports? Writes? Accessing the product tables or the order history tables?
Whilst trying to balance all the customers and shift them around you are also trying to scale up and down with usage. Databases have very coarse controls, and shifting databases around and resizing them can mean downtime.
This is the sort of problem that the controlling fabric in SQL Azure solves and it is far from trivial!
Other partitioning schemes also exist such as customers A-F on one server, G-L on another etc. This too runs into issues of re-partitioning and rebalancing during growth/shrinkage and when hotspots arise. The classic old MySpace example is when Brad Pitt sets up a public page and all of a sudden the “B” partition attracts more heat than all the other partitions put together.
Level 4 is marked by exploring beyond the RDBMS as the only repository type. At this point it is key to thoroughly understand the data, access patterns, the application requirements and so forth. Are some elements better served by a graph database, a search technology, a key-value store, a triple store, a json store etc. A thorough scrutiny and knowing the options is vital. When Facebook look to introduce a new capability they don’t just try and create a 3rd normal form schema and put it into Oracle. That thinking is no longer sufficient.
Repositories that support scale-out, multiple-datacentre replication, eventual consistency, massively parallel analytics etc. This is an increasingly well trodden path and many of the technologies have come out of companies that have demonstrated they can solve this problem: Google, Facebook, Amazon, Microsoft etc. Names such as BigTable, SimpleDB, Azure Table Storage, MongoDB, Cassandra, HBase, Hadoop to name some popular ones.
SaaS Maturity Model Level 5
Level 4 saw some sophisticated re-engineering work to try and transform some regular technology to elastically scale with high utilisation levels at both the application and data layer and remain manageable!
The next level is probably the biggest discontinuity and probably doesn’t really exist yet as it is a blend of Google and Amazon. It takes a bold step and ignores the old notion of a Virtual Machine as a clunky interim solution. Applications don’t require operating systems, they require application platforms. Google’s App Engine is possibly the best example of an Application Platform. Even Azure as another popular PaaS makes it clear that you are renting Virtual Machines and that these are your units of scale, failure, management and billing. With GAE you pay for resource consumption – the number of CPU cycles your code executes – not how many VMs you have running and have desperately tried to fill up. This is like grinding up all the stones and pebbles and just pouring sand in to the bucket.
In order for this model to work the applications have to exhibit certain characteristics necessary for performance and scale, and the platform enforces it. For many this is a painful engineering discontinuity – and probably means a complete ground up application re-architecting.
PaaS provides the same fine-grained and utterly elastic data store capability too. The same goes for all the other subsystems such as queues, email, storage etc. The key point here is that each element of the application architecture is made up of Application Platform Services that have each solved the utility problem, are endlessly elastically scalable and charge for the value extracted (e.g. message sent, bytes stored, CPU cycles consumed not CPU cycles potentially used).
There is a reason people use IaaS as a foundation for building up PaaS and how SaaS sits more comfortably atop PaaS. Whilst GAE is probably one of the most opinionated PaaS architectures and suits a narrow set of use cases today, no doubt it will improve over time. Amazon and Microsoft will also continue to abolish the Virtual Machine by providing an increasingly complete Application Platform e.g. Simple Queue Service, Simple Notification Service, SimpleDB, Simple Email Service, Route53. At the same time Facebook, Yahoo and others may continue to open source enough technologies for anyone to build their own!