Posts Tagged Operations
When extolling the virtues of cloud computing we often proudly mention ‘infinite scalability’ without giving a moment’s thought to the infinitely large bill that comes as an unfortunate side effect. Take a scenario where an application is under load and needs to be scaled up significantly.
Ops: We have noticed a massive spike in traffic, what should we do?
Business: If you add more nodes, will it handle the traffic?
Ops: Yes, but it will cost eleventy dollars an hour.
Business: Any idea where the traffic is from?
Ops: We’re seeing a lot of referrals from Google ads.
Business: Okay. Add the nodes, I’ll get back to you in an hour.
<two hours later>
Business: It seems that an AdWords campaign is underway, so leave the nodes running.
Ops: How long will it run for?
Business: It will run for the rest of the week.
Ops: Okay, I’ll put in some bids for reserved compute for the rest of the week. It should bring the costs down by 7%.
Business: Great, but keep an eye on the load and shut some of the nodes down on Friday if the traffic is still high.
The point of the above hypothetical scenario is to illustrate that although we can scale based on technical metrics, those metrics should only be used for temporary scaling. It is only once we have input from business that we can make better scaling decisions.
Current autoscale mechanisms are based on simple technical metrics, not on metrics that interface with the business. The interface with business still involves people and cannot be automated. This means that, for all the fancy scalability we have in our cloud platform, the process of scaling itself is not scalable.
With cloud computing we want to automate as much as possible and have come a long way (for example, automating the recycling of virtual machines), but the ability to autoscale is still immature. Individual applications may implement some of this in a bespoke fashion, but frameworks and environments need to exist in order to provide a standardised and simplified method to interface with the business, so that we can embed the functionality in our applications.
Describing the simple (and simply drawn) process below:
- We still need autoscaling as we want to respond immediately, but the subsequent steps are the most crucial.
- We need to know if there is a good reason for the current load. Perhaps a planned campaign is underway, so we need to look in the systems that hold special offers and sales, vouchers and advertising campaigns, or even at customer complaints in the CRM system.
- We should also be able to determine the business value of the load. Perhaps we can look to see if the increase in traffic results in extra orders, extra bookings or fewer call centre calls. (This may need to be delayed so that we can see the impact of the extra load as it propagates through various systems.)
- We then make a decision as to whether there is any value in maintaining the added scale or if the dial should be turned back and service degraded.
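As a rough illustration of the last three steps, here is a minimal Python sketch. Everything in it is an assumption made for the example: the names `LoadContext`, `Decision` and `decide` are hypothetical, the inputs would in practice have to come from campaign, order and CRM systems that are not modelled here, and the rule of comparing extra value per hour with extra cost per hour is just one plausible policy, not a recommendation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    KEEP_SCALE = "keep the added scale"
    SCALE_BACK = "turn the dial back"


@dataclass
class LoadContext:
    # Hypothetical inputs; a real system would pull these from business systems.
    known_reason: bool                     # e.g. a planned campaign was found
    extra_value_per_hour: Optional[float]  # None while the impact is still propagating
    extra_cost_per_hour: float             # cost of the nodes added by autoscaling


def decide(ctx: LoadContext) -> Decision:
    """Assumed policy: keep the scale only while it appears to pay for itself."""
    if ctx.extra_value_per_hour is None:
        # Business value is not measurable yet, so fall back on whether
        # the business knows of a good reason for the load.
        if ctx.known_reason:
            return Decision.KEEP_SCALE
        return Decision.SCALE_BACK
    if ctx.extra_value_per_hour > ctx.extra_cost_per_hour:
        return Decision.KEEP_SCALE
    return Decision.SCALE_BACK


if __name__ == "__main__":
    # A campaign is known, but orders have not shown up in the figures yet.
    print(decide(LoadContext(True, None, 11.0)).value)
```

The autoscaler itself is deliberately left out: it still reacts first on technical metrics, and a check like this would run afterwards to confirm or reverse that temporary decision.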
If anyone has built something like this that they have applied generically to their applications, tell us about it in the comments. Likewise, if anyone is building (or has built) such a framework, we would be interested in hearing about it.
It is probably a few years down the line, but these are the sorts of issues that we need to focus on once we have the simple stuff (like keeping n instances running) sorted out. The value of cloud computing has to be arguable and articulated to business, and until we can show how scaling works in a way that makes sense to business, they will never get what we mean by ‘infinite scalability’.
Even a hiatus of a month can make you feel that you’ve fallen behind when working with cloud services such as AWS and Azure. I’ve found that there is no need to panic, though: just treat the fact finding involved in exploring the new features as a spike.
A spike is basically a time-boxed period where you can explore new features and validate areas where you are unsure. Spikes are generally thrown away and the validated bits introduced as part of a new user story. (You can look up the textbook definition, but for the purpose of this brief post that will do.)
I have found that those from a development background readily embrace this approach, whereas those from a more traditional operational background tend to be more reluctant. The onus must, in my opinion, be on the dev team to start working more closely with the operational team and get them to start thinking like developers do when developing in an agile environment.