Even after all this time I hear statements like “Oh, we can just run our code in the cloud”. This is the core of the lift and shift school of cloud usage.
And these people are perfectly correct; they can just run their stuff in the cloud. But it won’t work so well.
I’ve previously written about lift and shift issues, but here I want to focus on the “resiliency” issue. People get annoyed when I point out that their design is unreliable and subject to failure modes.
“Oh but the cloud is reliable!” they cry. “Major companies like Netflix and even Amazon themselves never go down.”
If only it were that simple…
In a traditional datacenter environment a lot of work has been put in place to make the underlying infrastructure reliable. We spend lots of money trying to get to that mythical “five nines” reliability. We build clusters of servers and use VMware vMotion to migrate workloads around the cluster so that nodes can be taken down for maintenance. We dual-connect machines to independent network switches. Similarly for SAN fabrics. And, on top of all that, we then duplicate all this in another datacenter and configure our application stacks for HA (e.g. Oracle data replication, global load balancing).
This is expensive to build, hard to maintain, and very easy to get wrong. Who here hasn’t been stuck on a “DR failover” call where the DR component didn’t start up properly because testing assumed a clean shutdown of the primary… something that doesn’t happen in a real failure?
No wonder people want to move to the cloud; it’s so much more reliable! Except…
The cloud is not reliable
If you look at SLAs from major Cloud Service Providers (CSPs) then you’ll see they talk about a service SLA. Let’s take Amazon EC2 as an example:
The Service Commitment does not apply […] that result from failures of individual instances or volumes not attributable to Region Unavailability
Basically, Amazon provide a “four nines” SLA for a region, not for individual running instances.
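To see what those “nines” actually buy you, it helps to turn the percentages into a downtime budget. A quick sketch (illustrative arithmetic only, not any CSP’s actual SLA terms):

```python
# Rough downtime budgets implied by common availability targets.
# Illustrative arithmetic only -- not any CSP's actual SLA wording.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for label, availability in [("two nines", 0.99),
                            ("three nines", 0.999),
                            ("four nines", 0.9999),
                            ("five nines", 0.99999)]:
    downtime = MINUTES_PER_YEAR * (1 - availability)
    print(f"{label:>11}: {downtime:8.1f} minutes of downtime per year")
```

A four-nines region still allows roughly 52 minutes of unavailability a year, and that commitment says nothing about how often your individual instance goes away.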
Your running CSP instance is probably less reliable than the VM you run in-house. Maintenance is unlikely to happen on your schedule!
Plan for failure
This is a core requirement for moving to the cloud. If you want cloudy infrastructure to be as resilient as your on-premises stuff then you need to build solutions that match the existing patterns and designs. You may no longer have the ability to do things like vMotion, so how will you take instances down for patching? You will need to replicate databases, even cross-region. You’ll still have requirements for DR testing, and this will be even more important, because of the lower underlying resiliency.
So now you have more VMs, more data transit charges, more headaches… this cloud lark isn’t looking so friendly now, is it?
Or you can look at how companies like Netflix actually get their reliability; their application assumes failure.
This turns the existing reliability model on its head; instead of having four-nines infrastructure that the app can rely on (and so mostly ignore), assume we have three-nines (or even two-nines!) and the application has to make up the difference.
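The reason this works is simple probability: if the application can fail over between independent instances, unavailabilities multiply. A minimal model (assuming truly independent failures and instant failover, which real correlated outages such as a whole-region event will violate):

```python
# Combined availability of N redundant instances, assuming the application
# can fail over between them. Simplified model: failures are independent
# and failover is instant -- real correlated outages break this assumption.

def combined_availability(per_instance: float, n: int) -> float:
    """1 - (per-instance unavailability) ** n"""
    return 1 - (1 - per_instance) ** n

# Two "two nines" (99%) instances together behave like "four nines":
print(combined_availability(0.99, 2))   # ~0.9999
# Two "three nines" instances approach "six nines":
print(combined_availability(0.999, 2))  # ~0.999999
```

So the application layer really can make up the difference: redundancy plus failover logic turns two mediocre instances into better availability than one gold-plated one.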
Developers now take on the majority of the resiliency requirements. There’s a whole new set of programming patterns that need to be used: keeping state externalised, using the circuit breaker pattern, providing service state visibility… Tooling such as Netflix Hystrix can help with this.
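To make the circuit breaker pattern concrete, here’s a minimal sketch (a hypothetical toy, not Hystrix itself): after enough consecutive failures the circuit “opens” and calls fail fast, giving the struggling downstream service time to recover instead of being hammered by retries.

```python
import time

class CircuitBreaker:
    """Toy circuit breaker sketch -- illustrative, not production code."""

    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures    # consecutive failures before opening
        self.reset_timeout = reset_timeout  # seconds to stay open
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None           # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0                   # success closes the circuit again
        return result
```

The caller wraps each remote call in `breaker.call(...)` and catches the fast failure to serve a fallback (cached data, a default response) rather than hanging on a dead dependency.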
Both approaches are possible; each have their pros and cons, challenges and costs.
You can make friends with the cloud. It requires thought, though. An application redesign has a lot of up-front costs but can lead to long-term benefits. A lift-and-shift approach is simpler but can bring in a lot of legacy processes and procedures (“ugh, another broken DR failover!”).
Whatever you do, though, you can’t ignore resiliency and reliability in the cloud; you don’t get it “out of the box”. Done properly, however, the cloud can start to work for you, and you’ll be best of friends and never want to work with traditional methods ever again!