The Importance of Provider Redundancy

Back in the days before cloud computing, it used to be accepted (if somewhat resented) by management types that having redundant systems in place was important if you cared - even a little - about uptime.

In today's industry, those same management types generally understand that it's still important to have multi-region availability, with instances running in completely distinct provider regions, so that an outage in one area doesn't impact your ability to do business.

What doesn't seem to be quite so widely understood, or accepted, though, is the importance of ensuring that systems have redundancy across providers. It's not just management types who are making this mistake either: we've all encountered techies who are seemingly blind to the risk and view it as an unnecessary additional cost/hassle.

Rather than typing "the provider" throughout this post, I'm going to pick on AWS, but the argument applies to all Cloud providers.

 

The Argument

When you raise the subject of provider redundancy, you'll often receive a reply that it's not required, with at least one of the following points advanced:

  • AWS have five-nines (99.999%) uptime
  • We're running in multiple regions
  • AWS are a much bigger provider than us, so they're less likely to have an outage than we are
  • It'd be expensive to run on another provider too

These arguments, on the face of it, are not without merit, but realistically they're a symptom of just how good the marketing of cloud infra has been and, as we'll see, are not as strong as they may first seem.

The last isn't really a valid argument when made in isolation. It wasn't valid in the self-hosting days, and it isn't valid now unless it's supported by one of the other reasons to "show" why having a redundant supplier is unnecessary. It's only a waste of money if it's unnecessary, after all.

 

Flawed Arguments

The real issue with most of those arguments, though, is that they focus on the technical to the exclusion of all else.

That's really not a good way to make business decisions, as it opens you to an inordinate amount of risk from aspects that you've excluded by having too narrow a focus.

Every single one of those arguments can be rendered irrelevant by account-level issues:

  • Billing issue (or mistake) at AWS leads to account suspension
  • Instance/account compromise (including by disgruntled insiders) leads to account suspension
  • Account compromise leads to instance deletion
  • Provider screw-ups

If any of those happens, it doesn't really matter to you whether AWS achieves even nine nines uptime: you are now having an outage.
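To put the "nines" into perspective, the sketch below converts an availability figure into the downtime it permits per year, and then shows how much of that allowance a single account-level outage would consume. The function names and the 12 hour outage length are illustrative assumptions, and a 365 day year is assumed.

```python
# Rough sketch: how much downtime does "N nines" of availability permit?
# Assumes a 365 day year; the function names are illustrative only.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def allowed_downtime_minutes(nines: int) -> float:
    """Minutes of downtime per year permitted by N nines of availability."""
    unavailability = 10 ** -nines  # five nines (99.999%) -> 0.00001
    return MINUTES_PER_YEAR * unavailability


def years_of_allowance_used(outage_hours: float, nines: int) -> float:
    """How many years' worth of downtime allowance one outage consumes."""
    return (outage_hours * 60) / allowed_downtime_minutes(nines)


five_nines = allowed_downtime_minutes(5)
print(f"Five nines allows {five_nines:.3f} minutes of downtime per year")

# Hypothetical 12 hour account suspension, measured against five nines
used = years_of_allowance_used(outage_hours=12, nines=5)
print(f"A 12 hour suspension uses about {used:.0f} years of that allowance")
```

The platform's published uptime figure says nothing about this kind of outage, because from the provider's point of view the platform never went down.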

Account Compromise/suspension

Account suspension as a result of compromise will likely happen without any advance notice, and may not be something that's resolved easily.

There tends to be a temptation to assume something along the lines of "we're a big customer, and have an account manager so they wouldn't just suspend us like that".

This is also a poor argument against provider redundancy, as it ignores various possibilities. Imagine one of your instances gets compromised and starts serving child abuse imagery, or some other illegal content, and AWS receive a call from the police to report it. Odds are, your entire account is getting suspended while it's investigated.

The argument also ignores the level of automation at most providers: automation that may very well suspend an entire account without human intervention.

Provider Screw-Ups

The risk of your provider screwing up is also often more real than any of us like to admit. It's easy to claim that because they're much bigger than "us", it's unlikely they'll make mistakes, but this ignores that the risk of mistakes is still about the same once you get down to an individual level.

Sometimes those mistakes are small, but hard to fix quickly. Google's post-mortem on its recent 4 hour outage is a good example of this: a small mistake was made, deploying a config update to the wrong boxes, but the result of that mistake actively prevented engineers from connecting in to rectify the issue. Google aren't the only provider to have done this either: AWS had a massive outage in 2017 as a result of a mistake made whilst debugging billing. Azure isn't excluded either, having had a 3 hour outage following a DNS misconfiguration.

All this should, hopefully, show that each of the "technical" arguments for why an outage couldn't happen boils down to nothing stronger than "it couldn't possibly happen to us".

 

The Costs of Being Wrong

Ultimately, the issue of whether Provider Redundancy is required needs to be treated as a full risk-assessment rather than as a simple technical argument.

What is the cost to your business, per minute, of downtime, and how does that compare to the cost of using a redundant provider?

In the Digital Ocean example given above, Raisup were locked out for 12 hours. So that's no revenue for 12 hours, along with the loss of face with their customers (who, presumably, responded by asking why redundancy wasn't in place).

The Digital Ocean example is a fairly poor example of customer service, but that's a 12 hour outage for something where a script took an action, a human checked it and reversed it, and then the script re-implemented the block.

The means of calculating the cost of downtime are fairly well established, and there are even online calculators to help reach a figure. Essentially, they boil down to calculating three forms of loss: revenue loss, productivity loss and additional loss.

As an example, take the following values for an online shop:

  • Revenue: £1,000,000 per year
  • Revenue generated by services: 100% (if the site's down, no-one can buy)
  • 40 employees affected
  • Employees paid just over UK minimum wage (£9.00 per hour)
  • Most employees are in the warehouse shipping orders, so impact to productivity is 80%

We can calculate the loss as follows.

 

Productivity loss

Productivity loss is the cost of having employees who now cannot do their jobs, but must still be paid for the time they're at work.

It can be calculated using the following formula

E x e% x P x H

Where

  • E = number of employees affected
  • e% = Percentage they're affected
  • P = Hourly wage
  • H = Number of downtime hours

So, for our example shop the productivity loss for 1 hour can be calculated as

40 x 80% x 9 x 1 = £288
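The productivity formula above can be sketched as a small function. This is a minimal illustration using the example shop's figures. The function and parameter names are assumptions made for the sketch, and the percentage is passed as a fraction (0.80 for 80%).

```python
# Minimal sketch of the productivity loss formula: E x e% x P x H
# Names are illustrative; the percentage is expressed as a fraction.


def productivity_loss(employees: int, impact: float,
                      hourly_wage: float, hours: float) -> float:
    """Cost of paying affected employees who cannot do their jobs."""
    return employees * impact * hourly_wage * hours


# Example shop: 40 employees, 80% affected, £9.00 per hour, 1 hour outage
loss = productivity_loss(employees=40, impact=0.80, hourly_wage=9.00, hours=1)
print(f"Productivity loss: £{loss:.2f}")  # Productivity loss: £288.00
```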

 

Revenue loss

Revenue loss is the loss of revenue caused by the outage. This might take the form of being unable to take orders through your webshop, or of hours that you need to refund to existing customers because they couldn't use their services.

It can be calculated with the formula

(R / T) x i% x H

Where

  • R = Total gross yearly revenue
  • T = Total annual business hours (for a webshop, this will probably be a full year's hours - 8760)
  • i% = Percentage impact on revenue
  • H = Number of downtime hours

So, for our example shop the revenue loss for 1 hour can be calculated as

(1,000,000 / 8760) x 100% x 1 = £114.16
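The revenue formula can be sketched the same way. As before, the function and parameter names are assumptions made for illustration. Note that i% has to enter the calculation as a fraction (1.0 for 100%, not 100), which puts the example shop's loss at roughly £114 per hour of downtime.

```python
# Minimal sketch of the revenue loss formula: (R / T) x i% x H
# Names are illustrative; the percentage is expressed as a fraction.

HOURS_PER_YEAR = 8760  # a webshop trades around the clock


def revenue_loss(yearly_revenue: float, business_hours: float,
                 impact: float, hours: float) -> float:
    """Revenue that could not be earned while the service was down."""
    return (yearly_revenue / business_hours) * impact * hours


# Example shop: £1,000,000 per year, 100% of revenue affected, 1 hour outage
loss = revenue_loss(yearly_revenue=1_000_000, business_hours=HOURS_PER_YEAR,
                    impact=1.0, hours=1)
print(f"Revenue loss: £{loss:.2f}")  # Revenue loss: £114.16
```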

 

Additional Loss

Additional losses can be a little harder to fully identify ahead of time. There will almost always be some additional loss, but not all of it will be predictable.

For example, will your employees need to work overtime in order to help bring things back into service once AWS have resolved whatever issues exist at their end? Will you miss some form of contractual delivery deadline, and have to pay penalties as a result? Will the downtime affect your supply chain? Was the outage big enough that you need to hire a PR company to help manage the ensuing fall-out?

In the example of our online shop, perhaps a delivery at the warehouse couldn't be checked in, so the staff now need to work overtime to check it in - or worse, perhaps that delivery had to be turned away?
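Putting the three forms of loss together gives a figure that can be set against the cost of a redundant provider. The sketch below is only an illustration under stated assumptions: the function names are invented, the additional loss is a value you would have to estimate yourself (it defaults to zero here), and the 12 hour outage length is chosen to match the Digital Ocean lock-out mentioned earlier.

```python
# Rough sketch: total cost of an outage for the example online shop.
# Names are illustrative; additional_loss is an estimate you supply.


def productivity_loss(employees, impact, hourly_wage, hours):
    """E x e% x P x H, with the percentage as a fraction."""
    return employees * impact * hourly_wage * hours


def revenue_loss(yearly_revenue, business_hours, impact, hours):
    """(R / T) x i% x H, with the percentage as a fraction."""
    return (yearly_revenue / business_hours) * impact * hours


def total_downtime_cost(hours: float, additional_loss: float = 0.0) -> float:
    """Sum of the three forms of loss for the example shop."""
    productivity = productivity_loss(employees=40, impact=0.80,
                                     hourly_wage=9.00, hours=hours)
    revenue = revenue_loss(yearly_revenue=1_000_000, business_hours=8760,
                           impact=1.0, hours=hours)
    return productivity + revenue + additional_loss


# A 12 hour lock-out, before any overtime, penalties or PR costs
print(f"12 hour outage: £{total_downtime_cost(12):.2f}")
```

Any overtime, contractual penalties or PR costs would be passed in as additional_loss and added on top of that figure.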

 

Single Provider Dependency can be doubly harmful

Not requiring provider redundancy can actually harm you even when there isn't an outage.

If your services run on AWS, it's more than possible that dependencies on AWS-only features may creep into your product over time. All it takes is a few API calls to creep in, and migration to a new provider becomes that much more difficult.

There might be various reasons that you decide you need to run your services on another provider. It may be that you've just had an outage, and now wish you'd paid heed to this post sooner, or it might be that AWS have raised prices and you feel another platform is that much more competitive.

Whatever the reason for wanting to migrate though, if single vendor dependencies have crept into your systems, it introduces additional cost and delay into migrations. It is, in principle, no different to any refactoring of code.

In a scorched earth scenario where AWS has locked you out and gone unresponsive, you'd be delayed from deploying anew on a new platform because you'd first need to find and rectify every AWS specific dependency.

Requiring that you run services on a redundant provider helps to avoid this, as it requires that services be written in a provider agnostic manner from the outset.
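One way of writing services in a provider agnostic manner is to have application code depend on a small interface of your own, with each provider's specifics kept behind it. The sketch below is purely illustrative: ObjectStore, InMemoryStore, ReplicatedStore and save_invoice are hypothetical names, and the in-memory class simply stands in for whatever provider-specific adapters you would actually write.

```python
# Illustrative sketch of a provider agnostic storage interface.
# All names here are hypothetical; InMemoryStore stands in for a real
# provider-specific adapter (one per provider).
from typing import Protocol


class ObjectStore(Protocol):
    """The only storage operations application code may rely on."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Stand-in adapter; a real one would wrap a provider's own SDK."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


class ReplicatedStore:
    """Writes to both providers, reads from the primary with a fallback."""

    def __init__(self, primary: ObjectStore, secondary: ObjectStore) -> None:
        self._primary = primary
        self._secondary = secondary

    def put(self, key: str, data: bytes) -> None:
        self._primary.put(key, data)
        self._secondary.put(key, data)

    def get(self, key: str) -> bytes:
        try:
            return self._primary.get(key)
        except Exception:
            # Primary unavailable (or object missing): use the other provider
            return self._secondary.get(key)


def save_invoice(store: ObjectStore, invoice_id: str, pdf: bytes) -> str:
    """Application code: knows about ObjectStore, not about any provider."""
    key = f"invoices/{invoice_id}.pdf"
    store.put(key, pdf)
    return key


provider_a, provider_b = InMemoryStore(), InMemoryStore()
store = ReplicatedStore(provider_a, provider_b)
key = save_invoice(store, "2019-001", b"%PDF...")
print(provider_b.get(key) == b"%PDF...")  # True: second provider has a copy
```

Because save_invoice only ever sees ObjectStore, a provider-only API call can't creep into it unnoticed: anything provider-specific has to be implemented once per adapter.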

 

Conclusion

The cloud providers have done such a fantastic job of marketing their resiliency and uptimes that it's not uncommon to find even techies who think that simply deploying into multiple regions is sufficient.

Even very recent history of the cloud services, though, shows us that this is a fallacy. The technical uptime stats that these platforms achieve are impressive, but there are still multiple opportunities for you to have an outage, even if the cloud platform itself doesn't.

The largest providers (AWS, Google and Azure) have all accidentally knocked themselves offline for hours in the past, and will again.

Any argument that they won't is akin to doing nothing on the basis that "it could never happen to me". It's simply wishful thinking, and should have no place in any mature business decision.

Accepting that you require provider redundancy brings other benefits too, as it means that your solutions become vendor agnostic, potentially giving you greater flexibility in changing provider for commercial reasons.

Ultimately, there's always a possibility that a proper risk-assessment will show that having a redundant provider may not make sense for your business. That's OK too, but it's a decision that should always be the result of a proper investigation into the risks. If you do go that route though, make sure you store your backups outside of that provider (see earlier Digital Ocean example).