Bottlenecks and SPOFs

Changes, consequences, resiliency

If you’ve ever built an enterprise-level system you’ll be aware of the need for performance and resiliency. You may do performance testing on your application; you may have a backup server in a second data center; you may even do regular “Disaster Recovery tests”.

And yet, despite all these efforts, your application fails in unexpected ways or isn’t as resilient as you planned. Your primary server dies and the DR server doesn’t work properly. Why is this?

Performance tuning and hidden bottlenecks

A few weeks back I upgraded my home internet connection to gigabit speeds. The next day I found that my primary VM felt slow; commands were taking a noticeable amount of time to run when they shouldn’t (e.g. a simple ls). Investigation showed that my machine (a quad core i5) was maxed out, and a single process, rclone, was responsible. My offsite backup was running (copying data to Amazon), and I encrypt my data while sending it. Having removed the network bottleneck, I was now hitting a CPU bottleneck instead.
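As a rough illustration of why the bottleneck moved, a pipeline can only run as fast as its slowest stage. The sketch below uses made-up throughput figures (the stage names and MB/s numbers are assumptions for illustration, not measurements from my machine):

```python
def bottleneck(stages):
    """Return the (name, rate) pair of the slowest stage.

    `stages` maps a stage name to its sustainable throughput in MB/s.
    The pipeline as a whole cannot go faster than this stage.
    """
    return min(stages.items(), key=lambda item: item[1])


# Hypothetical figures: before the upgrade the network was the limit.
before = {"disk_read": 200.0, "encrypt_cpu": 60.0, "network": 10.0}

# After the upgrade only the network rate changes.
after = {**before, "network": 110.0}

print(bottleneck(before))
print(bottleneck(after))
```

With these numbers the limiting stage moves from the network to the encryption step, which is the pattern I saw: fixing one bottleneck exposes the next one.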

Many applications have unexpected hidden bottlenecks as well. A common problem is when a service switches from a production data center to a secondary; this second data center may have higher network latency to its customers and this can impact performance in odd ways (use Perl’s DBD::Oracle to select a million rows from a database 10ms away vs 30ms away and spot the difference!). Processing that normally takes most of a night may now fail to complete before the next work day; reporting deadlines could be missed, and this could have regulatory implications and fines in some industries.
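The arithmetic behind that million-row example is worth spelling out. This is a simplified model, assuming the quoted latency is the round-trip time and that each round trip returns a fixed number of rows; the batch sizes are hypothetical and this does not model how any particular driver actually prefetches:

```python
import math


def fetch_seconds(rows, rows_per_round_trip, latency_ms):
    """Estimate time spent purely waiting on the network for a fetch.

    Ignores query execution and transfer time; only counts round trips.
    """
    round_trips = math.ceil(rows / rows_per_round_trip)
    return round_trips * latency_ms / 1000.0


ROWS = 1_000_000

# Worst case: one row per round trip.
print(fetch_seconds(ROWS, 1, 10))    # 10000.0 seconds, about 2.8 hours
print(fetch_seconds(ROWS, 1, 30))    # 30000.0 seconds, about 8.3 hours

# Hypothetical batch of 500 rows per round trip.
print(fetch_seconds(ROWS, 500, 10))  # 20.0 seconds
print(fetch_seconds(ROWS, 500, 30))  # 60.0 seconds
```

Either way the wait triples when the latency triples; whether that matters depends on how chatty the application is and how close the job already runs to its deadline.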

This means that having a “DR” plan may not be sufficient. You should also have a Sustained Resiliency plan; don’t just verify you can shut down your production database and spin it up in DR; keep it up and running for a week or two to verify that downstream dependent applications keep working and can meet their end-of-day and end-of-week deadlines.

Single Points of Failure

As part of our resilient design we also make choices on what services can be made highly available (e.g. a cluster) vs redundant (e.g. warm standby environment in a secondary location). We may also accept a single point of failure, based on risk factors and cost.

For example, at home if my router died then I could replace it with an OpenWrt based system, or even a desktop with 2 ethernet cards. It wouldn’t be as performant, but it would work on a temporary basis. Similarly if my main machine died I have a spare that I could get working (it would be messy and the case wouldn’t close…). But… I only have Verizon FIOS. If that dies then I’m off the net (my friend hit this problem recently; his ONT failed). Now I could also get Cablevision as a secondary internet provider, but this isn’t worth the monthly cost for me. So I live with the SPOF. The server this site is served from, though, does have a copy at a second provider - Linode and Panix.

Another thing that may show up in a real DR scenario (as opposed to a planned one) is an unexpected dependency. Many DR tests are very fake in nature; they have a planned sequence (close down production database, sync to DR, bring up DR database, test application, switch back again). This is all clean and designed to minimise disruption to the production environment, but it’s not really representative of a real failure mode. What happens if there’s a datacenter outage? This isn’t planned; you won’t be working with a clean replicated database, and you may have data loss or require a database recovery, either of which can impact recovery times.

Or you might find you have an unexpected SPOF dependency; is your DNS managed from the primary datacenter so you can’t modify service entries to point to your secondary? Are all writes sent to a server in the primary location? Is your authentication service mastered there?
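One way to start asking these questions systematically is to keep an inventory of where each dependency can actually be served or changed from, and flag anything that lives in only one place. A minimal sketch, assuming a hand-maintained inventory (the dependency and site names here are hypothetical):

```python
def single_site_dependencies(inventory):
    """Return the sorted names of dependencies available at fewer than two sites.

    `inventory` maps a dependency name to the set of sites from which
    it can be served or modified.
    """
    return sorted(name for name, sites in inventory.items() if len(sites) < 2)


# Hypothetical inventory for illustration only.
inventory = {
    "dns_management": {"primary"},
    "database_writes": {"primary"},
    "authentication": {"primary", "secondary"},
}

print(single_site_dependencies(inventory))
# ['database_writes', 'dns_management']
```

The hard part is not the code but keeping the inventory honest; a real outage tends to find the entries you forgot to list.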

Amazon hit this in early 2017; they found they had a massive dependency on S3, and this dependency impacted not only a great many other applications but also their own status site! They couldn’t update the status page to tell people about the outage because it depended on S3. Oops!

Conclusion

You might think that modern 12-factor apps would be more immune to these failures, but you’d be wrong. You still need to think about the underlying infrastructure (“the cloud is just someone else’s computer”); deploy to multiple regions, ensure your data is properly replicated, have a sustained resiliency program, and test worst case scenarios. Even if your original architecture was good, implementations mutate and grow beyond the initial design criteria. What may have worked two years ago may not work now.

Be aware of hidden SPOFs (could someone DDoS your DNS servers? What would happen if your domain was hijacked?).

The cloud doesn’t solve your HA/DR/SR requirements; it just provides tools to assist in building resilient solutions.