Meltdown and Spectre

Don't panic! Don't panic!

Unless you’ve been living under a rock, you’ve probably heard of two CPU security bugs, known as Meltdown and Spectre. People are panicking about them because they are CPU-level issues that may impact almost every modern CPU around. Meltdown is Intel specific, but Spectre affects Intel, AMD, and potentially others (Red Hat claims POWER and zSeries are impacted).

What is the problem?

In short, modern CPUs may execute instructions out of order, especially when the order doesn’t matter. If you have

a:=1
b:=1

then does it matter which way round they run?

Now modern CPUs also speculate about branch instructions and may pre-emptively execute the branch they predict will be followed. If the guess about the branch is correct then the out-of-order execution results are valid and your code runs faster. If the guess is wrong then, “no harm, no foul”, the speculated results are thrown away and the right branch is run.

Except it turns out there is a foul. The speculatively executed branch leaves traces in the CPU cache, and it’s possible to use this fact to derive information about memory locations you shouldn’t normally have access to.

The result is an information leak. Inside a single VM one process could read kernel memory data. Want to read secrets out of memory? It’s theoretically doable. Worse, on a virtual server one VM could read memory from another VM. Oh dear, do you trust everyone sharing the Amazon Cloud server that your EC2 VM is running on?

Why are people panicking?

CPUs have bugs all the time. Most of them can be patched at boot time with a microcode update. But in this case the problem is at a lower level; it’s inherent in how speculation interacts with the cache. The Meltdown issue can be mitigated at the OS layer by modifying how the kernel handles virtual memory. This mitigation isn’t free; it has an estimated 5% to 30% performance impact. That sounds like a lot! The 30% figure, though, is almost a worst-case scenario (in testing it required a specific pattern of reads to an SSD). More commonly the impact is around 5%, and if your application is compute intensive there may be almost no impact (I saw some gaming desktop testing where the patch’s impact was smaller than the measurement error). But 30% is the headline number.
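On Linux you can check whether a given kernel carries these mitigations. One caveat, stated as an assumption: the sysfs path below only exists on kernels new enough to include the status-reporting patches, so its absence does not by itself mean you are unpatched.

```shell
# Patched kernels with the reporting patches expose mitigation status here
# (the directory is simply absent on older kernels):
grep . /sys/devices/system/cpu/vulnerabilities/* 2>/dev/null

# The Meltdown mitigation (kernel page-table isolation) also logs at boot;
# dmesg may require root on some systems:
dmesg | grep -i 'page tables isolation'
```

If neither reports anything, check your distribution’s advisory for the kernel version that carries the fix.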

The Spectre bug apparently cannot be fixed through software. We can block known exploitation paths and make attacks harder (it’s already pretty hard to do more than a proof of concept), but we can’t block it totally.

So people are now panicking like mad; do they apply the patches ASAP to prevent exploitation and risk the performance hit, or do they leave servers exposed?

Why am I not panicking?

In essence this is a read-only local privilege escalation bug. Sure, it’s at the hardware layer rather than at the kernel, but the impact is still comparable to that of local privilege escalation.

In comparison, Shellshock was remotely executable and trivial to exploit. That’s scary.

Many places have a ‘one workload per VM’ model. Privilege escalation there isn’t so great a risk. However, hypervisor technology means that one physical server can serve multiple logical VMs, and so multiple workloads. Similarly, container technology allows one VM to handle multiple workloads.

So I look at it from layers:

  1. “Top priority”
    Hypervisors should be patched. This protects one VM from another. A cloud service provider should treat this as critical, because they host untrusted workloads (indeed Amazon and Microsoft have already patched and rebooted a majority of their servers; operations at scale!). Inside your enterprise your VMware/KVM/Xen/Openstack/… environments should be patched. Any host running containers (e.g. Docker) should be patched. Any on-premise cloudy environment (e.g. Cloud Foundry, Apprenda) should be patched.
    Basically any environment with “neighbours” should be treated as top priority, so one neighbour cannot impact another.

  2. “High”
    Server OSes should be patched, of course, but the urgency is less.

Risk management

Of course you don’t want to leave known vulnerabilities in your environment, but the whole point of measured risk management is to understand the impact of a vulnerability on your organisation and prepare an appropriate response. Ask yourself questions about the ease, likelihood, and consequences of exploitation. At present this appears to be a local read-only privilege escalation issue; what is the impact?

Can remote attacks trigger the bug? Well, with Spectre there’s a proof-of-concept JavaScript attack, so make sure your desktop browsers are up to date. Do you have code that may dynamically build eBPF filters into the kernel? If so, ensure the inputs are trusted (the right thing to do anyway).

Your organisation should already have vulnerability management processes to handle new vulnerabilities. Treat these the same way. It doesn’t matter that it’s a hardware issue; it’s just another issue. Follow your patch guidelines, follow your processes. Don’t panic.

Unless you’re a cloud vendor, of course. Then panic and patch immediately! You don’t want one customer stealing data from another. Your risk analysis is different from that of a traditional enterprise!

Summary

Bugs are bugs, whether it’s a hardware bug, a kernel bug, or an application bug. Follow your processes.

What’s interesting is that this is the next step in a class of attacks at the hardware level. Rowhammer showed that memory could be made to do things outside the knowledge of the OS. Meltdown and Spectre show that the CPU, itself, can be abused in a similar way. What will be next? What core assumption are we building on that is shaky, and that might bring the whole house of cards known as “secure computing” tumbling down?