An analysis of some corner-case performance issues with Meltdown patches

Meltdown, Spectre and Linux on AWS: Security vs. performance

Colin Panisset

The Meltdown and Spectre story continues. Even though Amazon Web Services’ (AWS) response shows that they’ve already patched and protected their infrastructure, you still have work to do. AWS’ Shared Responsibility Model means that you are responsible for patching the operating system running on your EC2 instances, and this is where things get … complicated.

If you want the TL;DR from all this, here are a few general rules to follow:

  1. Run your EC2 instances from the most recent AMI you can, and make sure it uses the HVM virtualisation mode
  2. Patch your operating systems to make sure you have the Meltdown fixes applied
  3. Update to more recent EC2 instance families
  4. Run the latest Linux kernel you can to ensure you have PCID support

Problems at the lowest level

Let’s start off with some basics. The bugs allow malicious processes to steal information that would normally be protected, such as passwords, credit card numbers, and so forth, while that data is being processed by the CPU. Meltdown affects nearly every Intel CPU built over the past two decades, not just the most recent generations. This is due to flaws in the CPU itself and has nothing to do with Windows, Linux, macOS, or any other operating system. The CPU cannot be patched – it’s hardware – and so we must rely on fixes to the systems that run on top of those CPUs.

Fixes applied one level higher

There are generally two classes of system that run directly on a CPU: an operating system, like Linux or Windows; or a hypervisor, like VMware ESXi, Xen, or Amazon’s KVM-based proprietary hypervisor.

If a hypervisor is run on the CPU, it hosts other operating systems (like Linux and Windows).

Applying patches to this first layer can protect against both Spectre and Meltdown attacks, with varying degrees of performance impact.

Virtual machines running on top of the hypervisor still need to be patched in order to protect processes running within their operating systems from exploits. These patches can carry a performance cost of their own.

Well-known performance impacts

Intel expects the fixes for these vulnerabilities to impose performance impacts of around 6% (see References); independent testing on Linux systems has measured impacts of 5-30%, depending on the workload; Microsoft acknowledges that there will be performance impacts but is being cagey about actual numbers (see References).
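To put those percentages in capacity terms, here is a rough back-of-the-envelope sketch. It assumes that an "impact" of p means each instance loses a fraction p of its throughput, and that load spreads evenly across a fleet; neither assumption comes from the vendors' figures, and the function name and the fleet size of 100 are purely illustrative.

```python
import math


def instances_needed(current_instances: int, slowdown: float) -> int:
    """Instances required to keep total throughput constant after each
    instance loses `slowdown` (a fraction, e.g. 0.30 for 30%) of its
    throughput. Assumes load is spread evenly across the fleet."""
    if not 0 <= slowdown < 1:
        raise ValueError("slowdown must be in the range [0, 1)")
    return math.ceil(current_instances / (1 - slowdown))


if __name__ == "__main__":
    # Hypothetical fleet of 100 instances at the figures quoted above.
    for slowdown in (0.05, 0.06, 0.30):
        needed = instances_needed(100, slowdown)
        print(f"{slowdown:.0%} slowdown: {needed} instances instead of 100")
```

Note that the extra capacity grows faster than the slowdown itself: a 30% loss per instance needs roughly 43% more instances, not 30% more.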

On Linux, patches against Meltdown implement a feature called “Kernel Page Table Isolation” (KPTI), which imposes a performance cost whenever a user-land process executes a system call, transferring control from the application code into the kernel (for example, whenever data needs to be read from or written to a disk, or whenever network communication happens).
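If you want a feel for what a system call costs on a given machine, a crude micro-benchmark like the sketch below can help. It is only an illustration: the helper names are made up for this example, Python's interpreter overhead dominates both numbers, and the difference between the two figures is a rough indicator at best. Running it on the same instance before and after patching is more meaningful than any single reading.

```python
import os
import time


def per_call_ns(func, calls: int = 200_000) -> float:
    """Average wall-clock cost of one func() call, in nanoseconds."""
    start = time.perf_counter_ns()
    for _ in range(calls):
        func()
    return (time.perf_counter_ns() - start) / calls


def no_syscall() -> None:
    """Baseline: a Python call that never leaves user space."""
    return None


if __name__ == "__main__":
    baseline = per_call_ns(no_syscall)
    # os.getppid() asks the kernel for the parent PID, so each call
    # crosses the user/kernel boundary that KPTI makes more expensive.
    with_syscall = per_call_ns(os.getppid)
    print(f"user-space call : {baseline:8.1f} ns")
    print(f"getppid syscall : {with_syscall:8.1f} ns")
    print(f"approx. syscall overhead: {with_syscall - baseline:8.1f} ns")
```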

These performance impacts depend on exactly what kind of work an application does, based on how often these system calls need to be executed, but in general, the performance penalty should be restricted to that application and not affect other processes on the same system.

Right?

Well, not quite.

An obscure feature becomes critical

Intel CPUs since 2010 (codename “Westmere”) have supported a feature called PCID (Process-Context Identifier) which, for the past 7 years, has been fairly boring and unsupported by Linux kernels, because it didn’t really do much for performance or security. Linux has supported it since kernel 4.14, though that support was added more for completeness and a minor capability improvement than as a critical feature.

It turns out that PCID is important in alleviating some of the performance impacts of the KPTI patches, and in preventing one application from killing system performance for all other applications. You see, the CPU maintains a Translation Lookaside Buffer (TLB), which is a cache of recently used mappings between virtual and physical memory pages. With KPTI, the kernel switches page tables every time a system call crosses the userland/kernel boundary; kernels running on processors without PCID support must throw away the TLB at each switch and start again, increasing the amount of time it takes to execute frequent operations.
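On an x86 Linux guest you can check whether the kernel actually sees PCID by looking for the `pcid` flag in `/proc/cpuinfo`. The sketch below does that, and also reads the Meltdown entry that patched kernels publish under `/sys/devices/system/cpu/vulnerabilities/`. The function names are made up for this example, and older or unpatched kernels may not have that sysfs file at all, in which case the status is reported as "unknown".

```python
from pathlib import Path


def cpu_flags(cpuinfo_text: str) -> set:
    """Return the feature flags from the first 'flags' line of
    /proc/cpuinfo-style text (x86 layout assumed)."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def meltdown_status(
    path: str = "/sys/devices/system/cpu/vulnerabilities/meltdown",
) -> str:
    """The kernel's own Meltdown report, or 'unknown' if it is absent."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return "unknown"


if __name__ == "__main__":
    try:
        flags = cpu_flags(Path("/proc/cpuinfo").read_text())
    except OSError:
        flags = set()  # not Linux, or /proc is unavailable
    print("PCID visible to this kernel   :", "pcid" in flags)
    print("INVPCID visible to this kernel:", "invpcid" in flags)
    print("Meltdown status               :", meltdown_status())
```

A guest only sees the flag if the hypervisor passes it through, so the result describes the instance you run it on, not the physical CPU underneath.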

But just because all modern CPUs and Linux kernels support this feature doesn’t mean that you can use it on AWS.

HVM, PV and instance families, oh my!

AWS’ original EC2 instances all ran on top of a hypervisor which provided paravirtualised (“PV”) interfaces to the guest operating systems. This hides some of the features of the underlying CPU, including the PCID capability.

More recent instance families (along with some of the older ones) run on a newer hypervisor which exposes more of the underlying capabilities; this virtualisation mode is called “HVM”, which stands for “Hardware Virtual Machine”.

Although almost all EC2 instance families (like t2, m3, c4) are available in the HVM mode, they don’t all actually expose the PCID feature, which you need in order to avoid the worst performance penalties.

Help me, Obi-Wan!

Luckily for you, we’ve done some research and mapped the EC2 instance families against virtualisation modes and CPU features to tell which combinations are least affected. The following table shows what’s what:

You can see, at a glance, that no PV instance types provide PCID – avoid these, to avoid the worst performance impacts.

You can also see that even if you choose HVM as your virtualisation type, some instance families still don’t expose the PCID feature – you should avoid these as well.
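A first pass over your own fleet could look something like the sketch below, which picks out PV instances from data shaped like the `Reservations` list that boto3's EC2 `describe_instances` call returns. As far as I know, that call reports each instance's `VirtualizationType` as either `hvm` or `paravirtual`. The function name, instance IDs and sample data here are invented for illustration. Remember that an `hvm` result does not by itself guarantee PCID; that still has to be checked per instance family or from inside the guest.

```python
def paravirtual_instances(reservations):
    """Return (instance_id, instance_type) for every PV instance found in
    a describe_instances-style list of reservations."""
    found = []
    for reservation in reservations:
        for instance in reservation.get("Instances", []):
            if instance.get("VirtualizationType") == "paravirtual":
                found.append((instance["InstanceId"], instance["InstanceType"]))
    return found


# Hypothetical sample data; a real run would pass in the "Reservations"
# value from an EC2 describe_instances response instead.
sample = [
    {
        "Instances": [
            {"InstanceId": "i-0aaa", "InstanceType": "m1.small",
             "VirtualizationType": "paravirtual"},
            {"InstanceId": "i-0bbb", "InstanceType": "c4.large",
             "VirtualizationType": "hvm"},
        ]
    }
]

if __name__ == "__main__":
    for instance_id, instance_type in paravirtual_instances(sample):
        print(f"{instance_id} ({instance_type}) is PV: no PCID available")
```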

Frustratingly, the hs1 (or “high storage”) instance type is perhaps worst affected. It’s most commonly used for workloads that make a lot of disk I/O system calls, so it incurs the most overhead from the Meltdown patches, and it doesn’t support PCID, meaning that you’re losing out twice over.

This is all very confusing …

I’ve tried to keep a balance between enough technical information and, where possible, useful simplifications.

If you’d like some assistance working through this mess, please contact us and we’d be happy to see what we can do to help out.

References

No good article is complete without references, right?

  1. Intel Offers Security Issue Update
  2. Understanding the performance impact of Spectre and Meltdown mitigations on Windows Systems
  3. Meltdown and Spectre
  4. Kernel Page-Table Isolation
  5. EC2 instance types
  6. PCID is now a critical performance/security feature on x86

 

This article was originally published on Cevo Australia

Author

Colin Panisset

Colin is the director of Transformation and Delivery at Cevo Australia. He brings his history of operational excellence together with a desire to improve delivery practices to drive business value through improving culture, empathy, automation and technical excellence. He has a knack for understanding and explaining complex systems, and has presented at events like AWS Summits, Devops Days, and many meetups.

