We're hiring!
*

Increased performance of emulated NVMe devices

Helen Fornazier avatar

Helen Fornazier
August 23, 2016

Share this post:

Reading time:

Nowadays, in Google Cloud Engine (GCE), it is possible to attach a local SSD with the NVMe interface to your virtual machine. Unfortunately, you only get a good number of iops (input/output operations per second) if you instantiate a machine with nvme-backports-debian-7-wheezy image; other available distributions on GCE will have a lower number of iops.
 
It turns out that Google's Virtual Machine Monitor (aka Hypervisor) implements a custom NVMe command that allows it to increase up to 4 times the number of  iops (note: this is from what I've tested so far, but it seems to be possible to get up to 5 times faster according to the original commit message; check the Technical Details sessions to see how this is possible), however the kernel you use needs to support it and this is not yet the case with the mainline kernel.
 
This is not exclusive to GCE as Google released a patch not only to the kernel  but also to the qemu and is available here.
 
Collabora has been helping update, refactor and review the patches to the Linux Kernel to send it upstream, however since this is not yet an official nvme standard, it shouldn't be merged into Kernel mainline, as its specification may still receive changes.
 
Seeing as it considerably increases performance, the feature is in the process of being discussed and proposed to the NVMe workgroup with Collabora's help.
 
While the nvmexpress.org seems interested in adding an official extension to stardarize it, as published in the mailing list, nothing has been defined yet, as this is a very recent discussion and it can take up to a year to be ratified by the NVMe workgroup.
 
So, for the time being, you can get a more recent version of the patch and install the driver yourself here: https://git.collabora.com/cgit/user/koike/linux.git/log/?h=nvme/dev
 

How it works

Technical details

The NVMe interface basicaly works with command queues. The drive writes a command in a region known to both (driver and device controller) and then updates the tail of the queue, writting to an MMIO register called doorbell.
 
In an environment with several guest OSes on top of a VMM sharing a resource, communication between the guest OS and the real device is usually trapped by the VMM. As an MMIO is usually a syncronous acces to the device, it means that every MMIO access will cause a trap. 

Figure 1. Example of emulated device in the VMM

 

The main idea here is to decrease the number of traps to the VMM by reducing the number of writtes to the doorbells.
This is achieved in two ways:
 
    1) Batching; or
    2) Letting the VMM pull the current doorbell value when it is already in execution.

The first one is easy, we can wait X commands to be written in the queue to ring the doorbell.
 
The second one is a bit more complicated. The guest OS needs to inform the emulated device in the VMM where it can pull the doorbell values, and the emulated NVMe device needs to inform the guest OS that it can restart the counter of X.
 
This is what this new feature does:
 
It adds a new command in the NVMe interface where the driver can send to the NVMe device controller two memory buffers:

1) A buffer where the real doorbell values are: Instead of writting to the MMIO  doorbell, the driver writtes the value in this buffer; and
2) Another buffer with a hint from the controller about how many commands the driver can write in the queue without ringing the doorbell.
 
The exact technical details may still change in the future, especially on how to properly implement the second item above. It is also very likely that Google's patches won't be compliant with the future ratified standard.
 
For the time being though, you can use the Collabora tree. Please let me know if you have any comments/feedback!


Original post

Related Posts

Related Posts

Comments (0)


Add a Comment






Allowed tags: <b><i><br>Add a new comment:


Search the newsroom

Latest Blog Posts

Automatic regression handling and reporting for the Linux Kernel

14/03/2024

In continuation with our series about Kernel Integration we'll go into more detail about how regression detection, processing, and tracking…

Almost a fully open-source boot chain for Rockchip's RK3588!

21/02/2024

Now included in our Debian images & available via our GitLab, you can build a complete, working BL31 (Boot Loader stage 3.1), and replace…

What's the latest with WirePlumber?

19/02/2024

Back in 2022, after a series of issues were found in its design, I made the call to rework some of WirePlumber's fundamentals in order to…

DRM-CI: A GitLab-CI pipeline for Linux kernel testing

08/02/2024

Continuing our Kernel Integration series, we're excited to introduce DRM-CI, a groundbreaking solution that enables developers to test their…

Persian Rug, Part 4 - The limitations of proxies

23/01/2024

This is the fourth and final part in a series on persian-rug, a Rust crate for interconnected objects. We've touched on the two big limitations:…

How to share code between Vulkan and Gallium

16/01/2024

One of the key high-level challenges of building Mesa drivers these days is figuring out how to best share code between a Vulkan driver…

Open Since 2005 logo

We use cookies on this website to ensure that you get the best experience. By continuing to use this website you are consenting to the use of these cookies. To find out more please follow this link.

Collabora Ltd © 2005-2024. All rights reserved. Privacy Notice. Sitemap.