
The futex_waitv() syscall and gaming on Linux

Muhammad Usama Anjum
February 17, 2023

Just over a year has passed since the futex_waitv() syscall, part of the new futex2 system calls, landed in Linux 5.16. But why are both futex2 and futex_waitv needed? What role do they play in the context of gaming on Linux? Let's find out by taking a closer look at the futex syscall.

Futex() syscall internals

Sometimes, tasks compete to get a futex. The tasks that have lost the race and are unable to acquire the futex immediately are put to sleep until the futex key changes. The kernel needs to keep track of which process is waiting on which futex, so it knows who to wake up when a wake operation happens.

The kernel uses a hash table to find the tasks waiting on each futex. Each bucket holds a priority-ordered list storing information about the waiting tasks. The bucket is found in constant time, O(1), and the task is searched for inside the bucket in linear time, O(k), where k is the number of tasks in the bucket. Bob Jenkins' jhash is used as the hashing function.

The hash function maps keys to a bucket index in the hash table. As a futex address is private to a process's address space, the same address can appear in more than one process, so the address of the futex cannot be used directly as the hashing key. Instead, the hash key is calculated from the address of the futex, the offset of the futex address from the page boundary, and the address of the task's memory management structure (for process-private futexes) or an inode pointer (for shared ones). Thus, the key is always unique for each futex.

FUTEX_WAIT Operation

function futex_wait(uaddr, val, ...)

    get_futex_key(uaddr, &key, ...)
    hb = futex_hash(key);
    lock_hash_bucket(hb)

    futex_get_value_locked(&uval, uaddr)
    if uval != val:      // futex word no longer holds the expected value
        unlock_hash_bucket(hb)
        return -EWOULDBLOCK;

    set_current_state(TASK_INTERRUPTIBLE);
    futex_queue(q, hb)
    unlock_hash_bucket(hb)
    freezable_schedule() // sleeps the current task

    set_current_state(TASK_RUNNING) // task has woken up
    futex_unqueue(q)

When the FUTEX_WAIT operation is executed, the kernel calculates the hash of the futex and finds the hash bucket. The bucket lock is acquired, and the futex word is read and compared to the expected value provided by userspace. Taking the bucket lock makes the expectation check and the sleep atomic. If the futex word doesn't hold the expected value, the error -EWOULDBLOCK (the same value as -EAGAIN on Linux) is returned. The purpose of the comparison is to prevent lost wake-ups: if another thread changed the value of the futex word after the calling thread decided to block based on the prior value, or if the other thread executed a FUTEX_WAKE operation after the value change and before this FUTEX_WAIT operation, the calling thread will observe the value change and will not go to sleep.

If the futex word matches the expected value, the task is inserted into the bucket and sleeps, waiting for a FUTEX_WAKE operation from another thread. More than one task can wait on the same futex; each waiter is stored in the same hash bucket, which is an expected collision. Hash collisions can also happen between different futexes, in which case tasks waiting on different futexes end up stored in the same hash bucket.

FUTEX_WAKE Operation

function futex_wake(uaddr, nr_wake, ...)

    n = 0;
    get_futex_key(uaddr, &key, ...)
    hb = futex_hash(key)

    lock_hash_bucket(hb)
    for tasks : hash_bucket
        if key matches: 
            futex_wake_mark(wake_queue, task)
            n++;
        if n >= nr_wake
            break
    unlock_hash_bucket(hb)

    wake_tasks(wake_queue)

The number of tasks to be awoken is specified when calling the FUTEX_WAKE operation. Usually, the caller wakes only as many tasks as can actually acquire the futex, instead of waking every waiter when only one can proceed. This helps to avoid the thundering herd problem. The kernel calculates the key, finds the hash bucket, and wakes up the specified number of tasks waiting on this specific futex from the bucket's list.

Limitations of the futex() syscall

  • The wait can only be done on one futex at a time. This doesn't work for cases where a task wants to wait for any of several objects to become available, i.e. to wake up when one object from a list becomes available. The eventfd() syscall can be used to wait on multiple objects, but it requires several syscall invocations; reducing the number of syscalls can give better performance.
  • The futex syscall interface is complex because it multiplexes many operations into a single entry point. From a maintainability perspective, more features cannot keep being added to the same syscall.
  • Only 32-bit futex words are supported. This helps reduce the memory footprint, especially when many futexes are used, but other sizes, such as 64-bit, would be useful for tracking pointers.
  • The futex implementation isn't NUMA aware. NUMA systems are made up of multiple nodes; access to a node's own CPU cache is fast, but moving cache lines between nodes is expensive, so the global futex hash table becomes costly to access from different nodes.

The eventfd() syscall

The futex syscall has the limitation that a thread can wait on only one futex. What if a process wants to wait on several objects and wake up when any of them becomes free? These semantics can be achieved with the eventfd syscall, which creates an "eventfd file descriptor" that serves as an event wait/notify mechanism for user-space applications, and lets the kernel notify user space of events. This fd can be read, written, and polled, which enables listening on multiple fds at a time.

Polling on several eventfd contexts with semaphore semantics provides similar wait-on-any semantics. However, with enough producers and consumers, the eventfd interface itself becomes a bottleneck, because each thread competes to acquire a sequence of wait-queue locks for each eventfd context in the poll list. The eventfd also operates on file descriptors, a limited resource that has shown its limits in many use cases. Even though the Windows interface never waits on more than 64 objects at once, there can still be many waiters simultaneously, and applications like games can quickly exhaust their file descriptor limits.

The total number of syscalls for this kind of operation can be reduced to get better performance if there is a syscall to wait on multiple objects at the same time. This is why a dedicated futex_waitv syscall has been introduced.

Introduction of futex_waitv syscall

The futex_waitv syscall is a new syscall through which a process can wait on multiple futexes; the task wakes up when any futex in the list is awakened. This can be used to implement waiting on multiple locks, wait lists, and so on, without the limitations imposed by eventfd.

This kind of operation is used in games, where a lot of objects are present: multiple consumers and multiple producers all wait on different subsets of objects. In implementations using it, spin-lock calls decrease by more than 80% in some user applications. Compared to a solution using the eventfd syscall, futex_waitv is able to reduce CPU utilization for some games and even increase frames per second (FPS) for others.

Wine

Every operating system provides an interface, a set of APIs and syscalls, that applications use to get support from the kernel. While many UNIX-like and other operating systems are POSIX compliant, Microsoft Windows isn't: Windows has its own syscalls and libraries. Thus, applications built to run on top of Windows are compatible only with Windows.

Windows has been the default PC gaming platform for decades. The thousands of existing games cannot easily be ported to Linux, and because Windows is proprietary software, Microsoft can change its policies at any time, which could spell disaster for long-running, profitable companies that rely on Windows as their only platform. What if Windows games could run on the free Linux operating system without porting or even recompiling? That would bring thousands of Windows applications to Linux. As Linux is open source and anybody can use it, a complete ecosystem could be created independent of any proprietary operating system.

Fortunately, it is possible: Wine provides a compatibility layer to do exactly that. Many Windows applications can be run easily with some settings and tweaks inside Wine. Wine implements Windows' system layout and provides a registry, libraries, and many other things that are needed. The Wine developers put a lot of effort into discovering the behavior of Windows APIs, as documentation is sometimes unavailable and the actual behavior is sometimes undocumented. Hence, even though Wine started in 1993, it runs many, though not all, Windows applications.

Proton

Valve, Steam's creator, has built an ecosystem where most Windows games not only run on Linux, but run with equal or better performance than on Windows. Proton is a collection of software and libraries, combined with a patched version of Wine, that improves performance and compatibility with Windows games. Proton carries several out-of-tree patches on top of Wine and incorporates several libraries that improve 3D performance. Being a fork of Wine, Proton maintains compatibility with Windows applications very similar to that of its upstream counterpart. In addition to the officially supported list, many other Windows games reportedly work with Proton, albeit unofficially. ProtonDB is an unofficial community website that collects and displays crowdsourced data describing the compatibility of a given game title with Proton.

Through Wine, Windows APIs are emulated in terms of Linux native library calls and syscalls. Sometimes a Windows API implementation on Linux needs help from the kernel to reach good performance, but the kernel lacks support for that particular feature. This is where new code gets written and added to the Linux kernel. One such example is the introduction of the futex_waitv() syscall.

Esync and Fsync

Zebediah Figura, a Wine and Proton developer, has summarized Esync and Fsync very concisely.

Windows has a set of synchronization primitives, essentially all of which are stateful and can be shared between processes. Because these objects can be shared between processes, Wine implements them using a dedicated 'wineserver' process. Each server request is a remote procedure call to the wineserver process wherein data is sent over a socket from the client to the server; the server performs some operation, and then sends a reply back to the client.

High-performance games make heavy use of multiple threads and synchronization primitives. RPC to wineserver is a bottleneck as it is single-threaded and can only service one request at a time. This is why a set of out-of-tree patches for Wine were written that implement these with the eventfd interface, with Windows kernel objects essentially mapping directly to eventfd objects, plus some extra state which lives in shared memory. To a large degree, this results in theoretically equal performance with Windows; most operations which are understood to comprise one system call on Windows now also comprise one system call (to the Linux kernel) on Wine. The actual performance improvement is quite substantial, depending on the application. However, "esync" has its problems. The eventfd doesn't provide interfaces for Windows kernel APIs, and we have to badly emulate them. As a result, some applications simply don't work with it.

Later, a second out-of-tree patch set was developed named "fsync," which uses futexes instead of eventfds. It was developed by request on the grounds that futexes can be faster than eventfds. In practice, there is sometimes a positive performance difference, sometimes a negative one, and often no measurable difference at all; it also sometimes makes performance much less consistent. It shares all of the same problems as "esync," including (to a degree) some inefficiencies when executing contended waits.

Both of these also rely on shared memory being mapped read/write into every application; as a result object state can easily be corrupted by a misbehaving process. These problems have led to its remaining out of tree. Some operations also need more than one system call and could theoretically be improved performance-wise. The developers are thinking and discussing how an upstreamable solution could be developed.

Game benchmarks

Sometimes Esync and Fsync improve the performance such that the number of frames per second (FPS) is improved as well. Gain in FPS is highly desired by gamers. Assassin's Creed Odyssey performs equally well with Esync and Fsync as compared to running without them. Ori and the Blind Forest: Definitive Edition, BioShock and BioShock 2 perform better with Esync. DOOM Eternal, Far Cry New Dawn, Overwatch, Red Dead Redemption 2, The Witcher 3, and World War Z perform better with Fsync.

Comments (6)

  1. Claudio Sampaio (Patola):
    Feb 18, 2023 at 02:42 AM

    Wait, what? You said many great things about fsync and esync, but in the introduction you promised to tell why futex_waitv and futex2 are needed. Are they the equivalent of fsync and esync? Please clarify. Or finish the article!


    1. kernel guy:
      Feb 20, 2023 at 03:56 PM

      They are not equivalent to fsync/esync at all. futex_waitv is a Linux kernel syscall. fsync is a userspace library inside wine/proton that *uses* futex_waitv to implement the emulation of Windows synchronization primitives. futex_waitv was originally designed with fsync in mind, but it has many other use cases outside of gaming and can be used by any linux program.

      esync is an older wine/proton library to solve the same windows emulation problem through another kernel interface called eventfd. esync suffers from high cpu usage and allegedly higher latency, which is the reason fsync was developed using the new kernel interface (futex_waitv).

      futex2 is a generic term related to the attempt of creating a new, more concise interface for futex in the kernel. The only thing that ever came out of it was futex_waitv.


      1. Patola (Cláudio Sampaio):
        Feb 20, 2023 at 04:18 PM

        That explains it, thank you. Especially the part about futex2 being a generic term. Now, if you don't mind me abusing your answer, there are tkg and zen kernels that still carry an fsync patch (described as e.g. v6.2-fsync1_via_futex_waitv.patch) even for 6.x kernels, that already have a futex2 implementation. Do you know the difference, and which one wine-proton uses when we run games? And why would this patch still be needed? I kind of followed the long history of attempts of Collabora to land a futex2 implementation in the kernel, I think the one that worked was the third attempt, am I right? Why weren't the previous ones accepted?


        1. kernel guy:
          Feb 20, 2023 at 04:43 PM

          Old Proton releases relied on this out-of-tree patch to enable fsync on custom kernels even before futex_waitv was merged upstream. Newer Proton builds dropped support for this old interface long ago, and only use the upstream code. One reason these kernels still carry it might be to support older Proton builds. Although, it's unlikely someone still uses these several-years-old Proton builds. Most likely somebody just forgot to remove it from those kernels.

          I don't know how many attempts there were to merge it. Open source development usually requires this kind of back and forth to get a piece of code as good as it can be before merging; it is very normal. Several people, reviewers and developers alike, worked on it to consolidate the interface we have today.




Collabora Ltd © 2005-2024. All rights reserved.