The futex_waitv() syscall and gaming on Linux

The futex_waitv() syscall and gaming on Linux

Muhammad Usama Anjum
February 17, 2023

Share this post:

Reading time:

Just over a year has passed since the futex_waitv() syscall, part of the new futex2 systems calls, landed in Linux 5.16. But why are both futex2 and futex_waitv needed? What role do they play in the context gaming on Linux? Let's find out by taking a closer look at the futex syscall.

Futex() syscall internals

Sometimes, tasks compete to get a futex. The tasks that have lost the race and are unable to acquire the futex immediately are put to sleep until the futex key changes. The kernel needs to keep track of which process is waiting on which futex, so it knows who to wake up when a wake operation happens.

The kernel uses a hash table for finding the tasks waiting on each futex. Each bucket has priority queues to store the information about the task. The bucket is found in constant time O(1) and the task is searched inside the bucket in linear time, O(k) where k is the number of tasks in the bucket. JHash from Bob Jenkins is used as a hashing function.

The hash function maps keys to the bucket index in the hash table. As the futex address is private to the address space of a process, it can be the same for more than one process. Hence, the address of the futex cannot be used directly as the key for hashing. Instead, the hash key is calculated based on the address of the futex, the offset of the futex address from the page boundary, and the address of the task's memory management structure or node pointer. Thus, this key is always unique for each futex.

FUTEX_WAIT Operation

function futex_wait(uaddr, val, ...)

    get_futex_key(uaddr, &key, ...)
    hb = futex_hash(key);
    lock_hash_bucket(hb)

    futex_get_value_locked(&uval, uaddr)
    if uaddr != uval:
        unlock_hash_bucket(hb)
        return -EWOULDBLOCK;

    set_current_state(TASK_INTERRUPTIBLE);
    futex_queue(q, hb)
    unlock_hash_bucket(hb)
    freezable_schedule() // sleeps the current task

    set_current_state(TASK_RUNNING) // task has woken up
    futex_unqueue(q)

When the FUTEX_WAIT operation is executed, the kernel calculates the hash of the futex and finds the hash bucket. The bucket lock is acquired, the futex word is read, and compared to the expected value provided by userspace. The expectation check and sleep are made atomically by taking the bucket lock. If the futex word doesn't have the expected value, the error -EAGAIN is returned. The purpose of the comparison with the expected value is to prevent lost wake-ups. If another thread changed the value of the futex word after the calling thread decided to block based on the prior value, or if the other thread executed a FUTEX_WAKE operation after the value change and before this FUTEX_WAIT operation, then the calling thread will observe the value change and will not go to sleep.

If the futex word matches the expected value, the task is inserted in this bucket and the task sleeps, waiting for the FUTEX_WAKE operation from another thread to happen. More than one task can wait at the same futex. Each futex adds new task information in the same hash bucket, which is an expected collision. Hash collisions can also happen for different futexes resulting in the storing of tasks waiting on them in the same hash bucket.

FUTEX_WAKE Operation

function futex_wake(uaddr, nr_wake, ...)

    n = 0;
    get_futex_key(uaddr, &key, ...)
    hb = futex_hash(key)

    lock_hash_bucket(hb)
    for tasks : hash_bucket
        if key matches: 
            futex_wake_mark(wake_queue, task)
            n++;
        if n >= nr_wake
            break
    unlock_hash_bucket(hb)

    wake_tasks(wake_queue)

The number of tasks to be awoken is specified when calling the FUTEX_WAKE operation. Usually, the number of tasks is specified to wake only the number of processes which can acquire the futex and avoids waking all the processes while only one can acquire the futex. This helps to avoid the thundering heard problem. The kernel calculates the key, finds the hash bucket and wakes up the specified number of tasks waiting on this specific futex from this bucket list.

Limitations of the futex() syscall

The wait can only be done on one futex at a time. This cannot work for the cases where the task wants to wait for any of several objects to become available, i.e. the task wakes up when one of the objects becomes available from the list of the objects. The eventfd() syscall can be used to wait on multiple objects, but syscall invocation happens several times. Reducing the number of syscalls can give better performance.
The futex syscall interface is complex because of the multiplexing of a lot of operations. More features cannot be added to the same syscall from a maintainability perspective.
Only the futex word of size 32-bit is supported. It helps reduce the memory footprint, especially when many futexes are used. Other sizes, such as 64-bit, can be useful for tracking pointers.
The implementation of the futex isn't NUMA aware. NUMA systems are made up of multiple nodes. Access to the same CPU cache is fast, but moving cache lines between the nodes is expensive. So accessing the Global Hash Table of the futexes becomes costly to access from different nodes.

The eventfd() syscall

The futex syscall has the limitation that the thread can wait for only one futex. What if the process wants to wait on several objects and wants to wake up when any of the objects are free? These semantics can be achieved by using eventfd syscall. The eventfd syscall creates an "eventfd file descriptor" that can be used as an event wait/notify mechanism by user-space applications, and by the kernel to notify user-space applications of events. The eventfd syscall creates a file descriptor for event notification. This fd can be read, written, and polled, which enables listening on multiple fds at a time.

This polling on several eventfd contexts with semaphore semantics would provide similar semantics. However, with sufficient producers and consumers, the eventfd interface itself becomes a bottleneck because each thread will compete to acquire a sequence of wait queue locks for each eventfd context in the poll list. The eventfd operates on file descriptors, a limited resource that has shown its limitation in many use cases. Despite the Windows interface not waiting on more than 64 objects at once, there can still be multiple waiters simultaneously, and the limits of file descriptors on applications like games can be quickly exhausted.

The total number of syscalls for this kind of operation can be reduced to get better performance if there is a syscall to wait on multiple objects at the same time. This is why a dedicated futex_waitv syscall has been introduced.

Introduction of futex_waitv syscall

The futex_waitv syscall is a new syscall through which the process can wait for multiple futexes. The task wakes up when any futex in the list is awakened. This can be used to implement wait on multiple locks and wait lists, etc, without the limitations imposed by using eventfd.

This kind of operation is used in games where a lot of objects are present. There are multiple consumers and multiple producers, all waiting on different subsets of objects. In an implementation using it, the calls of spin lock decrease by more than 80% in some user applications. Compared to a solution that uses eventfd syscall, futex_waitv is able to reduce CPU utilization for some games and even increase frames per second (FPS) for some games.

Wine

Every operating system provides an interface which is a set of APIs and syscalls. The applications use these interfaces to get support from the kernel. While many UNIX-like and other operating systems are POSIX compliant, Microsoft Windows isn't. Windows has its own syscalls and libraries. Thus applications that are made to run on top of it are only Windows-compatible.

Windows has been the default PC gaming platform for decades. The thousands of existing games cannot be ported to Linux because Windows is proprietary software. It can change policies at any time which can create disaster for long-running and profitable companies using Windows as their only platform. What if Windows games can be run on the free Linux operating system without porting or even recompiling? It would bring thousands of Windows applications to Linux. As Linux is open-source and anybody can use it, the perfect ecosystem can be created independent of a proprietary operating system.

Fortunately, it is possible. There is a compatibility layer provided by Wine to do so. Many Windows applications can be run easily with some settings and tweaks inside Wine. Wine implements Windows' system layout and provides a registry, libraries, and many other things that are needed. The Wine developers make a lot of effort to find the behavior of Windows APIs, as sometimes documentation isn't available and sometimes the behavior is undocumented. Hence even though Wine started in 1993, it doesn't run all but many Windows applications.

Proton

Valve, which is Steam's creator, has created an ecosystem where most Windows games can not only run on Linux but also with an equal or better performance than Windows. Proton is a collection of software and libraries combined with a patched version of Wine to improve performance and compatibility with Windows games. Proton uses several out-of-tree patches on top of Wine inside Proton. Proton incorporates several libraries that improve 3D performance. Being a fork of Wine, Proton maintains very similar compatibility with Windows applications as its upstream counterpart. In addition to the official list, many other Windows games are reportedly compatible, although unofficially with Proton. ProtonDB is an unofficial community website that collects and displays crowdsourced data describing the compatibility of a given game title with Proton.

Through Wine, Windows's APIs are emulated in terms of native library calls and syscalls of Linux. Sometimes Windows' API implementation in Linux requires help from the kernel to get good performance. But the kernel doesn't have the support for that particular feature. This is where code is written and added to the Linux kernel. One such example is the introduction of the futex_waitv() syscall.

Esync and Fsync

Zebediah Figura, a Wine and Proton developer, has summarized Esync and Fsync in a very precise way.

Windows has a set of synchronization primitives, essentially all of which are stateful and can be shared between processes. Because these objects can be shared between processes, Wine implements them using a dedicated 'wineserver' process. Each server request is a remote procedure call to the wineserver process wherein data is sent over a socket from the client to the server; the server performs some operation, and then sends a reply back to the client.

High-performance games make heavy use of multiple threads and synchronization primitives. RPC to wineserver is a bottleneck as it is single-threaded and can only service one request at a time. This is why a set of out-of-tree patches for Wine were written that implement these with the eventfd interface, with Windows kernel objects essentially mapping directly to eventfd objects, plus some extra state which lives in shared memory. To a large degree, this results in theoretically equal performance with Windows; most operations which are understood to comprise one system call on Windows now also comprise one system call (to the Linux kernel) on Wine. The actual performance improvement is quite substantial, depending on the application. However, "esync" has its problems. The eventfd doesn't provide interfaces for Windows kernel APIs, and we have to badly emulate them. As a result, some applications simply don't work with it.

Later, a second out-of-tree patch set was developed named "fsync," which uses futexes instead of eventfds. It was developed by request on the grounds that futexes can be faster than eventfds. In practice, there is sometimes a positive performance difference, sometimes a negative one, and often no measurable difference at all; it also sometimes makes performance much less consistent. It shares all of the same problems as "esync," including (to a degree) some inefficiencies when executing contended waits.

Both of these also rely on shared memory being mapped read/write into every application; as a result object state can easily be corrupted by a misbehaving process. These problems have led to its remaining out of tree. Some operations also need more than one system call and could theoretically be improved performance-wise. The developers are thinking and discussing how an upstreamable solution could be developed.

Game benchmarks

Sometimes Esync and Fsync improve the performance such that the number of frames per second (FPS) is improved as well. Gain in FPS is highly desired by gamers. Assassin's Creed Odyssey performs equally well with Esync and Fsync as compared to running without them. Ori and the Blind Forest: Definitive Edition, BioShock and BioShock 2 perform better with Esync. DOOM Eternal, Far Cry New Dawn, Overwatch, Red Dead Redemption 2, The Witcher 3, and World War Z perform better with Fsync.

Landing a new syscall: What is futex?

Kernel 5.16: A new release for a new year

First up in 2022: linux.conf.au!

Landing a new syscall: What is futex?

Kernel 5.16: A new release for a new year

First up in 2022: linux.conf.au!

Comments (12)

Claudio Sampaio (Patola):
Feb 18, 2023 at 02:42 AM

Wait, what? You said many great things about fsync and esync, but in the introduction you promised to tell why futex_waitv and futex2 are needed. Are they the equivalent of fsync and esync? Please clarify. Or finish the article!

Reply to this comment

Reply to this comment
1. kernel guy:
  Feb 20, 2023 at 03:56 PM
  
  They are not equivalent to fsync/esync at all. futex_waitv is a Linux kernel syscall. fsync is a userspace library inside wine/proton that *uses* futex_waitv to implement the emulation of Windows synchronization primitives. futex_waitv was originally designed with fsync in mind, but it has many other use cases outside of gaming and can be used by any linux program.
  
  esync is an older wine/proton library to solve the same windows emulation problem through another kernel interface called eventfd. esync suffers from high cpu usage and allegedly higher latency, which is the reason fsync was developed using the new kernel interface (futex_waitv).
  
  futex2 is a generic term related to the attempt of creating a new, more concise interface for futex in the kernel. The only thing that ever came out of it was futex_waitv.
  
  Reply to this comment
  
  Reply to this comment
  1. Patola (Cláudio Sampaio):
    Feb 20, 2023 at 04:18 PM
    
    That explains it, thank you. Especially the part about futex2 being a generic term. Now, if you don't mind me abusing your answer, there are tkg and zen kernels that still carry an fsync patch (described as e.g. v6.2-fsync1_via_futex_waitv.patch) even or 6.x kernels, that already have a futex2 implementation. Do you know the difference, and which one wine-proton uses when we run games? And why would this patch still be needed? I kind of followed the long history of attempts of Collabora to land a futex2 implementation in the kernel, I think the one that worked was the third attempt, am I right? Why weren't the previous ones accepted?
    
    Reply to this comment
    
    Reply to this comment
    1. kernel guy:
      Feb 20, 2023 at 04:43 PM
      
      Old Proton releases relied on this out-of-tree patch to enable fsync on custom devices even before futex_waitv was merged upstream. Newer Proton builds dropped support for this old interface long ago, and only use the upstream code. One reason these kernels still carry it might be to support older Proton builds. Although, It's unlikely someone still uses these several years old Proton builds. Most likely somebody just fogot to remove it from those kernels.
      
      I don't know how many attempts there were to merge it. open source development usually requires this kind of back and forth to get a piece of code as good as it can be before merging. it is very normal. several people worked on it, among reviewers and developers, to consolidate the interface we have today .
      
      Reply to this comment
      
      Reply to this comment
tony:
Feb 20, 2023 at 11:44 AM

https://www.youtube.com/watch?v=J42KBe1yQuE

Reply to this comment

Reply to this comment
1. kernel guy:
  Feb 20, 2023 at 04:13 PM
  
  did you in your turn? https://youtu.be/UDuUjMG6ac8?t=31
  
  Reply to this comment
  
  Reply to this comment
Hi-Angel:
Feb 27, 2024 at 06:25 AM

Strangely enough, I see no relevant matches in WINE sources while grepping for `fsync` and `futex_waitv`.

So… what gives? Was it never merged?

Reply to this comment

Reply to this comment
1. Muhammad Usama Anjum:
  Feb 28, 2024 at 03:24 PM
  
  While esync and fsync were providing better execution speed for several games, they had issues of their own. Unfortunately, they weren't accepted in upstream Wine and only remained in Proton. So we cannot have anything in Wine that doesn't work to cover all 100% of use cases.
  
  Reply to this comment
  
  Reply to this comment
  1. Patola:
    Feb 28, 2024 at 03:49 PM
    
    I thought fsync/futex2 worked in regular wine? And what about the newly announced fastsync and winesync? And what's the difference between them? There are kernel modules for both on Arch.
    
    Reply to this comment
    
    Reply to this comment
    1. Muhammad Usama Anjum:
      Feb 29, 2024 at 09:01 AM
      
      I'm not entirely in sync with what is being developed as a replacement for fsync. But there is an effort going on to add another primitive to Linux Kernel which can be used without any cons and having the same performance as fsync. I've not seen these patches sent to the Kernel yet. Maybe they have downstream implementations already for testing purposes and those are available in the Arch.
      
      Reply to this comment
      
      Reply to this comment
  2. Hi-Angel:
    Feb 28, 2024 at 03:54 PM
    
    @MuhammadUsamaAnjum hmm, I am confused, wasn't the point of the new FUTEX call (which is implemented by fsync), beside performance of course, is that it is closer to how Windows works, and so easier to maintain?
    
    Reply to this comment
    
    Reply to this comment
    1. Muhammad Usama Anjum:
      Feb 28, 2024 at 04:08 PM
      
      The Proton's Fsync was using the new futex_waitv() syscall under the hood. AFAIK the futuex_waitv() wasn't stateless which means that there was some state that needed to be maintained by Proton. It was possible to do it in Proton, but Wine would never accept something like this. Wine had an alternative implementation already where two-way messaging was involved to implement this which was too slow for games to use. That's why Esync/Fsync were given a shot which gave several games a major performance boost.
      
      The development happens in step by step manner. It takes time to build something, prove it useful to the Linux Kernel community, implement it in the kernel, and use it in Wine if it fulfills all the already working use cases without breaking anything.
      
      Reply to this comment
      
      Reply to this comment

Add a Comment

Search the newsroom

Latest Blog Posts

Building a Board Farm for Embedded World

27/06/2024

With each board running a mainline-first Linux software stack and tested in a CI loop with the LAVA test framework, the Farm showcased Collabora's…

Smart audio filters with WirePlumber 0.5

26/06/2024

WirePlumber 0.5 arrived recently with many new and essential features including the Smart Filter Policy, enabling audio filters to automatically…

The latest on cmtp-responder, a permissively-licensed MTP responder implementation

12/06/2024

Part 3 of the cmtp-responder series with a focus on USB gadgets explores several new elements including a unified build environment with…

A roadmap for VirtIO Video on ChromeOS: part 3

06/06/2024

The final installment of a series explaining how Collabora is helping shape the video virtualization story for Chromebooks with a focus…

Hacking on the PipeWire GStreamer elements

05/06/2024

Last week I attended the GStreamer spring hackfest in Thessaloniki to work on the PipeWire GStreamer elements and connect with the community.

Transforming speech technology with WhisperLive

28/05/2024

The world of AI has made leaps and bounds from what It once was, but there are still some adjustments required for the optimal outcome.…

About Collabora

Whether writing a line of code or shaping a longer-term strategic software development plan, we'll help you navigate the ever-evolving world of Open Source.

한국의 국기 한국어 버전의 Collabora.com 보기