Muhammad Usama Anjum
February 17, 2023
Just over a year has passed since the futex_waitv() syscall, part of the new futex2 systems calls, landed in Linux 5.16. But why are both futex2 and futex_waitv needed? What role do they play in the context gaming on Linux? Let's find out by taking a closer look at the futex syscall.
Sometimes, tasks compete to get a futex. The tasks that have lost the race and are unable to acquire the futex immediately are put to sleep until the futex key changes. The kernel needs to keep track of which process is waiting on which futex, so it knows who to wake up when a wake operation happens.
The kernel uses a hash table for finding the tasks waiting on each futex. Each bucket has priority queues to store the information about the task. The bucket is found in constant time
O(1) and the task is searched inside the bucket in linear time,
O(k) where k is the number of tasks in the bucket. JHash from Bob Jenkins is used as a hashing function.
The hash function maps keys to the bucket index in the hash table. As the futex address is private to the address space of a process, it can be the same for more than one process. Hence, the address of the futex cannot be used directly as the key for hashing. Instead, the hash key is calculated based on the address of the futex, the offset of the futex address from the page boundary, and the address of the task's memory management structure or node pointer. Thus, this key is always unique for each futex.
function futex_wait(uaddr, val, ...) get_futex_key(uaddr, &key, ...) hb = futex_hash(key); lock_hash_bucket(hb) futex_get_value_locked(&uval, uaddr) if uaddr != uval: unlock_hash_bucket(hb) return -EWOULDBLOCK; set_current_state(TASK_INTERRUPTIBLE); futex_queue(q, hb) unlock_hash_bucket(hb) freezable_schedule() // sleeps the current task set_current_state(TASK_RUNNING) // task has woken up futex_unqueue(q)
FUTEX_WAIT operation is executed, the kernel calculates the hash of the futex and finds the hash bucket. The bucket lock is acquired, the futex word is read, and compared to the expected value provided by userspace. The expectation check and sleep are made atomically by taking the bucket lock. If the futex word doesn't have the expected value, the error
-EAGAIN is returned. The purpose of the comparison with the expected value is to prevent lost wake-ups. If another thread changed the value of the futex word after the calling thread decided to block based on the prior value, or if the other thread executed a FUTEX_WAKE operation after the value change and before this FUTEX_WAIT operation, then the calling thread will observe the value change and will not go to sleep.
If the futex word matches the expected value, the task is inserted in this bucket and the task sleeps, waiting for the
FUTEX_WAKE operation from another thread to happen. More than one task can wait at the same futex. Each futex adds new task information in the same hash bucket, which is an expected collision. Hash collisions can also happen for different futexes resulting in the storing of tasks waiting on them in the same hash bucket.
function futex_wake(uaddr, nr_wake, ...) n = 0; get_futex_key(uaddr, &key, ...) hb = futex_hash(key) lock_hash_bucket(hb) for tasks : hash_bucket if key matches: futex_wake_mark(wake_queue, task) n++; if n >= nr_wake break unlock_hash_bucket(hb) wake_tasks(wake_queue)
The number of tasks to be awoken is specified when calling the
FUTEX_WAKE operation. Usually, the number of tasks is specified to wake only the number of processes which can acquire the futex and avoids waking all the processes while only one can acquire the futex. This helps to avoid the thundering heard problem. The kernel calculates the key, finds the hash bucket and wakes up the specified number of tasks waiting on this specific futex from this bucket list.
The futex syscall has the limitation that the thread can wait for only one futex. What if the process wants to wait on several objects and wants to wake up when any of the objects are free? These semantics can be achieved by using eventfd syscall. The eventfd syscall creates an "eventfd file descriptor" that can be used as an event wait/notify mechanism by user-space applications, and by the kernel to notify user-space applications of events. The eventfd syscall creates a file descriptor for event notification. This fd can be read, written, and polled, which enables listening on multiple fds at a time.
This polling on several eventfd contexts with semaphore semantics would provide similar semantics. However, with sufficient producers and consumers, the eventfd interface itself becomes a bottleneck because each thread will compete to acquire a sequence of wait queue locks for each eventfd context in the poll list. The eventfd operates on file descriptors, a limited resource that has shown its limitation in many use cases. Despite the Windows interface not waiting on more than 64 objects at once, there can still be multiple waiters simultaneously, and the limits of file descriptors on applications like games can be quickly exhausted.
The total number of syscalls for this kind of operation can be reduced to get better performance if there is a syscall to wait on multiple objects at the same time. This is why a dedicated futex_waitv syscall has been introduced.
The futex_waitv syscall is a new syscall through which the process can wait for multiple futexes. The task wakes up when any futex in the list is awakened. This can be used to implement wait on multiple locks and wait lists, etc, without the limitations imposed by using eventfd.
This kind of operation is used in games where a lot of objects are present. There are multiple consumers and multiple producers, all waiting on different subsets of objects. In an implementation using it, the calls of spin lock decrease by more than 80% in some user applications. Compared to a solution that uses eventfd syscall,
futex_waitv is able to reduce CPU utilization for some games and even increase frames per second (FPS) for some games.
Every operating system provides an interface which is a set of APIs and syscalls. The applications use these interfaces to get support from the kernel. While many UNIX-like and other operating systems are POSIX compliant, Microsoft Windows isn't. Windows has its own syscalls and libraries. Thus applications that are made to run on top of it are only Windows-compatible.
Windows has been the default PC gaming platform for decades. The thousands of existing games cannot be ported to Linux because Windows is proprietary software. It can change policies at any time which can create disaster for long-running and profitable companies using Windows as their only platform. What if Windows games can be run on the free Linux operating system without porting or even recompiling? It would bring thousands of Windows applications to Linux. As Linux is open-source and anybody can use it, the perfect ecosystem can be created independent of a proprietary operating system.
Fortunately, it is possible. There is a compatibility layer provided by Wine to do so. Many Windows applications can be run easily with some settings and tweaks inside Wine. Wine implements Windows' system layout and provides a registry, libraries, and many other things that are needed. The Wine developers make a lot of effort to find the behavior of Windows APIs, as sometimes documentation isn't available and sometimes the behavior is undocumented. Hence even though Wine started in 1993, it doesn't run all but many Windows applications.
Valve, which is Steam's creator, has created an ecosystem where most Windows games can not only run on Linux but also with an equal or better performance than Windows. Proton is a collection of software and libraries combined with a patched version of Wine to improve performance and compatibility with Windows games. Proton uses several out-of-tree patches on top of Wine inside Proton. Proton incorporates several libraries that improve 3D performance. Being a fork of Wine, Proton maintains very similar compatibility with Windows applications as its upstream counterpart. In addition to the official list, many other Windows games are reportedly compatible, although unofficially with Proton. ProtonDB is an unofficial community website that collects and displays crowdsourced data describing the compatibility of a given game title with Proton.
Through Wine, Windows's APIs are emulated in terms of native library calls and syscalls of Linux. Sometimes Windows' API implementation in Linux requires help from the kernel to get good performance. But the kernel doesn't have the support for that particular feature. This is where code is written and added to the Linux kernel. One such example is the introduction of the futex_waitv() syscall.
Zebediah Figura, a Wine and Proton developer, has summarized Esync and Fsync in a very precise way.
Windows has a set of synchronization primitives, essentially all of which are stateful and can be shared between processes. Because these objects can be shared between processes, Wine implements them using a dedicated 'wineserver' process. Each server request is a remote procedure call to the wineserver process wherein data is sent over a socket from the client to the server; the server performs some operation, and then sends a reply back to the client.
High-performance games make heavy use of multiple threads and synchronization primitives. RPC to wineserver is a bottleneck as it is single-threaded and can only service one request at a time. This is why a set of out-of-tree patches for Wine were written that implement these with the eventfd interface, with Windows kernel objects essentially mapping directly to eventfd objects, plus some extra state which lives in shared memory. To a large degree, this results in theoretically equal performance with Windows; most operations which are understood to comprise one system call on Windows now also comprise one system call (to the Linux kernel) on Wine. The actual performance improvement is quite substantial, depending on the application. However, "esync" has its problems. The eventfd doesn't provide interfaces for Windows kernel APIs, and we have to badly emulate them. As a result, some applications simply don't work with it.
Later, a second out-of-tree patch set was developed named "fsync," which uses futexes instead of eventfds. It was developed by request on the grounds that futexes can be faster than eventfds. In practice, there is sometimes a positive performance difference, sometimes a negative one, and often no measurable difference at all; it also sometimes makes performance much less consistent. It shares all of the same problems as "esync," including (to a degree) some inefficiencies when executing contended waits.
Both of these also rely on shared memory being mapped read/write into every application; as a result object state can easily be corrupted by a misbehaving process. These problems have led to its remaining out of tree. Some operations also need more than one system call and could theoretically be improved performance-wise. The developers are thinking and discussing how an upstreamable solution could be developed.
Sometimes Esync and Fsync improve the performance such that the number of frames per second (FPS) is improved as well. Gain in FPS is highly desired by gamers. Assassin's Creed Odyssey performs equally well with Esync and Fsync as compared to running without them. Ori and the Blind Forest: Definitive Edition, BioShock and BioShock 2 perform better with Esync. DOOM Eternal, Far Cry New Dawn, Overwatch, Red Dead Redemption 2, The Witcher 3, and World War Z perform better with Fsync.
Now included in our Debian images & available via our GitLab, you can build a complete, working BL31 (Boot Loader stage 3.1), and replace…
Back in 2022, after a series of issues were found in its design, I made the call to rework some of WirePlumber's fundamentals in order to…
Continuing our Kernel Integration series, we're excited to introduce DRM-CI, a groundbreaking solution that enables developers to test their…
This is the fourth and final part in a series on persian-rug, a Rust crate for interconnected objects. We've touched on the two big limitations:…
One of the key high-level challenges of building Mesa drivers these days is figuring out how to best share code between a Vulkan driver…
Google Open Source have chosen their second group of winners for the 2023 Google Open Source Peer Bonus Program, and Arnaud Ferraris, Senior…