We're hiring!

Landing a new syscall, part 1: What is futex?

André Almeida avatar

André Almeida
February 08, 2022

Share this post:

Over the past 18 months, we have been on a roller-coaster ride developing futex2, a new set of system calls. As part of this prolonged effort, the futex_waitv() syscall has now successfully landed in Linux 5.16.

A followup of the initial futex syscall, this new interface aims to overcome long term issues that have been limiting the way applications use the Linux kernel. But what exactly is futex? This series of blog posts will help answer that and other questions around this tricky function.

If you've ever run strace in a multithread program, chances are that the trace was filled with futex() calls. If you are a Linux gamer trying to increase the performance of your setup, you have probably heard of futex as well.

Read on as I take a deep dive into this important system call and how it is used to process synchronization functions.

Data racing

Before getting into futex, we need to know which problem it solves: race condition. Since the arrival of multitasking operating systems, different tasks can now work at the same time and compete for the same resources. Using the same data at the same moment can lead to serious bugs as we will see next.

Imagine that your bank has a multithreading software to manage your account running in a single core, but failed to correctly use synchronization mechanisms. Let's say that you have 500 moneys in the bank. You want to withdraw 300 moneys and at the same time your friend wants to send you 200 moneys. Then, in the fast multithread bank system, this happens:

Thread 1

aux = account_balance 	//  500
aux += receive_money   	//  500 + 200
account_balance = aux 	//  700

Thread 2

aux = account_balance 	//  700
aux -= withdraw_money	//  700 - 300
account_balance = aux	//  400

All good! Everything went as expected. The kernel can order the threads in different manners. In this case, if Thread 2 went before Thread 1 the final result would be the same anyway:

Thread 2

aux = account_balance 	//  500
aux -= withdraw_money	//  500 - 300
account_balance = aux	//  200

Thread 1

aux = account_balance 	//  200
aux += receive_money   	//  200 + 200
account_balance = aux 	//  400

However, the scheduler preemption can interleave the execution of both threads, resulting in something like this:

Thread 1

aux = account_balance 	//  500
aux += receive_money   	//  500 + 200

Thread 2

aux = account_balance 	//  500
aux -= withdraw_money	//  500 - 300
account_balance = aux	//  200

And back to Thread 1

account_balance = aux 	//  700

Given that both threads were sharing the same resource (account_balance) and aren't using synchronization mechanisms, they failed to keep data consistence. Thread 2 writes 200 to account_balance, but just after Thread 1 overwrites it with 700.

If we were in a multiprocessor machine, this issue would be even worse given that these threads can run in concurrency.

But it's not that bad, right? You got more moneys! However, the opposite scheduling could happen as well, keeping you with only 200 moneys in the bank. So how do we solve this?

Mutual exclusion: Mutex

Mutex is a sync primitive that helps you make sure only one thread can access a critical section (hence the name, mutual exclusion).

Imagine that there's a room that can only be accessed by one person at a time. You get the key for the room, access it, and lock the door. When exiting, you leave the key for the next person. This is how a mutex works: you lock to use access the critical section and then unlock after using it. If when you want to lock it's already locked, then you need to patiently wait for another thread to open the door and give you the key.

In our bank example, the code could be fixed like this:

aux = account_balance 
account_balance = aux

This would then make any thread changing the account balance do so exclusively, without racing to other threads. Even if the thread gets scheduled with the lock, any other thread that tries to change the account_balance would face the mutex_lock() and would have to wait until someone calls mutex_unlock().

There are several other sync mechanisms out there, like spinlocks, semaphores and barriers, but in this article we are going to focus just on mutexes.

"Never write your own locks"

The goal of a mutex is to enable multiple threads to do their work, so ideally we should not impose high overheads with our mutexes. They should be super fast when the lock is not taken and ensure that tasks waiting to get the lock don't waste CPU time.

So how is a userspace mutex implemented? There are different approaches to it, as operating systems and programming libraries can take different directions. But if I would implement my own, I would make sure to use the kernel. After all, only the kernel can put a thread to sleep and awaken it when asked.

If we use the kernel, we can figure out a nice interface that takes care of a value for us. That would then allow us to:

  • Check if the lock is taken. If it is, sleep;
  • If it is not taken, take it;
  • And when releasing the lock, wake anyone waiting for it.

However, doing syscalls requires expensive context switches (and costs even more in the face of Meltdown). Don't forget that our goal is to be as fast as possible in the common case, which is when the lock is not taken. This is where futex design comes in.

Futex: fast userspace mutex

Instead of asking for the kernel to take care of the mutex value for us, we can keep track of it in userspace. Then, depending on the value, we can use the kernel only if we need to sleep or to wake other threads.

This is the main design feature of the futex. Most of the time, it doesn't need the kernel, so we have less context switch and can have a lightweight mutex implementation. However, to make this work, the platform needs to support atomic operations in order to ensure the mutex key is atomically modified. Let's see what the code would look like for such a situation.

First, let's define values for our mutex state. Futex can only operate with unsigned int of 32bits:

#define UNLOCKED 0
#define LOCKED 1

Now, here's an implementation for the mutex_lock(). Note, however, that a correct mutex is slightly more complex than this example. A complete mutex would enable us to avoid going into the kernel when possible, and would be able to prevent mutual starvation between two threads acquiring the lock. The example below is therefore only for didactic purposes, to help demonstrate how a mutex works.

mutex_lock(unsigned int *mutex)
	while (!cmpxchg(mutex, UNLOCKED, LOCKED))
		futex(mutex, FUTEX_WAIT, LOCKED);

Breaking down the code:

Here, our mutex is just a simple uint. cmpxchg() is an atomic compare-and-exchange operation, where if the mutex value is UNLOCKED, it replaces it by LOCKED. It returns true if the operation succeeded or false otherwise. If it returns true, it means that the lock was free and we took it. Otherwise, the lock was already taken by someone else and we need to wait for it to get freed. Now we call the futex() syscall. We use mutex as the first argument, since the memory address of our value is the identifier.

The second argument is FUTEX_WAIT, that's the opcode to wait on this address until something wakes us (you can optionally use a timeout argument as well). The third argument is the value that we expect to find as the value of the mutex variable. If the value at mutex is different from that, the syscall returns immediately with an -EAGAIN error. We need to set that because between the cmpxchg() and futex() the lock could be freed and if we wait in a free lock, we will probably wait forever.

We need to do this in a "while loop" because when waking up we might race with another thread trying to acquire the lock, and might need to wait again.

On the unlock side, we have this:

mutex_unlock(unsigned int *mutex)
	atomic_set(mutex, UNLOCKED);
	futex(mutex, FUTEX_WAKE, 1);

We atomically set the value as UNLOCKED, making it free to anyone that wants to take it and issue a futex() call; using the FUTEX_WAKE operation with 1 as an argument. That means that we will wake someone that is waiting in the mutex address. We could optimize this code to have more than two states and have a way to specify that the mutex is locked and that there is someone waiting for it. If there is no one waiting for it, then we don't need to call futex() in the unlock function. That way we achieve a very fast mutex with no syscalls needed for the common case, that's where the lock is not taken.


As you can see, futexes are tricky. Hopefully, if everything works as expected, software developers won't need to know that it exists, even if it's constantly being used by multithread applications.

At a high level, the futex syscall can be described as providing a kernel side wait queue indexed by a userspace address, to which the userspace can add or remove new threads from.

In the next post, we are going to dive into the internal implementation of the futex syscall, its limitations, and why a new interface is being proposed.

Comments (4)

  1. Joao Pereira:
    Feb 09, 2022 at 02:12 PM

    Typo on the data race code example: thread 2 should be 500-300=200

    Reply to this comment

    Reply to this comment

    1. Mark Filion:
      Feb 09, 2022 at 03:15 PM

      Good catch, thanks for letting us know. Typo has been fixed!

      Reply to this comment

      Reply to this comment

  2. Ian Taylor:
    Feb 10, 2022 at 01:00 AM

    What if there is more than one thread waiting for the unlock. How do ensure they are given priority? Is there a "take a number and be served in turn" mechanism?

    Reply to this comment

    Reply to this comment

    1. André Almeida:
      Feb 10, 2022 at 04:59 PM

      For the normal FUTEX_WAKE operation there are no priority, as per man page "No guarantee is provided about which waiters are awoken (e.g., a waiter with a higher scheduling priority is not guaranteed to be awoken in preference to a waiter with a lower priority)."

      However, for some use cases FUTEX_WAKE_OP and FUTEX_PI can be helpful.

      Reply to this comment

      Reply to this comment

Add a Comment

Allowed tags: <b><i><br>Add a new comment:

Search the newsroom

Latest Blog Posts

Visual-inertial tracking for Monado


Monado now has initial support for 6DoF ("inside-out") tracking for devices with cameras and an IMU! Three free and open source SLAM/VIO…

Spotlight on Meson's full-featured developer environment


When developing an application or a library, it is very common to want to run it without installing it, or to install it into a custom prefix…

How to write a Vulkan driver in 2022


An incredible amount has changed in Mesa and in the Vulkan ecosystems since we wrote the first Vulkan driver in Mesa for Intel hardware…

Improving the reliability of file system monitoring tools


Every file system used in production has tools to try to recover from system crashes. To provide a better infrastructure for those tools,…

PipeWire: A year in review & a look ahead


The PipeWire project made major strides over the past few years, bringing shiny new features, and paving the way for new possibilities in…

Landing a new syscall, part 1: What is futex?


Over the past 18 months, we have been on a roller-coaster ride developing futex2, a new set of system calls. As part of this effort, the…

Open Since 2005 logo

We use cookies on this website to ensure that you get the best experience. By continuing to use this website you are consenting to the use of these cookies. To find out more please follow this link.

Collabora Ltd © 2005-2022. All rights reserved. Privacy Notice. Sitemap.