An eBPF overview, part 1: Introduction

Adrian Ratiu avatar

Posted on 05/04/2019 by Adrian Ratiu

Share this post:

Interested in learning more about low-level specifics of the eBPF stack? Read on as we take a deep dive, from its VM mechanisms and tools, to running traces on remote, resource-constrained embedded devices.

Note: These blog posts will be focusing on eBPF so for our purpuses BPF and eBPF are the same thing and can be used interchangeably. The name/acronym deasn't mean much anymore because the project evolved way beyond its initial scope. BPF and eBPF are used interchangeably in the series.

  • Part 1 and 2 give a condensed in-depth introduction to eBPF for newcommers or those looking to further their knowledge by taking a deep dive into the lower part of the eBPF stack.
  • Part 3 is an overview of the userspace tooling meant to increase productivity, built on top of the low level VM mechanisms presented in parts 1 and 2.
  • Part 4 focuses on running eBPF programs on embedded systems which are resource constrained, where a full toolchain stack (BCC/LLVM/python/etc) is undesirable. We'll cross-compile and run eBPF programs on 32 bit ARM with smaller embedded-focused tools. Readers interested only in this part can probably skip the rest.
  • Part 5 is about tracing userspace: Up until now our efforts were focused on tracing the kernel, so it's about time we also examine other user processes.

When in doubt, use this flowchart:

eBPF flowchart

What is it?

eBPF is a register-based Virtual Machine using a custom 64 bit RISC instruction set capable of running Just-in-Time native-compiled "BPF programs" inside the Linux kernel with access to a subset of kernel functions and memory. It is a full VM implementation, not to be confused with the Kernel-based VM (KVM) which is a module enabling Linux to act as hypervisor for other VMs. It is also part of the mainline kernel, so it doesn't require any 3rd party modules like other frameworks (*cough* LTTng or SystemTap *cough*), and comes enabled by default in almost all Linux distributions. Readers familiar with DTrace might find this Dtrace/BPFtrace comparison useful.

Reasons for running a full VM inside the kernel revolve around convenience and safety: While all operations done by eBPF programs can be handled via normal kernel modules, direct kernel programming is a really dangerous endeavour - it can cause lock-ups, memory corruption, crash processes, cause security vulnerabilities and other unwanted effects especially on production devices (eBPF is very often used to inspect systems in production), so running native-fast JiT-compiled kernel code via a safe VM becomes valuable for security monitoring and sandboxing, network filtering, program tracing, profiling, debugging and so on. Some quick examples can be found in this excellent eBPF reference.

By design the eBPF VM and its programs are intentionally not Turing complete: no loops are allowed (there is work-in-progress to support bounded loops) so each eBPF program is guaranteed to finish and not hang, all memory access is bounded and type-checked (including registers, a MOV instruction can change the type of a register), there are no null dereferences, a program must have at most BPF_MAXINSNS instructions (default 4096), the "main" function takes a single argument (the context) and so on. These kinds of restrictions enable simple and fast corectness verification when an eBPF program is loaded in the kernel and its instructions get parsed into a directed-acyclic-graph by a verifier module.

Historically the eBPF VM was only available from within the kernel and was used to filter network packets without userspace interference, hence the now-nonsense name of "Berkley Packet Filter". Starting with kernel v3.18 (2014), the VM was also exposed to userspace via the bpf() syscall and uapi/linux/bpf.h which resulted in a freeze of its instruction set present at the time which became a public ABI, though new instructions can still be (and have been) added later.

Because the in-kernel eBPF implementation is licensed under GPLv2, it cannot be easily redistributed by non-GPL users so there is also an alternative Apache-licensed userspace eBPF VM implementation called "uBPF". Legalese aside, this userspace implementation can probably be useful for tracing performance-critical applications which need to avoid the cost of kernel-userspace context switches.

How does it work?

eBPF programs are run by the kernel when events occur, so they can be viewed as a form of function hooking or event driven programming. There is very little use in running on-demand eBPF programs from userspace as all on-demand user calls are already handled via the normal non-VM kernel API calls ("syscalls") where VM bytecode brings little value. Events can be generated from kprobes/uprobes, tracepoints, dtrace probes, sockets and more. This allows to hook and inspect memory in any function at any instruction in both kernel and user processes, to intercept file operations, inspect specific network packets and so on. A good reference is the BPF Features by Linux Kernel Version.

As mentioned before, an event triggers the execution of an attached eBPF program which then can store information in maps, print to ringbuffers or call a subset of kernel functions defined by a special API. An eBPF program can be attached to multiple events and different eBPF programs can also access the same map to share data. A special read/write map called "program array" stores references to other eBPF programs loaded via the bpf() syscall and a succesful lookup in this map triggers a jump with no return to the respective eBPF program. This eBPF nesting is also bounded to avoid infinite recursion loops.

Steps for running an eBPF program:

  1. Userspace sends bytecode to the kernel together with a program type which determines what kernel areas can be accessed.
  2. The kernel runs a verifier on the bytecode to make sure the program is safe to run (kernel/bpf/verifier.c).
  3. The kernel JiT-compiles the bytecode to native code and inserts it in (or attaches to) the specified code location.
  4. The inserted code writes data to ringbuffers or generic key-value maps.
  5. Userspace reads the result values from the shared maps or ringbuffers.

The map and ringbuffer structures are managed by the kernel (like pipes and FIFOs are) independently of the hooked eBPF or user programs accessing them. Accesses are asynchronous via file descriptors and reference counting ensures structures exist as long as there is at least one program accessing them. The hooked JiT-compiled native code usually gets removed when the user process which loaded it terminates, though in some cases it can still continue beyond the loading-process' lifespan.

To ease writing eBPF programs and avoid doing raw bpf() syscalls, the kernel provides a convenient libbpf library containing syscall function wrappers like bpf_load_program and structure definitins like bpf_map which can be linked statically or as a DSO, dual-licensed under the LGPL 2.1 and BSD 2-Clause. The kernel tree also provides some neat examples which use libbpf in samples/bpf/.

An example study

Kernel developers are poor creatures becuase the kernel is a self-contained project: there are no userspace goodies in the kernel - no Glibc, no LLVM, no JavaScript, not even WebAssembly! - this is why the kernel-tree eBPF examples contain raw bytecode or load pre-assembled bytecode files via libbpf. We can see this in sock_example.c, a simple userspace program using eBPF to count how many TCP, UDP and ICMP protocol packets are received on the loopback interface.

We skip the main and open_raw_sock functions because they're trivial and instead focus on test_sock because that's where the magic happens:

static int test_sock(void)
	int sock = -1, map_fd, prog_fd, i, key;
	long long value = 0, tcp_cnt, udp_cnt, icmp_cnt;

	map_fd = bpf_create_map(BPF_MAP_TYPE_ARRAY, sizeof(key), sizeof(value),	256, 0);
	if (map_fd < 0) {
		printf("failed to create map '%s'\n", strerror(errno));
		goto cleanup;

	struct bpf_insn prog[] = {
		BPF_LD_ABS(BPF_B, ETH_HLEN + offsetof(struct iphdr, protocol) /* R0 = ip->proto */),
		BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), /* *(u32 *)(fp - 4) = r0 */
		BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), /* r2 = fp - 4 */
		BPF_LD_MAP_FD(BPF_REG_1, map_fd),
		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
		BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */
		BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
		BPF_MOV64_IMM(BPF_REG_0, 0), /* r0 = 0 */
	size_t insns_cnt = sizeof(prog) / sizeof(struct bpf_insn);

	prog_fd = bpf_load_program(BPF_PROG_TYPE_SOCKET_FILTER, prog, insns_cnt,
				   "GPL", 0, bpf_log_buf, BPF_LOG_BUF_SIZE);
	if (prog_fd < 0) {
		printf("failed to load prog '%s'\n", strerror(errno));
		goto cleanup;

	sock = open_raw_sock("lo");

	if (setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd)) < 0) {
		printf("setsockopt %s\n", strerror(errno));
		goto cleanup;

First a BPF map is created which behaves like a fixed-size array of max 256 elements via the libbpf API. The keys index network protocols as defined by IPROTO_* (2 byte words) and the values represent the respective packet counts (4 bytes size). Beside arrays, eBPF maps also implement other data structure types like stacks or queues.

Next the eBPF bytecode instruction array is defined using the conveniently provided kernel macros. We won't go into details on the bytecode here (that will be done in part 2 after the machine is described some more). At a high level the bytecode reads the protocol word from the packet buffer then looks it up in the map and increments its specific packet count.

Then the BPF bytecode is loaded into the kernel and verified for corectness/safety via libbpf's bpf_load_program which returns a fd for reference. The call specifies what type of program this eBPF is which determines what kernel subset it will have access to. Because this is a SOCKET_FILTER, an argument pointing to the current network packet is provided. Finally the eBPF bytecode is attached via the socket layer to a specific raw socket after which it will be run for every received packet regardless of protocol.

All that remains is for the user process to start polling the shared map for the data:

	for (i = 0; i < 10; i++) {
		key = IPPROTO_TCP;
		assert(bpf_map_lookup_elem(map_fd, &key, &tcp_cnt) == 0);

		key = IPPROTO_UDP;
		assert(bpf_map_lookup_elem(map_fd, &key, &udp_cnt) == 0);

		assert(bpf_map_lookup_elem(map_fd, &key, &icmp_cnt) == 0);

		printf("TCP %lld UDP %lld ICMP %lld packets\n",
		       tcp_cnt, udp_cnt, icmp_cnt);


The basics of eBPF were introduced in this first part and we got our feet wet with an example of how to load bytecode and communicate with the eBPF VM. Compiling and running the example is left as an exercise for the reader due to space constraints. We also intentionally stopped short of looking at specific eBPF bytecode instructions as this will be the focus of part 2. In the example we studied, userspace reads eBPF map values from the kernel VM via libbpf directly in C (using 10 x 1 second sleeps of all things!) which is clunky, error-prone and can get complex fast, so in part 3 we'll be looking at higher level tools which automate these VM interactions via scripting or domain-specific languages.

Comments (1)

  1. annonymos:
    Apr 16, 2019 at 01:57 PM

    What a well written article, thank you so much for the explaining the basics of eBPF.

    Reply to this comment

    Reply to this comment

Add a Comment

Allowed tags: <b><i><br>Add a new comment:

Latest Blog Posts

Quick hack: git-pw


A well-known Linux kernel developer once said, a poor craftsman famously complains about his tools, but a good craftsman knows how to choose…

An eBPF overview, part 2: Machine & bytecode


The second part of this series takes a more in-depth look at the eBPF VM and program studied in the first part. Having this low level knowledge…

An eBPF overview, part 1: Introduction


Interested in learning more about low-level specifics of the eBPF stack? Read on as we take a deep dive, from its VM mechanisms and tools,…

Running Android next to Wayland


It's now possible to run Android applications in the same graphical environment as regular Wayland Linux applications with full 3D acceleration.…

Modern USB gadget on Linux & how to integrate it with systemd (Part 2)


In the previous post I introduced you to the subject of USB gadgets implemented as machines running Linux. In this post, we look at how…

Bootstraping a minimal Arch Linux image


In this tutorial, we'll look at how to create a functional and simple Arch Linux virtual machine image, that can have network access, display…

Open Since 2005 logo

We use cookies on this website to ensure that you get the best experience. By continuing to use this website you are consenting to the use of these cookies. To find out more please follow this link.

Collabora Ltd © 2005-2019. All rights reserved. Website sitemap.