We're hiring!
*

An eBPF overview, part 1: Introduction

Adrian Ratiu avatar

Adrian Ratiu
April 05, 2019

Share this post:

Interested in learning more about low-level specifics of the eBPF stack? Read on as we take a deep dive, from its VM mechanisms and tools, to running traces on remote, resource-constrained embedded devices.

Note: These blog posts will be focusing on eBPF so for our purpuses BPF and eBPF are the same thing and can be used interchangeably. The name/acronym deasn't mean much anymore because the project evolved way beyond its initial scope. BPF and eBPF are used interchangeably in the series.

  • Part 1 and Part 2 give a condensed in-depth introduction to eBPF for newcommers or those looking to further their knowledge by taking a deep dive into the lower part of the eBPF stack.
  • Part 3 is an overview of the userspace tooling meant to increase productivity, built on top of the low level VM mechanisms presented in parts 1 and 2.
  • Part 4 focuses on running eBPF programs on embedded systems which are resource constrained, where a full toolchain stack (BCC/LLVM/python/etc) is undesirable. We'll cross-compile and run eBPF programs on 32 bit ARM with smaller embedded-focused tools. Readers interested only in this part can probably skip the rest.
  • Part 5 is about tracing userspace: Up until now our efforts were focused on tracing the kernel, so it's about time we also examine other user processes.

When in doubt, use this flowchart:

eBPF flowchart

What is it?

eBPF is a register-based Virtual Machine using a custom 64 bit RISC instruction set capable of running Just-in-Time native-compiled "BPF programs" inside the Linux kernel with access to a subset of kernel functions and memory. It is a full VM implementation, not to be confused with the Kernel-based VM (KVM) which is a module enabling Linux to act as hypervisor for other VMs. It is also part of the mainline kernel, so it doesn't require any 3rd party modules like other frameworks (*cough* LTTng or SystemTap *cough*), and comes enabled by default in almost all Linux distributions. Readers familiar with DTrace might find this Dtrace/BPFtrace comparison useful.

Reasons for running a full VM inside the kernel revolve around convenience and safety: While all operations done by eBPF programs can be handled via normal kernel modules, direct kernel programming is a really dangerous endeavour - it can cause lock-ups, memory corruption, crash processes, cause security vulnerabilities and other unwanted effects especially on production devices (eBPF is very often used to inspect systems in production), so running native-fast JiT-compiled kernel code via a safe VM becomes valuable for security monitoring and sandboxing, network filtering, program tracing, profiling, debugging and so on. Some quick examples can be found in this excellent eBPF reference.

By design the eBPF VM and its programs are intentionally not Turing complete: no loops are allowed (there is work-in-progress to support bounded loops) so each eBPF program is guaranteed to finish and not hang, all memory access is bounded and type-checked (including registers, a MOV instruction can change the type of a register), there are no null dereferences, a program must have at most BPF_MAXINSNS instructions (default 4096), the "main" function takes a single argument (the context) and so on. These kinds of restrictions enable simple and fast correctness verification when an eBPF program is loaded in the kernel and its instructions get parsed into a directed-acyclic-graph by a verifier module.

Historically the eBPF VM was only available from within the kernel and was used to filter network packets without userspace interference, hence the now-nonsense name of "Berkley Packet Filter". Starting with kernel v3.18 (2014), the VM was also exposed to userspace via the bpf() syscall and uapi/linux/bpf.h which resulted in a freeze of its instruction set present at the time which became a public ABI, though new instructions can still be (and have been) added later.

Because the in-kernel eBPF implementation is licensed under GPLv2, it cannot be easily redistributed by non-GPL users so there is also an alternative Apache-licensed userspace eBPF VM implementation called "uBPF". Legalese aside, this userspace implementation can probably be useful for tracing performance-critical applications which need to avoid the cost of kernel-userspace context switches.

How does it work?

eBPF programs are run by the kernel when events occur, so they can be viewed as a form of function hooking or event driven programming. There is very little use in running on-demand eBPF programs from userspace as all on-demand user calls are already handled via the normal non-VM kernel API calls ("syscalls") where VM bytecode brings little value. Events can be generated from kprobes/uprobes, tracepoints, dtrace probes, sockets and more. This allows to hook and inspect memory in any function at any instruction in both kernel and user processes, to intercept file operations, inspect specific network packets and so on. A good reference is the BPF Features by Linux Kernel Version.

As mentioned before, an event triggers the execution of an attached eBPF program which then can store information in maps, print to ringbuffers or call a subset of kernel functions defined by a special API. An eBPF program can be attached to multiple events and different eBPF programs can also access the same map to share data. A special read/write map called "program array" stores references to other eBPF programs loaded via the bpf() syscall and a succesful lookup in this map triggers a jump with no return to the respective eBPF program. This eBPF nesting is also bounded to avoid infinite recursion loops.

Steps for running an eBPF program:

  1. Userspace sends bytecode to the kernel together with a program type which determines what kernel areas can be accessed.
  2. The kernel runs a verifier on the bytecode to make sure the program is safe to run (kernel/bpf/verifier.c).
  3. The kernel JiT-compiles the bytecode to native code and inserts it in (or attaches to) the specified code location.
  4. The inserted code writes data to ringbuffers or generic key-value maps.
  5. Userspace reads the result values from the shared maps or ringbuffers.

The map and ringbuffer structures are managed by the kernel (like pipes and FIFOs are) independently of the hooked eBPF or user programs accessing them. Accesses are asynchronous via file descriptors and reference counting ensures structures exist as long as there is at least one program accessing them. The hooked JiT-compiled native code usually gets removed when the user process which loaded it terminates, though in some cases it can still continue beyond the loading-process' lifespan.

To ease writing eBPF programs and avoid doing raw bpf() syscalls, the kernel provides a convenient libbpf library containing syscall function wrappers like bpf_load_program and structure definitions like bpf_map which can be linked statically or as a DSO, dual-licensed under the LGPL 2.1 and BSD 2-Clause. The kernel tree also provides some neat examples which use libbpf in samples/bpf/.

An example study

Kernel developers are poor creatures because the kernel is a self-contained project: there are no userspace goodies in the kernel - no Glibc, no LLVM, no JavaScript, not even WebAssembly! - this is why the kernel-tree eBPF examples contain raw bytecode or load pre-assembled bytecode files via libbpf. We can see this in sock_example.c, a simple userspace program using eBPF to count how many TCP, UDP and ICMP protocol packets are received on the loopback interface.

We skip the main and open_raw_sock functions because they're trivial and instead focus on test_sock because that's where the magic happens:

static int test_sock(void)
{
	int sock = -1, map_fd, prog_fd, i, key;
	long long value = 0, tcp_cnt, udp_cnt, icmp_cnt;

	map_fd = bpf_create_map(BPF_MAP_TYPE_ARRAY, sizeof(key), sizeof(value),	256, 0);
	if (map_fd < 0) {
		printf("failed to create map '%s'\n", strerror(errno));
		goto cleanup;
	}

	struct bpf_insn prog[] = {
		BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
		BPF_LD_ABS(BPF_B, ETH_HLEN + offsetof(struct iphdr, protocol) /* R0 = ip->proto */),
		BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), /* *(u32 *)(fp - 4) = r0 */
		BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
		BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), /* r2 = fp - 4 */
		BPF_LD_MAP_FD(BPF_REG_1, map_fd),
		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
		BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
		BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */
		BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
		BPF_MOV64_IMM(BPF_REG_0, 0), /* r0 = 0 */
		BPF_EXIT_INSN(),
	};
	size_t insns_cnt = sizeof(prog) / sizeof(struct bpf_insn);

	prog_fd = bpf_load_program(BPF_PROG_TYPE_SOCKET_FILTER, prog, insns_cnt,
				   "GPL", 0, bpf_log_buf, BPF_LOG_BUF_SIZE);
	if (prog_fd < 0) {
		printf("failed to load prog '%s'\n", strerror(errno));
		goto cleanup;
	}

	sock = open_raw_sock("lo");

	if (setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd)) < 0) {
		printf("setsockopt %s\n", strerror(errno));
		goto cleanup;
	}

First a BPF map is created which behaves like a fixed-size array of max 256 elements via the libbpf API. The keys index network protocols as defined by IPROTO_* (2 byte words) and the values represent the respective packet counts (4 bytes size). Beside arrays, eBPF maps also implement other data structure types like stacks or queues.

Next the eBPF bytecode instruction array is defined using the conveniently provided kernel macros. We won't go into details on the bytecode here (that will be done in part 2 after the machine is described some more). At a high level the bytecode reads the protocol word from the packet buffer then looks it up in the map and increments its specific packet count.

Then the BPF bytecode is loaded into the kernel and verified for corectness/safety via libbpf's bpf_load_program which returns a fd for reference. The call specifies what type of program this eBPF is which determines what kernel subset it will have access to. Because this is a SOCKET_FILTER, an argument pointing to the current network packet is provided. Finally the eBPF bytecode is attached via the socket layer to a specific raw socket after which it will be run for every received packet regardless of protocol.

All that remains is for the user process to start polling the shared map for the data:

	for (i = 0; i < 10; i++) {
		key = IPPROTO_TCP;
		assert(bpf_map_lookup_elem(map_fd, &key, &tcp_cnt) == 0);

		key = IPPROTO_UDP;
		assert(bpf_map_lookup_elem(map_fd, &key, &udp_cnt) == 0);

		key = IPPROTO_ICMP;
		assert(bpf_map_lookup_elem(map_fd, &key, &icmp_cnt) == 0);

		printf("TCP %lld UDP %lld ICMP %lld packets\n",
		       tcp_cnt, udp_cnt, icmp_cnt);
		sleep(1);
	}
}

Summary

The basics of eBPF were introduced in this first part and we got our feet wet with an example of how to load bytecode and communicate with the eBPF VM. Compiling and running the example is left as an exercise for the reader due to space constraints. We also intentionally stopped short of looking at specific eBPF bytecode instructions as this will be the focus of part 2. In the example we studied, userspace reads eBPF map values from the kernel VM via libbpf directly in C (using 10 x 1 second sleeps of all things!) which is clunky, error-prone and can get complex fast, so in part 3 we'll be looking at higher level tools which automate these VM interactions via scripting or domain-specific languages.

Continue reading (An eBPF overview, part 2: Machine & bytecode)

Comments (7)

  1. annonymos:
    Apr 16, 2019 at 01:57 PM

    What a well written article, thank you so much for the explaining the basics of eBPF.

    Reply to this comment

    Reply to this comment

  2. annonymous:
    Apr 19, 2019 at 07:45 PM

    Thanks for the article!

    s/definitins/definitions/
    s/becuase/because/

    Reply to this comment

    Reply to this comment

    1. Adrian Ratiu:
      Apr 23, 2019 at 03:38 PM

      Thank you for spotting the errors, they have been corrected!

      Reply to this comment

      Reply to this comment

  3. Tiago Katcipis:
    Apr 28, 2019 at 07:15 PM

    Great stuff Adrian, thanks =).

    There is a small typo: s/corectness/correctness

    Reply to this comment

    Reply to this comment

  4. anon:
    Apr 29, 2019 at 06:29 PM

    The link to "other data structure types" is wrong, it goes to the IP_PROTO enum.

    Reply to this comment

    Reply to this comment


Add a Comment






Allowed tags: <b><i><br>Add a new comment:


Search the newsroom

Latest Blog Posts

New graphing tool for PipeWire debugging

09/12/2019

PipeWire, the new and emerging open source framework that aims to greatly improve the exchange and management of audio and video streams…

Building GStreamer on Windows

26/11/2019

With the advent of meson and gst-build, it is now possible to set up a GStreamer Windows development environment that rivals the finest…

Zink: Fall Update

24/10/2019

I recently went to XDC 2019, where I gave yet another talk about Zink. I kinda forgot to write a blog-post about it, so here’s me trying…

Adding stateless support to vicodec

09/10/2019

Prior to joining Collabora, I took part in Round 17 of the Outreachy internships, to work on the virtual drivers in the media subsystem…

Why HDCP support in Weston is a good thing

03/10/2019

What HDCP is, and why supporting HDCP in Weston is justified in both an economical and technical context.

Virglrenderer and the state of virtualized virtual worlds

28/08/2019

With the release of virglrenderer 0.8.0, getting accelerated OpenGL within a virtual machine (VM) made a big leap forward. Since virglrenderer-0.7.0,…

Open Since 2005 logo

We use cookies on this website to ensure that you get the best experience. By continuing to use this website you are consenting to the use of these cookies. To find out more please follow this link.

Collabora Ltd © 2005-2019. All rights reserved. Website sitemap.