We're hiring!
*

Gaining eBPF vision: A new way to trace Linux filesystem disk requests

Gabriel Krisman Bertazi avatar

Gabriel Krisman Bertazi
November 21, 2018

Share this post:

Reading time:

When Brendan Gregg gave his Performance Analysis superpowers with Linux BPF talk during the Open Source Summit in Los Angeles last year, he wasn't messing around. Using the new eBPF tracing tools really feels like you gained some x-ray vision powers since, suddenly, opening the program's hood is no longer necessary to see in details how it is behaving internally.

I had the chance of applying these new gained powers last month, when I was asked to develop a small tool to trace when and what parts of each file was being accessed for the first-time during a program initialization. This request came with a few constraints, like the information had to be available independently of the kind of buffered I/O method being used (synchronous, aio, memory-mapping). It also should be trivial to correlate the block data with which files were being accessed and, most importantly, the tracing code should not result in a large performance impact in the observed system.

What is the right level to install our probe?

I started by investigating where exactly we'd want to place our probe:

Tracing at the block layer level

A tracer at the Block Layer would examine requests in terms of disk blocks, which don't directly correlate to filesystem files. A trace at this level gives insight on which areas of the disk are being read or write, and whether they are organized physically in a contiguous fashion. But it doesn't give you a higher level view of the system in a file basis. Other tools already exist to trace block-level accesses, such as the EBF script biosnoop and the traditional blktrace.

Tracing at the filesystem level

A tracer at the filesystem level exposes data in terms of file blocks, which can resolve to one or more blocks of data in the disk. In an example scenario, an Ext4 filesystem with 4Kib file block sizes, one system page likely corresponds to 1 physical block (in 4K disks) or 4 physical blocks in disks with 512 sector size. The tracing at the filesystem level allows us to look at the file in terms of offsets, such that we ignore disk fragmentation. Different installations might fragment the disk differently, and from an application level perspective, we shouldn't really be interested in the disk layout as much as what file blocks of data we need, in case we want to optimize by prefetching them.

Tracing at the Page Cache level

The page cache is a structure that sits in between the VFS/memory-mapping system and the filesystem layer, and is responsible for managing memory sections already read from the disk. By tracing the inclusion and removal pages in this cache we could gain the "First-access" behavior for free. When a new page is first accessed it is brought to the cache, and further accesses won't need to go to the disk. If the page is eventually dropped from the cache because it is no longer needed, a new use will need to reach the disk, and a new access entry will be logged. In our Performance investigation scenario, was the exactly functionality we were looking for.

The probe

The probe we implemented traces Page Cache Misses inside the Kernel Page Cache handling functions to identify the first time a block is requested, before the request is even submitted to the disk. Further requests to the same memory area (as long as the data is still mapped) will return a hit in the cache, which we don't care about, nor trace. This prevents our code from interfering with further accesses, severely diminishing the impact on performance our probe could have.

By tracing the page cache, we are also capable of differentiating blocks that were directly requested by the user program from blocks requested by the Read Ahead logic inside the kernel. Knowing which blocks were read ahead is very interesting information for application developers and system administrators, since it allows them to tune their systems or applications to prefetch the blocks they want in a sane manner.

As is common with any eBPF application, the code is very straightforward. If we ignore some of the boilerplate stuff required to make eBPF compile, the probe comes down to the following function:

int fblktrace_read_pages(struct pt_regs *ctx, struct address_space *mapping,
             struct list_head *pages, struct page *page,
             unsigned nr_pages, bool is_readahead)
{
    u64 index;
    unsigned blkbits = mapping->host->i_blkbits;
    unsigned long ino = mapping->host->i_ino;
     u64 block_in_file;

    for (int i = 0; i < 32 && nr_pages--; i++) {
        if (pages) {
            pages = pages->prev;
            page = container_of(pages, struct page, lru);
        }
        index = page->index;
        block_in_file = (unsigned long) index << (12 - blkbits);

        bpf_trace_printk("=> inode: %ld: FSBLK=%lu BSIZ=%lu %s\\n",
                 ino, index, 1 << blkbits, is_readahead?"[RA]":"");

    }
    return 0;
}

The function above is installed as a tracer around the =ext4_mpage_readpages=function, by the following snippet:

b.attach_kprobe(event="ext4_mpage_readpages", fn_name="fblktrace_read_pages");

The probe runs every time the in-kernel Page Cache asks the filesystem to fetch some page from the disk for the first time. Which area should be read is indirectly identified by the index and offset of the page in the address space of the process. We use that information to calculate the offset of the file to be loaded, in file blocks, and pass that information, alongside the inode number that identifies the file to the print function.

Usage example

For demonstration purposes we wrote a small program, called touchblk, which reads a file in two ways: with the synchronous read/write system calls, and using the mmap feature. On both cases, we read two arbitrarily chosen regions of the file, the block 34 followed by the block 100.

To run the probe, one needs to install the eBPF compiler available in the package BCC tools. Apart from running this example, the BCC package, already available in many Linux distros, includes a large set of eBPF-based probe examples that you can use to learn how to use this tool and write utilities specific to your needs.

The bcc compiler is provided by the iovisor project at:

https://github.com/iovisor/bcc

Now, let's look at the probe in action.

Read/write system calls

[remote:root@fblktrace ~]$ ./fblktrace
printing...
touchblk-2143  [002] d... 91137.791064: 0: => Open inode 14: fname = test.img
touchblk-2143  [002] .N.. 91137.811093: 0: => inode: 14: FSBLK=34 BSIZ=4096 [RA]
touchblk-2143  [002] .... 91137.828293: 0: => inode: 14: FSBLK=100 BSIZ=4096 [RA]

The output above shows the exact behavior of the test application I described before. Since read/write don't necessarily trigger a read ahead, only two entries are show, for the exact blocks we were looking for. A timestamp and inode number are also given. To improve the output, a second probe was installed to map inode numbers to file names, but this is obviously not necessary. It is used only to simplify user's life.

Memory-mapped access

Below is the output of the memory-mapped version. It is quite long...

[remote:root@fblktrace ~]$ ./fblktrace
printing...
touchblk-2147  [003] d... 91258.462486: 0: => Open inode 14: fname = image
touchblk-2147  [003] .... 91258.480927: 0: => inode: 14: FSBLK=18 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.480940: 0: => inode: 14: FSBLK=19 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.480942: 0: => inode: 14: FSBLK=20 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.480943: 0: => inode: 14: FSBLK=21 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.480944: 0: => inode: 14: FSBLK=22 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.480945: 0: => inode: 14: FSBLK=23 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.480946: 0: => inode: 14: FSBLK=24 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.480947: 0: => inode: 14: FSBLK=25 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.480948: 0: => inode: 14: FSBLK=26 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.480949: 0: => inode: 14: FSBLK=27 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.480952: 0: => inode: 14: FSBLK=28 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.480952: 0: => inode: 14: FSBLK=29 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.480954: 0: => inode: 14: FSBLK=30 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.480955: 0: => inode: 14: FSBLK=31 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.480955: 0: => inode: 14: FSBLK=32 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.480956: 0: => inode: 14: FSBLK=33 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.480957: 0: => inode: 14: FSBLK=34 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.480958: 0: => inode: 14: FSBLK=35 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.480959: 0: => inode: 14: FSBLK=36 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.480960: 0: => inode: 14: FSBLK=37 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.480961: 0: => inode: 14: FSBLK=38 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.480962: 0: => inode: 14: FSBLK=39 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.480963: 0: => inode: 14: FSBLK=40 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.480966: 0: => inode: 14: FSBLK=41 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.480967: 0: => inode: 14: FSBLK=42 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.480968: 0: => inode: 14: FSBLK=43 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.480969: 0: => inode: 14: FSBLK=44 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.480970: 0: => inode: 14: FSBLK=45 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.480971: 0: => inode: 14: FSBLK=46 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.480972: 0: => inode: 14: FSBLK=47 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.480973: 0: => inode: 14: FSBLK=48 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.480974: 0: => inode: 14: FSBLK=49 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.498554: 0: => inode: 14: FSBLK=84 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.498565: 0: => inode: 14: FSBLK=85 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.498566: 0: => inode: 14: FSBLK=86 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.498567: 0: => inode: 14: FSBLK=87 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.498568: 0: => inode: 14: FSBLK=88 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.498569: 0: => inode: 14: FSBLK=89 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.498570: 0: => inode: 14: FSBLK=90 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.498573: 0: => inode: 14: FSBLK=91 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.498574: 0: => inode: 14: FSBLK=92 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.498575: 0: => inode: 14: FSBLK=93 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.498576: 0: => inode: 14: FSBLK=94 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.498577: 0: => inode: 14: FSBLK=95 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.498578: 0: => inode: 14: FSBLK=96 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.498579: 0: => inode: 14: FSBLK=97 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.498580: 0: => inode: 14: FSBLK=98 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.498581: 0: => inode: 14: FSBLK=99 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.498582: 0: => inode: 14: FSBLK=100 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.498583: 0: => inode: 14: FSBLK=101 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.498584: 0: => inode: 14: FSBLK=102 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.498585: 0: => inode: 14: FSBLK=103 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.498586: 0: => inode: 14: FSBLK=104 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.498587: 0: => inode: 14: FSBLK=105 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.498588: 0: => inode: 14: FSBLK=106 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.498589: 0: => inode: 14: FSBLK=107 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.498591: 0: => inode: 14: FSBLK=108 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.498592: 0: => inode: 14: FSBLK=109 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.498593: 0: => inode: 14: FSBLK=110 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.498594: 0: => inode: 14: FSBLK=111 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.498595: 0: => inode: 14: FSBLK=112 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.498596: 0: => inode: 14: FSBLK=113 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.498597: 0: => inode: 14: FSBLK=114 BSIZ=4096 [RA]
touchblk-2147  [003] .... 91258.498598: 0: => inode: 14: FSBLK=115 BSIZ=4096 [RA]

Why is it so much longer than the read()/write() syscall example? Because the kernel, in an attempt to optimize expensive I/O operations, assumes that when a specific address is accessed in a non-sequential read file, nearby regions are likely to be needed soon, such that it performs read ahead (RA) I/Os.

The kernel can't assume the next region needed will be the one that immediately follows the block accessed, so it tries to access neighbors both before and after the target block. The number of neighbors that are looked ahead is defined by a filesystem specific parameter in sysfs.

[krisman@dilma sda2]$ cat /sys/fs/ext4/sda2/inode_readahead_blks
32

This parameter tells the kernel to load 32 blocks around the target block during a read ahead. If you go back to the output of the second version of the example code trace and count the blocks that were read for each of the two accesses, you'll observe that there were exactly 32 blocks read for each access, 15 immediately before the target block, and 16 immediately after. This gives a very interesting insight on how the read-ahead mechanism works.

Other kinds of I/O and limitations

This method tries to capture I/O accesses as they go through the page cache, such that other non-buffered mechanisms like Direct I/O will not be traced. This example is also limited to ext4, but it can trivially be expanded to any other Linux filesystem as well.

The full code

As always, the full code is available in the only way we know how: under a Free Software license in a public repository. Enjoy!

https://gitlab.collabora.com/krisman/bcc/blob/master/tools/fblktrace.py

Conclusion

Tracing with eBPF is a very powerful and surgical tool for performance analysis that is within the reach of anyone that takes the time to learn them. And let me tell you, it is definitely worth your time!


Visit Gabriel's blog.

Comments (0)


Add a Comment






Allowed tags: <b><i><br>Add a new comment:


Search the newsroom

Latest Blog Posts

Automatic regression handling and reporting for the Linux Kernel

14/03/2024

In continuation with our series about Kernel Integration we'll go into more detail about how regression detection, processing, and tracking…

Almost a fully open-source boot chain for Rockchip's RK3588!

21/02/2024

Now included in our Debian images & available via our GitLab, you can build a complete, working BL31 (Boot Loader stage 3.1), and replace…

What's the latest with WirePlumber?

19/02/2024

Back in 2022, after a series of issues were found in its design, I made the call to rework some of WirePlumber's fundamentals in order to…

DRM-CI: A GitLab-CI pipeline for Linux kernel testing

08/02/2024

Continuing our Kernel Integration series, we're excited to introduce DRM-CI, a groundbreaking solution that enables developers to test their…

Persian Rug, Part 4 - The limitations of proxies

23/01/2024

This is the fourth and final part in a series on persian-rug, a Rust crate for interconnected objects. We've touched on the two big limitations:…

How to share code between Vulkan and Gallium

16/01/2024

One of the key high-level challenges of building Mesa drivers these days is figuring out how to best share code between a Vulkan driver…

Open Since 2005 logo

We use cookies on this website to ensure that you get the best experience. By continuing to use this website you are consenting to the use of these cookies. To find out more please follow this link.

Collabora Ltd © 2005-2024. All rights reserved. Privacy Notice. Sitemap.