December 01, 2023
How can we help developers and maintainers integrate code more efficiently? How can we help the Linux-based tech industry continuously integrate the latest Linux kernel into their systems faster, and with less hassle? How can we identify, report, and fix kernel regressions as fast as possible? How can we mitigate maintainer burnout?
We recently returned from Linux Plumbers Conference 2023 in Richmond, Virginia, where we explored these questions with the community, identified common problems, and envisioned how existing solutions can evolve to complement each other to solve bigger problems for the Linux Kernel Ecosystem.
As many know, there are a lot of interesting tools and services that help developers and maintainers integrate code into the mainline kernel. 0-day, syzbot, kunit, kselftests, regzbot, KernelCI, smatch, kdevops, kvm-xfstests, BPF-CI, drm/ci, and Linux Test Project (LTP) can all add value to the community in one way or another.
However, having such a validation infrastructure, along with all the data it can generate, does not necessarily mean that we can make the most of it. There simply is not enough coordination between the tools and services, making it hard for maintainers to benefit from all the available infrastructure.
With maintainers already burnt out, receiving data from many different sources may make things worse: checking each source takes extra time, and it becomes harder to see where the highest priorities are. We must address this situation so that the infrastructure reduces maintainers' workload rather than amplifying their burnout.
Of course, there is not a one-size-fits-all solution for this situation. Step by step, we must evolve and grow the entire testing and validation infrastructure we have in the community, and improve the overall quality to bring more benefits to the entire ecosystem.
In today's article and in follow-up articles in our upcoming Kernel Integration series, we will discuss not only the work Collabora is doing, but also how to connect distinct pieces of work happening across the community to design solutions.
Collabora has been contributing to KernelCI for a few years already. We also created Mesa CI and drm/ci. In order to run the tests, we assembled a test laboratory in one of our offices that runs hundreds of thousands of automated tests per month on real hardware.
However, once we started looking at the amount of test data produced, we realized that it was still quite hard to organize and evaluate all the kernel test results available. We lacked the proper tools or knowledge to efficiently do so, so we began a research project.
Test systems for the kernel, as is the case for KernelCI today, do not have the concept of tracking a test regression across different kernel and hardware configurations. They also do not know how to track a regression over time to understand whether it has been reported, or whether a fix has already been proposed.
At the time, we began some experiments to learn how to match different test regressions to a unique kernel regression, how to identify flakiness in the tests (or the hardware), and more. We will discuss this topic in more detail in future articles of the Kernel Integration series. If you want to learn more, check out the LPC discussion from Gustavo Padovan and Ricardo Cañuelo, Unifying and improving test regression reporting and tracking (video).
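The two ideas mentioned above, folding per-configuration test regressions into a single kernel regression and separating out flaky results, can be illustrated with a small sketch. This is only an illustration under simplifying assumptions, not how KernelCI or Collabora's research tooling works: the `Result` record, the `summarize` function, and the rule "pass on one kernel, fail on the next" are all hypothetical, and a test is treated as flaky as soon as the same kernel and platform produce both a pass and a fail.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One test run (hypothetical record, not a KernelCI schema)."""
    test: str
    kernel: str      # kernel revision label, e.g. "v6.6"
    platform: str    # hardware/configuration label
    passed: bool


def summarize(results, kernel_order):
    """Group pass->fail transitions by (test, kernel) and flag flaky pairs.

    Returns (regressions, flaky) where:
      regressions maps (test, first_bad_kernel) -> sorted list of platforms
      flaky is a set of (test, platform) pairs with mixed outcomes
      on the same kernel.
    """
    # (test, platform) -> kernel -> set of observed outcomes
    outcomes = defaultdict(lambda: defaultdict(set))
    for r in results:
        outcomes[(r.test, r.platform)][r.kernel].add(r.passed)

    flaky = set()
    regressions = defaultdict(list)
    for (test, platform), by_kernel in outcomes.items():
        # Mixed outcomes on one kernel: treat as flaky, not as a regression.
        if any(len(seen) == 2 for seen in by_kernel.values()):
            flaky.add((test, platform))
            continue
        previous = None
        for kernel in kernel_order:
            if kernel not in by_kernel:
                continue
            passed = next(iter(by_kernel[kernel]))
            if previous is True and not passed:
                # Same test, same first bad kernel: one candidate regression.
                regressions[(test, kernel)].append(platform)
            previous = passed
    return {key: sorted(p) for key, p in regressions.items()}, flaky


if __name__ == "__main__":
    runs = [
        Result("probe", "v1", "boardA", True),
        Result("probe", "v2", "boardA", False),
        Result("probe", "v1", "boardB", True),
        Result("probe", "v2", "boardB", False),
        Result("net", "v1", "boardA", True),
        Result("net", "v1", "boardA", False),
    ]
    print(summarize(runs, ["v1", "v2"]))
```

In this sketch, the two `probe` failures on different boards collapse into one entry keyed by the test and the first failing kernel, which is the kind of grouping that would let a maintainer see one report instead of two.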
One of the things that we learned in these experiments was that the quality of the tests is not always great. Take device driver probing, for example. The existing test to verify that a device had successfully probed relied on unstable kernel interfaces and would often break between kernel versions, producing flaky results.
To address that issue, we started working together with the kernel community to develop tests that can give us finer-grained insights about the potential location of the failure. We merged a kselftest upstream for device probe on device-tree-based hardware and are also working on a similar upstream kselftest for ACPI-based devices. There is also an effort around USB and PCI. We will have a dedicated blog post about it quite soon. If you would like a head start, take a look at Nicolas Prado and Laura Nao's LPC talk, Detecting failed device probes (video).
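As a rough illustration of what a finer-grained probe check reports, the sketch below compares a list of devices that are expected to probe against the set of devices that actually bound to a driver, and groups the missing ones by bus. It is not the upstream kselftest: the function name `unprobed_devices`, the input format, and the device names are assumptions made up for this example, and a real test would derive both inputs from the running system.

```python
from collections import defaultdict


def unprobed_devices(expected, bound):
    """Return {bus: sorted list of expected devices that did not probe}.

    expected: dict mapping device name -> bus name (hypothetical input,
              e.g. derived from a hardware description)
    bound:    set of device names that currently have a driver bound
    """
    missing = defaultdict(list)
    for device, bus in expected.items():
        if device not in bound:
            missing[bus].append(device)
    return {bus: sorted(devices) for bus, devices in missing.items()}


if __name__ == "__main__":
    # Device and bus names here are invented for the example.
    expected = {
        "serial0": "platform",
        "mmc0": "platform",
        "touchscreen0": "i2c",
    }
    bound = {"serial0"}
    for bus, devices in unprobed_devices(expected, bound).items():
        print(f"{bus}: {', '.join(devices)} failed to probe")
```

Reporting each missing device by name, rather than a single pass or fail for the whole board, is what points a developer toward the likely location of the failure.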
To conclude for today: for Collabora, there is a mindset shift to carry out that goes beyond technical achievement. Kernel Integration is a large and expensive problem that affects the entire Linux-based tech industry. It is not just maintainers and developers who suffer from a lack of proper support to do their jobs more efficiently. An entire industry also faces huge challenges trying to keep up with upstream to deliver stability, security, and new features in its products and services.
In the coming weeks and months Collabora will share more articles in the Kernel Integration series as we make progress and touch new areas of work. Stay tuned!