
Bringing BitNet to ExecuTorch via Vulkan


Vineet Suryan
April 17, 2026

Efficient LLM deployment on edge devices is constrained by model size, memory bandwidth, and power. BitNet offers extremely low-bit inference, while ExecuTorch provides a practical on-device runtime. We explore how to bring these together through the Vulkan backend for portable GPU execution.

[Figure: Quantization architecture for the TQ2 and TQ1 formats]

How far can models be compressed, compared with concurrent quantization work?

BitNet explores a more aggressive compression regime than conventional low-bit quantization by using ternary weights, reducing effective precision to about 1.58 bits per weight (log2 3). In contrast, methods such as GPTQ, LLM-QAT, and OmniQuant generally preserve low perplexity best in the 4–8 bit range, with quality degrading more sharply near 2 bits. This highlights the core trade-off: BitNet accepts a modest increase in perplexity in exchange for a much smaller model and lower memory-bandwidth cost. For on-device inference, this makes ternary-weight models an attractive alternative design point rather than simply a more extreme version of standard quantization.
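The "about 1.58 bits" figure falls out of basic information theory; a quick sketch (a back-of-the-envelope computation, not from the BitNet paper) shows where it comes from and how close a practical 2-bit packing gets to it:

```python
import math

# A ternary weight takes one of 3 values {-1, 0, +1}, so its
# information content is log2(3) ~= 1.58 bits per weight.
bits_per_ternary_weight = math.log2(3)
print(f"{bits_per_ternary_weight:.2f} bits/weight")

# A practical packing that stores one weight per 2-bit code is
# slightly less dense than this information-theoretic bound:
packing_overhead = 2 / bits_per_ternary_weight
print(f"2-bit packing uses {packing_overhead:.2f}x the ideal size")
```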

Why Vulkan?

ExecuTorch Vulkan already supports dynamic shapes, FP16/FP32, and quantized linear layers through an in-tree GLSL compute shader stack, making it a promising substrate for BitNet-inspired low-bit kernels.

Vulkan is a particularly good fit for BitNet because it enables explicit control over packed low-bit data layouts and compute-shader execution, providing the flexibility needed to implement efficient ternary-weight operators on mobile and embedded GPUs.

Core innovation

BitNet replaces standard linear-layer weights with ternary values, w ∈ {−1, 0, +1}, so model weights can be stored much more compactly and inference becomes more bandwidth-efficient. In bitnet.cpp, this is implemented with specialized low-bit kernels such as I2_S, TL1, and TL2, which pack ternary weights into compact codes and reconstruct them efficiently during computation. For example, I2_S maps each ternary weight to a 2-bit code, while TL1 and TL2 use lookup-table-based packing schemes to further reduce memory traffic.
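The 2-bit code idea behind I2_S can be sketched in a few lines of Python. This is an illustrative round-trip only; the exact code assignment and byte order used by bitnet.cpp may differ:

```python
def pack_i2s(weights):
    """Pack ternary weights {-1, 0, +1} into 2-bit codes, 4 per byte.

    Illustrative I2_S-style layout; bitnet.cpp's actual code
    assignment and ordering may differ.
    """
    assert len(weights) % 4 == 0
    out = bytearray()
    for i in range(0, len(weights), 4):
        byte = 0
        for j, w in enumerate(weights[i:i + 4]):
            code = w + 1             # map {-1, 0, +1} -> {0, 1, 2}
            byte |= code << (2 * j)  # little-endian within the byte
        out.append(byte)
    return bytes(out)

def unpack_i2s(packed, n):
    """Inverse of pack_i2s: recover n ternary weights."""
    return [((b >> (2 * j)) & 0b11) - 1
            for b in packed for j in range(4)][:n]

ws = [-1, 0, 1, 1, 0, -1, 0, 1]
assert unpack_i2s(pack_i2s(ws), len(ws)) == ws  # round-trips
```

Four weights per byte gives the expected 4x density over an int8 layout; TL1/TL2 go further by trading the per-weight decode for table lookups.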

[Figure: Memory consumption of LLaMA-7B, LLaMA-13B, and LLaMA-30B (FP16 and INT4) on Tesla A100, RTX 3090, and GTX 1080 Ti]


In our ExecuTorch implementation, these ternary weights are packed into a TQ2 format compatible with block_tq2_0: each block stores 256 weights in 66 bytes using packed 2-bit codes plus an FP16 block scale. During inference, a custom ExecuTorch kernel performs on-the-fly dequantization: it unpacks the ternary weights and applies the scaled dot product directly inside the linear operator. The exported model also preserves the BitNet layer structure, including the extra RMSNorms in the attention and feed-forward blocks.
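The 66-byte figure can be made concrete with a hedged Python sketch of one block: 256 weights x 2 bits = 64 bytes of codes, plus a 2-byte FP16 scale. The code mapping and ordering here are simplified; the real block_tq2_0 layout may interleave codes differently:

```python
import struct

BLOCK = 256  # weights per block

def pack_tq2_block(weights, scale):
    """Pack one block of 256 ternary weights plus an FP16 scale.

    256 x 2 bits = 64 bytes of codes + 2 bytes of FP16 scale
    = 66 bytes per block. Simplified layout, for illustration.
    """
    assert len(weights) == BLOCK
    codes = bytearray(BLOCK // 4)
    for i, w in enumerate(weights):
        codes[i // 4] |= (w + 1) << (2 * (i % 4))
    return bytes(codes) + struct.pack("<e", scale)  # 'e' = FP16

def dequant_tq2_block(block):
    """On-the-fly dequantization: unpack codes and apply the scale."""
    codes, (scale,) = block[:64], struct.unpack("<e", block[64:])
    return [scale * (((b >> (2 * j)) & 0b11) - 1)
            for b in codes for j in range(4)]

ws = [-1, 0, 1, 0] * 64
blk = pack_tq2_block(ws, 0.5)
assert len(blk) == 66
assert dequant_tq2_block(blk) == [0.5 * w for w in ws]
```

A GPU kernel follows the same two steps per block, decoding codes and multiplying by the block scale, but fused into the matrix-vector product rather than materializing the dequantized weights.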

To make autoregressive inference work correctly, the implementation also adds custom LLM runtime pieces: a persistent KV-cache update op and custom SDPA handling. The standard Python KV-cache path becomes non-in-place during export, so a C++ cache update operator is used instead to preserve KV state across forward calls. Together, these components provide a complete ExecuTorch path for BitNet-style low-bit LLM inference.
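Why the in-place cache update matters can be shown with a minimal Python sketch; the class and method names here are hypothetical, not the actual ExecuTorch C++ operator:

```python
# Minimal sketch of a persistent, in-place KV-cache update: write the
# new key/value at the current position in a preallocated buffer,
# instead of concatenating (which allocates a new tensor each step and
# loses state across forward calls after export).
# Names (KVCache, update) are illustrative, not the ExecuTorch API.

class KVCache:
    def __init__(self, max_seq_len, head_dim):
        # Preallocated buffers persist across forward calls.
        self.k = [[0.0] * head_dim for _ in range(max_seq_len)]
        self.v = [[0.0] * head_dim for _ in range(max_seq_len)]
        self.pos = 0

    def update(self, k_new, v_new):
        """Write in place at the current position; return valid prefix."""
        self.k[self.pos][:] = k_new  # mutate in place, no reallocation
        self.v[self.pos][:] = v_new
        self.pos += 1
        return self.k[:self.pos], self.v[:self.pos]

cache = KVCache(max_seq_len=8, head_dim=4)
k_valid, v_valid = cache.update([1.0] * 4, [2.0] * 4)
assert len(k_valid) == 1 and cache.pos == 1
```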

Model size   Metric / task family      FP16    1.58-bit BitDistill   Gap vs FP16
0.6B         Classification accuracy   88.0    88.2                  +0.2
1.7B         Classification accuracy   89.6    89.5                  -0.1
4B           Classification accuracy   91.5    91.4                  -0.1
7B           Classification accuracy   92.4    92.2                  -0.2
13B          Classification accuracy   93.5    93.2                  -0.3
30B          Classification accuracy   95.1    94.6                  -0.5

Getting started

# Quantize and export BitNet for ExecuTorch
cd examples/models/bitnet
python3 quantize_bitnet.py
python3 build_model.py
python3 export_model.py

# Build ExecuTorch with LLM support
cmake -DCMAKE_INSTALL_PREFIX=cmake-out \
    -DCMAKE_BUILD_TYPE=Release \
    -DEXECUTORCH_BUILD_EXTENSION_LLM=ON \
    -DEXECUTORCH_BUILD_EXTENSION_LLM_RUNNER=ON \
    -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_LLM=ON \
    -Bcmake-out .
cmake --build cmake-out -j$(nproc)
cd cmake-out && make install && cd ..

# Build runner and run inference
cd examples/models/llama
cmake -DCMAKE_INSTALL_PREFIX=$(pwd)/../../../cmake-out \
    -DCMAKE_BUILD_TYPE=Release \
    -Bcmake-out-llama .
cmake --build cmake-out-llama --target llama_main -j$(nproc)

./cmake-out-llama/llama_main \
    --model_path=../bitnet/bitnet_tq2_llm.pte \
    --tokenizer_path=../bitnet/bitnet_tokenizer.model \
    --prompt="The capital of France is" \
    --num_bos=1

What is the potential impact of this work?

BitNet reduces raw weight storage by about 10× relative to FP16, enabling much smaller model footprints and lower memory bandwidth during inference.

Model        FP16 (GB)   BitNet b1.58 (GB)   FP16 / BitNet
LLaMA-7B     14.0        1.38                10.1×
LLaMA-13B    26.0        2.57                10.1×
LLaMA-30B    60.0        5.93                10.1×
LLaMA-65B    130.0       12.85               10.1×
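The ~10× ratio can be sanity-checked with a quick back-of-the-envelope computation: FP16 stores 16 bits per weight, while a ternary weight carries log2(3) ≈ 1.58 bits. The parameter count below is nominal and for illustration only:

```python
import math

params = 7e9  # LLaMA-7B, approximately

# FP16: 16 bits per weight; ternary: log2(3) ~= 1.58 bits per weight.
fp16_gb = params * 16 / 8 / 1e9
ternary_gb = params * math.log2(3) / 8 / 1e9
print(f"{fp16_gb:.1f} GB vs {ternary_gb:.2f} GB "
      f"-> {fp16_gb / ternary_gb:.1f}x smaller")
```

This reproduces the ~14 GB and ~1.38 GB figures in the table; real checkpoints deviate slightly because scales, embeddings, and norms are stored at higher precision.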

Looking ahead

This work is an initial step toward supporting BitNet-style low-bit datatypes in ExecuTorch through Vulkan. While we demonstrate a working path for ternary-weight inference, much remains to be done, including improving shader efficiency, refining packed-kernel implementations, expanding operator coverage, and supporting a wider range of BitNet-family models. Join us in extending BitNet support in ExecuTorch via Vulkan!
