Vineet Suryan
April 17, 2026
Efficient LLM deployment on edge devices is constrained by model size, memory bandwidth, and power. BitNet offers extremely low-bit inference, while ExecuTorch provides a practical on-device runtime. We explore how to bring these together through the Vulkan backend for portable GPU execution.

BitNet explores a more aggressive compression regime than conventional low-bit quantization by using ternary weights, reducing model precision to roughly 1.58 bits per weight. In contrast, methods such as GPTQ, LLM-QAT, and OmniQuant generally preserve low perplexity best in the 4–8 bit range, while quality degrades more sharply near 2 bits. This highlights the main trade-off: BitNet accepts some increase in perplexity in exchange for a much smaller model size and lower memory-bandwidth cost. For on-device inference, this makes ternary-weight models an attractive alternative design point rather than simply a more extreme version of standard quantization.
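The "1.58-bit" figure often quoted for ternary models is simply the information content of a three-valued symbol. A quick check of the arithmetic:

```python
import math

# A ternary weight carries log2(3) bits of information, hence "1.58-bit".
bits_per_weight = math.log2(3)
print(f"{bits_per_weight:.3f} bits per ternary weight")

# A practical 2-bit encoding (as used by I2_S-style kernels) spends a bit more:
overhead = 2 - bits_per_weight
print(f"{overhead:.3f} bits of packing overhead per weight")
```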
ExecuTorch Vulkan already supports dynamic shapes, FP16/FP32, and quantized linear layers through an in-tree GLSL compute shader stack, making it a promising substrate for BitNet-inspired low-bit kernels.
Vulkan is a particularly good fit for BitNet because it enables explicit control over packed low-bit data layouts and compute-shader execution, providing the flexibility needed to implement efficient ternary-weight operators on mobile and embedded GPUs.
BitNet replaces standard linear-layer weights with ternary values, w ∈ {−1, 0, +1}, so model weights can be stored much more compactly and inference becomes more bandwidth-efficient. In bitnet.cpp, this is implemented with specialized low-bit kernels such as I2_S, TL1, and TL2, which pack ternary weights into compact codes and reconstruct them efficiently during computation. For example, I2_S maps each ternary weight to a 2-bit code, while TL1 and TL2 use lookup-table-based packing schemes to further reduce memory traffic.
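As an illustration of the 2-bit packing idea behind I2_S, here is a minimal NumPy sketch; the actual bitnet.cpp code ordering and memory layout may differ:

```python
import numpy as np

def pack_ternary_2bit(w):
    """Pack ternary weights {-1, 0, +1} into 2-bit codes, four per byte.

    Illustrative I2_S-style layout; not the exact bitnet.cpp bit order.
    """
    codes = (w + 1).astype(np.uint8)          # map {-1, 0, +1} -> {0, 1, 2}
    assert codes.size % 4 == 0
    codes = codes.reshape(-1, 4)
    return (codes[:, 0]
            | (codes[:, 1] << 2)
            | (codes[:, 2] << 4)
            | (codes[:, 3] << 6)).astype(np.uint8)

def unpack_ternary_2bit(packed):
    """Inverse: recover the ternary weights from the packed bytes."""
    out = np.empty(packed.size * 4, dtype=np.int8)
    for i in range(4):
        out[i::4] = ((packed >> (2 * i)) & 0b11).astype(np.int8) - 1
    return out

w = np.array([-1, 0, 1, 1, 0, -1, -1, 0], dtype=np.int8)
assert np.array_equal(unpack_ternary_2bit(pack_ternary_2bit(w)), w)
```

Packing four ternary codes per byte is what brings storage from 16 bits down to 2 bits per weight; TL1 and TL2 go further by addressing groups of weights through lookup tables.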

In our ExecuTorch implementation, these ternary weights are packed into a TQ2 format compatible with block_tq2_0: each block stores 256 weights in 66 bytes, using packed 2-bit codes plus an FP16 block scale. During inference, a custom ExecuTorch kernel dequantizes on the fly, unpacking the ternary weights and applying the block scale directly inside the linear operator. The exported model also preserves the BitNet layer structure, including the extra RMSNorms in the attention and feed-forward blocks.
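The 66-byte block size follows directly from the layout: 256 two-bit codes plus one FP16 scale. The accounting, spelled out:

```python
WEIGHTS_PER_BLOCK = 256

code_bytes = WEIGHTS_PER_BLOCK * 2 // 8    # 2-bit codes -> 64 bytes
scale_bytes = 2                            # one FP16 block scale
block_bytes = code_bytes + scale_bytes     # 66 bytes, matching block_tq2_0

bits_per_weight = block_bytes * 8 / WEIGHTS_PER_BLOCK
print(block_bytes, bits_per_weight)        # 66 bytes, ~2.06 effective bits/weight
```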
To make autoregressive inference work correctly, the implementation also adds custom LLM runtime pieces: a persistent KV-cache update op and custom SDPA handling. The standard Python KV-cache path becomes non-in-place during export, so a C++ cache update operator is used instead to preserve KV state across forward calls. Together, these components provide a complete ExecuTorch path for BitNet-style low-bit LLM inference.
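The in-place semantics that the standard Python cache path loses during export can be pictured with a small sketch. This is illustrative Python only, with hypothetical names, not the ExecuTorch API:

```python
import numpy as np

class KVCache:
    """Sketch of a persistent KV cache: the custom C++ update op writes new
    key/value slices into a preallocated buffer in place, so cached state
    survives across forward calls of the exported program."""

    def __init__(self, max_seq_len, n_heads, head_dim):
        self.k = np.zeros((max_seq_len, n_heads, head_dim), dtype=np.float16)
        self.v = np.zeros_like(self.k)
        self.pos = 0  # number of tokens cached so far

    def update(self, k_new, v_new):
        """Append the new tokens' keys/values in place, return valid views."""
        n = k_new.shape[0]
        self.k[self.pos:self.pos + n] = k_new
        self.v[self.pos:self.pos + n] = v_new
        self.pos += n
        return self.k[:self.pos], self.v[:self.pos]
```

Because `update` mutates preallocated buffers instead of returning fresh tensors, the same mechanism implemented as a C++ operator keeps KV state intact across invocations, which the exported graph alone cannot guarantee.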
| Model size | Metric/Task family | FP16 | 1.58-bit BitDistill | Gap vs FP16 |
|---|---|---|---|---|
| 0.6B | Classification accuracy | 88.0 | 88.2 | +0.2 |
| 1.7B | Classification accuracy | 89.6 | 89.5 | -0.1 |
| 4B | Classification accuracy | 91.5 | 91.4 | -0.1 |
| 7B | Classification accuracy | 92.4 | 92.2 | -0.2 |
| 13B | Classification accuracy | 93.5 | 93.2 | -0.3 |
| 30B | Classification accuracy | 95.1 | 94.6 | -0.5 |
```shell
# Quantize and export BitNet for ExecuTorch
cd examples/models/bitnet
python3 quantize_bitnet.py
python3 build_model.py
python3 export_model.py
```
```shell
# Build ExecuTorch with LLM support
cmake -DCMAKE_INSTALL_PREFIX=cmake-out \
      -DCMAKE_BUILD_TYPE=Release \
      -DEXECUTORCH_BUILD_EXTENSION_LLM=ON \
      -DEXECUTORCH_BUILD_EXTENSION_LLM_RUNNER=ON \
      -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
      -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
      -DEXECUTORCH_BUILD_KERNELS_LLM=ON \
      -Bcmake-out .
cmake --build cmake-out -j$(nproc)
cd cmake-out && make install && cd ..
```
```shell
# Build runner and run inference
cd examples/models/llama
cmake -DCMAKE_INSTALL_PREFIX=$(pwd)/../../../cmake-out \
      -DCMAKE_BUILD_TYPE=Release \
      -Bcmake-out-llama .
cmake --build cmake-out-llama --target llama_main -j$(nproc)
./cmake-out-llama/llama_main \
      --model_path=../bitnet/bitnet_tq2_llm.pte \
      --tokenizer_path=../bitnet/bitnet_tokenizer.model \
      --prompt="The capital of France is" \
      --num_bos=1
```
BitNet reduces raw weight storage by about 10× relative to FP16, enabling much smaller model footprints and lower memory bandwidth during inference.
| Model | FP16 (GB) | BitNet b1.58 (GB) | FP16/BitNet |
|---|---|---|---|
| LLaMA-7B | 14.0 | 1.38 | 10.1× |
| LLaMA-13B | 26.0 | 2.57 | 10.1× |
| LLaMA-30B | 60.0 | 5.93 | 10.1× |
| LLaMA-65B | 130.0 | 12.85 | 10.1× |
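The ratios in the table follow directly from 16 bits per FP16 weight versus roughly 1.58 bits per ternary weight. A quick sanity check, using nominal parameter counts (an assumption on our part):

```python
BITS_FP16, BITS_TERNARY = 16.0, 1.58

# Nominal parameter counts for the models in the table above.
params = {"LLaMA-7B": 7e9, "LLaMA-13B": 13e9, "LLaMA-30B": 30e9, "LLaMA-65B": 65e9}

for name, n in params.items():
    fp16_gb = n * BITS_FP16 / 8 / 1e9
    ternary_gb = n * BITS_TERNARY / 8 / 1e9
    print(f"{name}: {fp16_gb:.1f} GB -> {ternary_gb:.2f} GB "
          f"({fp16_gb / ternary_gb:.1f}x)")
```

The compression ratio is constant at 16 / 1.58 ≈ 10.1×, matching the table.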
This work is an initial step toward supporting BitNet-style low-bit datatypes in ExecuTorch through Vulkan. While we demonstrate a working path for ternary-weight inference, much remains to be done, including improving shader efficiency, refining packed-kernel implementations, expanding operator coverage, and supporting a wider range of BitNet-family models. Join us in extending BitNet support in ExecuTorch via Vulkan!