Vineet Suryan
April 17, 2026
Efficient LLM deployment on edge devices is constrained by model size, memory bandwidth, and power. BitNet offers extremely low-bit inference, while ExecuTorch provides a practical on-device runtime. We explore how to bring these together through the Vulkan backend for portable GPU execution.

BitNet explores a more aggressive compression regime than conventional low-bit quantization by using ternary weights, reducing model precision to roughly 1.58 bits per weight. In contrast, methods such as GPTQ, LLM-QAT, and OmniQuant generally preserve low perplexity best in the 4–8 bit range, while quality degrades more sharply near 2 bits. This highlights the main trade-off: BitNet accepts some increase in perplexity in exchange for a much smaller model size and lower memory-bandwidth cost. For on-device inference, this makes ternary-weight models an attractive alternative design point rather than simply a more extreme version of standard quantization.
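The "1.58-bit" figure often quoted for ternary models is simply the information content of a three-valued symbol. A quick check of the arithmetic:

```python
import math

# A ternary weight carries log2(3) bits of information, hence "1.58-bit".
bits_per_weight = math.log2(3)
print(f"{bits_per_weight:.3f} bits per ternary weight")

# A practical 2-bit encoding (as used by I2_S-style kernels) spends a bit more:
overhead = 2 - bits_per_weight
print(f"{overhead:.3f} bits of packing overhead per weight")
```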
ExecuTorch Vulkan already supports dynamic shapes, FP16/FP32, and quantized linear layers through an in-tree GLSL compute shader stack, making it a promising substrate for BitNet-inspired low-bit kernels.
Vulkan is a particularly good fit for BitNet because it enables explicit control over packed low-bit data layouts and compute-shader execution, providing the flexibility needed to implement efficient ternary-weight operators on mobile and embedded GPUs.
BitNet replaces standard linear-layer weights with ternary values, w ∈ {−1, 0, +1}, so model weights can be stored much more compactly and inference becomes more bandwidth-efficient. In bitnet.cpp, this is implemented with specialized low-bit kernels such as I2_S, TL1, and TL2, which pack ternary weights into compact codes and reconstruct them efficiently during computation. For example, I2_S maps each ternary weight to a 2-bit code, while TL1 and TL2 use lookup-table-based packing schemes to further reduce memory traffic.
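As an illustration of the 2-bit packing idea behind I2_S, here is a minimal NumPy sketch; the actual bitnet.cpp code ordering and memory layout may differ:

```python
import numpy as np

def pack_ternary_2bit(w):
    """Pack ternary weights {-1, 0, +1} into 2-bit codes, four per byte.

    Illustrative I2_S-style layout; not the exact bitnet.cpp bit order.
    """
    codes = (w + 1).astype(np.uint8)          # map {-1, 0, +1} -> {0, 1, 2}
    assert codes.size % 4 == 0
    codes = codes.reshape(-1, 4)
    return (codes[:, 0]
            | (codes[:, 1] << 2)
            | (codes[:, 2] << 4)
            | (codes[:, 3] << 6)).astype(np.uint8)

def unpack_ternary_2bit(packed):
    """Inverse: recover the ternary weights from the packed bytes."""
    out = np.empty(packed.size * 4, dtype=np.int8)
    for i in range(4):
        out[i::4] = ((packed >> (2 * i)) & 0b11).astype(np.int8) - 1
    return out

w = np.array([-1, 0, 1, 1, 0, -1, -1, 0], dtype=np.int8)
assert np.array_equal(unpack_ternary_2bit(pack_ternary_2bit(w)), w)
```

Packing four ternary codes per byte is what brings storage from 16 bits down to 2 bits per weight; TL1 and TL2 go further by addressing groups of weights through lookup tables.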

In our ExecuTorch implementation, these ternary weights are packed into a TQ2 format compatible with block_tq2_0: each block stores 256 weights in 66 bytes, using packed 2-bit codes plus an FP16 block scale. During inference, a custom ExecuTorch kernel dequantizes on the fly, unpacking the ternary weights and applying the block scale directly inside the linear operator. The exported model also preserves the BitNet layer structure, including the extra RMSNorms in the attention and feed-forward blocks.
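The 66-byte block size follows directly from the layout: 256 two-bit codes plus one FP16 scale. The accounting, spelled out:

```python
WEIGHTS_PER_BLOCK = 256

code_bytes = WEIGHTS_PER_BLOCK * 2 // 8    # 2-bit codes -> 64 bytes
scale_bytes = 2                            # one FP16 block scale
block_bytes = code_bytes + scale_bytes     # 66 bytes, matching block_tq2_0

bits_per_weight = block_bytes * 8 / WEIGHTS_PER_BLOCK
print(block_bytes, bits_per_weight)        # 66 bytes, ~2.06 effective bits/weight
```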
To make autoregressive inference work correctly, the implementation also adds custom LLM runtime pieces: a persistent KV-cache update op and custom SDPA handling. The standard Python KV-cache path becomes non-in-place during export, so a C++ cache update operator is used instead to preserve KV state across forward calls. Together, these components provide a complete ExecuTorch path for BitNet-style low-bit LLM inference.
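The in-place semantics that the standard Python cache path loses during export can be pictured with a small sketch. This is illustrative Python only, with hypothetical names, not the ExecuTorch API:

```python
import numpy as np

class KVCache:
    """Sketch of a persistent KV cache: the custom C++ update op writes new
    key/value slices into a preallocated buffer in place, so cached state
    survives across forward calls of the exported program."""

    def __init__(self, max_seq_len, n_heads, head_dim):
        self.k = np.zeros((max_seq_len, n_heads, head_dim), dtype=np.float16)
        self.v = np.zeros_like(self.k)
        self.pos = 0  # number of tokens cached so far

    def update(self, k_new, v_new):
        """Append the new tokens' keys/values in place, return valid views."""
        n = k_new.shape[0]
        self.k[self.pos:self.pos + n] = k_new
        self.v[self.pos:self.pos + n] = v_new
        self.pos += n
        return self.k[:self.pos], self.v[:self.pos]
```

Because `update` mutates preallocated buffers instead of returning fresh tensors, the same mechanism implemented as a C++ operator keeps KV state intact across invocations, which the exported graph alone cannot guarantee.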
| Model size | Metric/Task family | FP16 | 1.58-bit BitDistill | Gap vs FP16 |
|---|---|---|---|---|
| 0.6B | Classification accuracy | 88.0 | 88.2 | +0.2 |
| 1.7B | Classification accuracy | 89.6 | 89.5 | -0.1 |
| 4B | Classification accuracy | 91.5 | 91.4 | -0.1 |
| 7B | Classification accuracy | 92.4 | 92.2 | -0.2 |
| 13B | Classification accuracy | 93.5 | 93.2 | -0.3 |
| 30B | Classification accuracy | 95.1 | 94.6 | -0.5 |
```shell
# Quantize and export BitNet for ExecuTorch
cd examples/models/bitnet
python3 quantize_bitnet.py
python3 build_model.py
python3 export_model.py
```
```shell
# Build ExecuTorch with LLM support
cmake -DCMAKE_INSTALL_PREFIX=cmake-out \
      -DCMAKE_BUILD_TYPE=Release \
      -DEXECUTORCH_BUILD_EXTENSION_LLM=ON \
      -DEXECUTORCH_BUILD_EXTENSION_LLM_RUNNER=ON \
      -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
      -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
      -DEXECUTORCH_BUILD_KERNELS_LLM=ON \
      -Bcmake-out .
cmake --build cmake-out -j$(nproc)
cd cmake-out && make install && cd ..
```
```shell
# Build runner and run inference
cd examples/models/llama
cmake -DCMAKE_INSTALL_PREFIX=$(pwd)/../../../cmake-out \
      -DCMAKE_BUILD_TYPE=Release \
      -Bcmake-out-llama .
cmake --build cmake-out-llama --target llama_main -j$(nproc)
./cmake-out-llama/llama_main \
      --model_path=../bitnet/bitnet_tq2_llm.pte \
      --tokenizer_path=../bitnet/bitnet_tokenizer.model \
      --prompt="The capital of France is" \
      --num_bos=1
```
BitNet reduces raw weight storage by about 10× relative to FP16, enabling much smaller model footprints and lower memory bandwidth during inference.
| Model | FP16 (GB) | BitNet b1.58 (GB) | FP16/BitNet |
|---|---|---|---|
| LLaMA-7B | 14.0 | 1.38 | 10.1× |
| LLaMA-13B | 26.0 | 2.57 | 10.1× |
| LLaMA-30B | 60.0 | 5.93 | 10.1× |
| LLaMA-65B | 130.0 | 12.85 | 10.1× |
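The ratios in the table follow directly from 16 bits per FP16 weight versus roughly 1.58 bits per ternary weight. A quick sanity check, using nominal parameter counts (an assumption on our part):

```python
BITS_FP16, BITS_TERNARY = 16.0, 1.58

# Nominal parameter counts for the models in the table above.
params = {"LLaMA-7B": 7e9, "LLaMA-13B": 13e9, "LLaMA-30B": 30e9, "LLaMA-65B": 65e9}

for name, n in params.items():
    fp16_gb = n * BITS_FP16 / 8 / 1e9
    ternary_gb = n * BITS_TERNARY / 8 / 1e9
    print(f"{name}: {fp16_gb:.1f} GB -> {ternary_gb:.2f} GB "
          f"({fp16_gb / ternary_gb:.1f}x)")
```

The compression ratio is constant at 16 / 1.58 ≈ 10.1×, matching the table.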
This work is an initial step toward supporting BitNet-style low-bit datatypes in ExecuTorch through Vulkan. While we demonstrate a working path for ternary-weight inference, much remains to be done, including improving shader efficiency, refining packed-kernel implementations, expanding operator coverage, and supporting a wider range of BitNet-family models. Join us in extending BitNet support in ExecuTorch via Vulkan!