Vineet Suryan
April 17, 2026
Efficient LLM deployment on edge devices is constrained by model size, memory bandwidth, and power. BitNet offers extremely low-bit inference, while ExecuTorch provides a practical on-device runtime. We explore how to bring these together through the Vulkan backend for portable GPU execution.

BitNet explores a more aggressive compression regime than conventional low-bit quantization by using ternary weights, which carry about 1.58 bits of information each (log2 3 ≈ 1.58). In contrast, methods such as GPTQ, LLM-QAT, and OmniQuant generally preserve low perplexity best in the 4–8 bit range, while quality degrades more sharply near 2 bits. This highlights the main trade-off: BitNet accepts some increase in perplexity in exchange for a much smaller model size and lower memory bandwidth cost. For on-device inference, this makes ternary-weight models an attractive alternative design point rather than simply a more extreme version of standard quantization.
ExecuTorch Vulkan already supports dynamic shapes, FP16/FP32, and quantized linear layers through an in-tree GLSL compute shader stack, making it a promising substrate for BitNet-inspired low-bit kernels.
Vulkan is a particularly good fit for BitNet because it enables explicit control over packed low-bit data layouts and compute-shader execution, providing the flexibility needed to implement efficient ternary-weight operators on mobile and embedded GPUs.
BitNet replaces standard linear-layer weights with ternary values, w ∈ {−1, 0, +1}, so model weights can be stored much more compactly and inference becomes more bandwidth-efficient. In bitnet.cpp, this is implemented with specialized low-bit kernels such as I2_S, TL1, and TL2, which pack ternary weights into compact codes and reconstruct them efficiently during computation. For example, I2_S maps each ternary weight to a 2-bit code, while TL1 and TL2 use lookup-table-based packing schemes to further reduce memory traffic.
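
The 2-bit idea can be sketched in a few lines of Python. This is a minimal illustration, not the actual I2_S layout in bitnet.cpp: the code mapping (−1 → 0, 0 → 1, +1 → 2), the bit order (least significant bits first), and the function names `pack_ternary` and `unpack_ternary` are all assumptions made for the example.

```python
def pack_ternary(weights):
    """Pack ternary weights (-1, 0, +1) into bytes, four 2-bit codes per byte.

    Assumed mapping for illustration: code = weight + 1, so codes are 0, 1, 2.
    """
    if len(weights) % 4 != 0:
        raise ValueError("number of weights must be a multiple of 4")
    packed = bytearray()
    for i in range(0, len(weights), 4):
        byte = 0
        for j, w in enumerate(weights[i:i + 4]):
            if w not in (-1, 0, 1):
                raise ValueError(f"not a ternary weight: {w}")
            byte |= (w + 1) << (2 * j)  # weight j occupies bits 2j and 2j+1
        packed.append(byte)
    return bytes(packed)


def unpack_ternary(packed):
    """Inverse of pack_ternary: recover the list of ternary weights."""
    weights = []
    for byte in packed:
        for j in range(4):
            weights.append(((byte >> (2 * j)) & 0b11) - 1)
    return weights


weights = [-1, 0, 1, 1, 0, 0, -1, 1]
packed = pack_ternary(weights)
print(len(weights), "weights ->", len(packed), "bytes")
```

Eight weights fit in two bytes, compared with sixteen bytes in FP16. One of the four possible 2-bit codes is never used, which is the gap that lookup-table schemes try to close.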

In our ExecuTorch implementation, these ternary weights are packed into a TQ2 format compatible with block_tq2_0: each block stores 256 weights in 66 bytes, made up of 64 bytes of packed 2-bit codes plus a 2-byte FP16 block scale. During inference, a custom ExecuTorch kernel dequantizes on the fly: it unpacks the ternary weights and applies the scaled dot product directly inside the linear operator. The exported model also preserves the BitNet layer structure, including the extra RMSNorms in attention and feed-forward blocks.
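
To make the block arithmetic concrete, here is a minimal Python sketch of a 66-byte block and a dot product that dequantizes as it goes. It is not the ExecuTorch kernel: the helper names `pack_block` and `block_dot` are hypothetical, and the sequential ordering of codes within the 64 bytes is a simplifying assumption that may differ from the real block_tq2_0 layout.

```python
import struct

BLOCK_WEIGHTS = 256
CODE_BYTES = BLOCK_WEIGHTS // 4   # 64 bytes of 2-bit codes
BLOCK_BYTES = CODE_BYTES + 2      # plus a 2-byte FP16 scale = 66 bytes


def pack_block(ternary, scale):
    """Pack 256 ternary weights and one FP16 scale into a 66-byte block.

    Assumed layout for illustration: codes first (code = weight + 1,
    four per byte, low bits first), then the little-endian FP16 scale.
    """
    if len(ternary) != BLOCK_WEIGHTS:
        raise ValueError("a block holds exactly 256 weights")
    codes = bytearray(CODE_BYTES)
    for i, w in enumerate(ternary):
        if w not in (-1, 0, 1):
            raise ValueError(f"not a ternary weight: {w}")
        codes[i // 4] |= (w + 1) << (2 * (i % 4))
    return bytes(codes) + struct.pack("<e", scale)


def block_dot(block, activations):
    """Dot product of one packed block with 256 activations.

    Weights are decoded on the fly; because they are ternary, the inner
    loop only adds or subtracts, and the scale is applied once at the end.
    """
    if len(block) != BLOCK_BYTES or len(activations) != BLOCK_WEIGHTS:
        raise ValueError("unexpected block or activation size")
    (scale,) = struct.unpack_from("<e", block, CODE_BYTES)
    acc = 0.0
    for i, x in enumerate(activations):
        w = ((block[i // 4] >> (2 * (i % 4))) & 0b11) - 1
        if w == 1:
            acc += x
        elif w == -1:
            acc -= x
    return scale * acc


ternary = [(i % 3) - 1 for i in range(BLOCK_WEIGHTS)]
activations = [float(i % 5) for i in range(BLOCK_WEIGHTS)]
block = pack_block(ternary, 0.5)
print(len(block), "bytes per block,", len(block) * 8 / BLOCK_WEIGHTS, "bits per weight")
print("dot =", block_dot(block, activations))
```

At 66 bytes per 256 weights, the format costs 2.0625 bits per weight, slightly above the 1.58-bit information content of a ternary value, in exchange for simple shift-and-mask decoding.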
To make autoregressive inference work correctly, the implementation also adds custom LLM runtime pieces: a persistent KV-cache update op and custom SDPA handling. The standard Python KV-cache path becomes non-in-place during export, so a C++ cache update operator is used instead to preserve KV state across forward calls. Together, these components provide a complete ExecuTorch path for BitNet-style low-bit LLM inference.
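
The requirement on the cache can be illustrated with a small Python sketch. This is only a model of the behaviour, not the C++ operator: the `KVCache` class, its `update` method, and the list-of-rows storage are hypothetical names and choices made for the example.

```python
class KVCache:
    """Preallocated key/value buffers that are mutated in place.

    Illustrative only: a real cache would hold tensors shaped by batch,
    heads, sequence length, and head dimension.
    """

    def __init__(self, max_seq_len, head_dim):
        self.max_seq_len = max_seq_len
        self.head_dim = head_dim
        self.k = [[0.0] * head_dim for _ in range(max_seq_len)]
        self.v = [[0.0] * head_dim for _ in range(max_seq_len)]

    def update(self, input_pos, k_new, v_new):
        """Write new rows starting at input_pos and return the valid length.

        The buffers are overwritten rather than replaced, so the state
        survives from one forward call to the next.
        """
        end = input_pos + len(k_new)
        if len(k_new) != len(v_new) or end > self.max_seq_len:
            raise ValueError("update does not fit in the cache")
        for offset, (k_row, v_row) in enumerate(zip(k_new, v_new)):
            if len(k_row) != self.head_dim or len(v_row) != self.head_dim:
                raise ValueError("row has the wrong head dimension")
            self.k[input_pos + offset][:] = k_row
            self.v[input_pos + offset][:] = v_row
        return end


cache = KVCache(max_seq_len=4, head_dim=2)
valid = cache.update(0, [[1.0, 2.0]], [[3.0, 4.0]])   # first decoding step
valid = cache.update(valid, [[5.0, 6.0]], [[7.0, 8.0]])  # second step sees the first
print(valid, cache.k[:valid])
```

A functional update that returns a fresh buffer on every call would lose this property unless the caller threads the new buffer back in, which is the situation the in-place operator avoids.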
| Model size | Metric/Task family | FP16 | 1.58-bit BitDistill | Gap vs FP16 |
|---|---|---|---|---|
| 0.6B | Classification accuracy | 88.0 | 88.2 | +0.2 |
| 1.7B | Classification accuracy | 89.6 | 89.5 | -0.1 |
| 4B | Classification accuracy | 91.5 | 91.4 | -0.1 |
| 7B | Classification accuracy | 92.4 | 92.2 | -0.2 |
| 13B | Classification accuracy | 93.5 | 93.2 | -0.3 |
| 30B | Classification accuracy | 95.1 | 94.6 | -0.5 |

    # Quantize and export BitNet for ExecuTorch
    cd examples/models/bitnet
    python3 quantize_bitnet.py
    python3 build_model.py
    python3 export_model.py

    # Build ExecuTorch with LLM support
    cmake -DCMAKE_INSTALL_PREFIX=cmake-out \
      -DCMAKE_BUILD_TYPE=Release \
      -DEXECUTORCH_BUILD_EXTENSION_LLM=ON \
      -DEXECUTORCH_BUILD_EXTENSION_LLM_RUNNER=ON \
      -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
      -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
      -DEXECUTORCH_BUILD_KERNELS_LLM=ON \
      -Bcmake-out .
    cmake --build cmake-out -j$(nproc)
    cd cmake-out && make install && cd ..

    # Build runner and run inference
    cd examples/models/llama
    cmake -DCMAKE_INSTALL_PREFIX=$(pwd)/../../../cmake-out \
      -DCMAKE_BUILD_TYPE=Release \
      -Bcmake-out-llama .
    cmake --build cmake-out-llama --target llama_main -j$(nproc)
    ./cmake-out-llama/llama_main \
      --model_path=../bitnet/bitnet_tq2_llm.pte \
      --tokenizer_path=../bitnet/bitnet_tokenizer.model \
      --prompt="The capital of France is" \
      --num_bos=1

BitNet reduces raw weight storage by about 10× relative to FP16 (16 bits versus roughly 1.58 bits per weight), enabling much smaller model footprints and lower memory bandwidth during inference.
| Model | FP16 (GB) | BitNet b1.58 (GB) | FP16/BitNet |
|---|---|---|---|
| LLaMA-7B | 14.0 | 1.38 | 10.1× |
| LLaMA-13B | 26.0 | 2.57 | 10.1× |
| LLaMA-30B | 60.0 | 5.93 | 10.1× |
| LLaMA-65B | 130.0 | 12.85 | 10.1× |
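
The ratio in the table is a bits-per-weight calculation, which the following Python sketch reproduces. It is a back-of-the-envelope estimate under stated assumptions: nominal parameter counts (for example 7e9 for a 7B model), GB taken as 10^9 bytes, and every weight stored at the same precision. The `weight_gb` helper is a hypothetical name, and the TQ2 row is our own addition based on the 66-byte blocks of 256 weights described above.

```python
import math


def weight_gb(num_params, bits_per_weight):
    """Raw weight storage in GB (10^9 bytes) at a uniform bit width."""
    return num_params * bits_per_weight / 8 / 1e9


FP16_BITS = 16
NOMINAL_TERNARY_BITS = 1.58       # the "b1.58" figure, close to log2(3)
TQ2_BITS = 66 * 8 / 256           # 66-byte blocks of 256 weights = 2.0625

params = 7e9                      # assumed nominal size of a 7B model
fp16_gb = weight_gb(params, FP16_BITS)

print(f"log2(3) = {math.log2(3):.3f} bits per ternary weight")
for name, bits in [("FP16", FP16_BITS),
                   ("ternary, 1.58 bits", NOMINAL_TERNARY_BITS),
                   ("TQ2 blocks", TQ2_BITS)]:
    size = weight_gb(params, bits)
    print(f"{name:>20}: {size:6.2f} GB  ({fp16_gb / size:.1f}x smaller than FP16)")
```

Under these assumptions the ideal 1.58-bit encoding gives the roughly 10× reduction quoted above, while a 2-bit-plus-scale block format lands closer to 7.8×, so the achievable saving depends on the packing actually used.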
This work is an initial step toward supporting BitNet-style low-bit datatypes in ExecuTorch through Vulkan. While we demonstrate a working path for ternary-weight inference, much remains to be done, including improving shader efficiency, refining packed-kernel implementations, expanding operator coverage, and supporting a wider range of BitNet-family models. Join us in extending BitNet support in ExecuTorch via Vulkan!