This project was developed when the first version of BitNet had just come out, and runbit.c is only a naive implementation that just works (forked from llama2.c). Significantly better C/C++ kernels now exist for serious BitNet inference. This project instead targets a bare-metal implementation that runs on RISC-V cores and their RTL simulators.
make runbit
./runbit bit_model/bit_model_1M.bin # Run 1M BitNet LLaMa2 trained on tinystories
./runbit bit_model/bit_model_3M.bin # Run 3M BitNet LLaMa2 trained on tinystories
First, prepare the VexiiRiscv-based submodule:
make init-hw
Then, to generate VexiiRiscv with the BNRV extensions:
make gen-hw VEXIIARGS="<vexii args>" BNRV=< 0,4,8,16,32 >
This step needs the tools that VexiiRiscv requires, such as a JVM and SBT; please refer to the VexiiRiscv documentation for details. You can change the base VexiiRiscv configuration by setting VEXIIARGS, for example to --lsu-l1. You can switch BNRV implementations by setting BNRV, where version 4 is the buffer-free BNRV and versions 8, 16, and 32 are the buffered BNRV.
Finally, to simulate BitNet inference on the RTL simulator of VexiiRiscv and BNRV:
make sim-hw MARCH=< rv32i[m][f][c] > VEXIIARGS="<vexii args>" BNRV=< 0,4,8,16,32 >
This step requires (1) the RISC-V GNU toolchain as the RISC-V compiler and (2) Verilator as the RTL simulator. Install them according to the VexiiRiscv documentation and tutorials.
Note
Different versions of the RISC-V GNU toolchain use different naming conventions for the MARCH flags. With newer versions, check which march values are supported by running gcc with --print-multi-lib.
Note that you must always run make clean before simulating with a different MARCH or BNRV. The simulation usually takes 5 to 30 minutes (tested on an Intel i7-12700). MARCH=rv32imfc is recommended for the fastest simulation.
If you don't want to set up the RISC-V GNU toolchain, a few prebuilt ELFs are provided in riscv/prebuilt. Set USE_PREBUILT=1 to use them in simulation.
The default model and tokenizer are a 3M BitNet LLaMa and a 512-word tokenizer. To use different ones, replace riscv/bin/model.bin and riscv/bin/tokenizer.bin, then run make clean_bin && make under riscv.
Baseline Result:
make clean
# Running BitNet Inference on RV32IMFC VexiiRiscv without BNRV
make sim-hw MARCH=rv32imfc VEXIIARGS="--with-late-alu --regfile-async"
Simulation Output:
[info] Total Time: 79405300
[info] Forward Time: 79077855
[info] BitMatmul Time: 62084022
[info] RMSNorm Time: 201818
[info] Quant Time: 1325638
[info] Attention Time: 106738
[info] GLU Time: 4423640
[info] RoPE Time: 9515329
[info] FMatmul Time: 1319462
Accelerated Result:
make clean
# Running BitNet Inference on RV32IMFC VexiiRiscv with BNRV8
make sim-hw MARCH=rv32imfc VEXIIARGS="--with-late-alu --regfile-async" BNRV=8
Simulation Output:
[info] Total Time: 28045398
[info] Forward Time: 27716955
[info] BitMatmul Time: 10704528
[info] RMSNorm Time: 202005
[info] Quant Time: 1346039
[info] Attention Time: 104220
[info] GLU Time: 4426154
[info] RoPE Time: 9515285
[info] FMatmul Time: 1319457
End-to-End Inference Speedup: 79405300 / 28045398 = 2.83x
BitNet MatMul Speedup: 62084022 / 10704528 = 5.80x
With the ASAP 7nm PDK, VexiiRiscv (with or without BNRV) can run at 500 MHz, so token generation speeds up from 6.3 tokens/s to 17.8 tokens/s.
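The figures above can be cross-checked with a short worked calculation. It assumes that the reported Total Time is a cycle count per generated token (the README does not state the unit, but this reading reproduces the quoted tokens/s) and uses Amdahl's law as a rough model, with p the BitMatmul share of the baseline time and s the BitMatmul speedup. The symbols are introduced here for illustration only.

```latex
\begin{align*}
\text{tokens/s} &= \frac{f_{\text{clk}}}{N_{\text{total}}} \\
\text{baseline:}\quad & \frac{500 \times 10^{6}}{79405300} \approx 6.3 \\
\text{BNRV8:}\quad & \frac{500 \times 10^{6}}{28045398} \approx 17.8 \\[6pt]
p &= \frac{62084022}{79405300} \approx 0.782, \qquad
s = \frac{62084022}{10704528} \approx 5.80 \\
S_{\text{e2e}} &\approx \frac{1}{(1 - p) + p/s}
  \approx \frac{1}{0.218 + 0.135} \approx 2.83
\end{align*}
```

The estimate agrees with the measured 2.83x. In this model, the end-to-end speedup approaches s ≈ 5.80 as p approaches 1.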
In general, a better baseline setup (adding caches, adding branch predictors, dual-issue) will amplify the e2e speedup. The root cause of the slow BitNet MatMul on the baseline CPU is the heavy bit extraction/unpacking work, which BNRV bypasses with hardware support.
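To make the unpacking cost concrete, here is a minimal C sketch of a ternary dot product on a plain integer core. It is not taken from runbit.c: the function names (`decode_ternary`, `pack_ternary`, `ternary_dot`) and the 2-bit encoding (four weights per byte, least-significant bits first, `01` for +1, `10` for -1, `00` for 0) are assumptions made for illustration, and the actual packing used by this project may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoding: 0b00 -> 0, 0b01 -> +1, 0b10 -> -1 (0b11 decodes to 0). */
static inline int8_t decode_ternary(uint8_t code) {
    return (int8_t)((code & 1) - ((code >> 1) & 1));
}

/* Pack n ternary weights (-1, 0, +1) into (n + 3) / 4 bytes, 4 weights per byte. */
void pack_ternary(const int8_t *w, uint8_t *out, size_t n) {
    for (size_t i = 0; i < (n + 3) / 4; i++) {
        out[i] = 0;
    }
    for (size_t i = 0; i < n; i++) {
        uint8_t code = (w[i] > 0) ? 1u : (w[i] < 0) ? 2u : 0u;
        out[i / 4] |= (uint8_t)(code << (2 * (i % 4)));
    }
}

/* Dot product of n int8 activations with n packed ternary weights.
 * Every weight costs a load, a shift, a mask, and a decode before the
 * accumulate; this per-weight overhead is the unpacking work in question. */
int32_t ternary_dot(const int8_t *x, const uint8_t *w_packed, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        uint8_t byte = w_packed[i / 4];
        uint8_t code = (uint8_t)((byte >> (2 * (i % 4))) & 0x3);
        acc += decode_ternary(code) * x[i];
    }
    return acc;
}
```

In this sketch the arithmetic itself is only an add or subtract per weight, so most of the inner loop is spent extracting the 2-bit codes. A hardware extension can fold that extraction into the accumulate step, although the exact BNRV instruction semantics are not shown here.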
An example training setup is given in train_mini.sh. For dataset preparation, please refer to llama2.c.
https://github.com/karpathy/llama2.c
https://github.com/kyegomez/BitNet
https://github.com/cpldcpu/BitNetMCU
https://github.com/microsoft/BitNet
https://github.com/pulp-platform/RVfplib
@inproceedings{bnrv_iccd2025,
title = {BNRV: A Lightweight SIMD Extension for Efficient BitNet Inference on RISC-V CPUs},
author = {Zijun Jiang and Yangdi Lyu},
booktitle = {The 43rd IEEE International Conference on Computer Design (ICCD)},
year = {2025}
}