BNRV: Lightweight BitNet Acceleration on RISC-V via Custom Insturctions

Notice

This project was developed when the first version of BitNet just came out, and the runbit.c is only a naive implementation that just works (forked from llama2.c). There have been significantly better C/C++ kernels for serious BitNet inference. This project aims at the bare-metal implementation to be run on RISC-V cores and their RTL simulators.

Run BitNet on your CPU

make runbit
./runbit bit_model/bit_model_1M.bin # Run 1M BitNet LLaMa2 trained on tinystories
./runbit bit_model/bit_model_3M.bin # Run 3M BitNet LLaMa2 trained on tinystories

Run BitNet on RISC-V CPU RTL simulator

First, you need to prepare the VexiiRiscv-based submodule:

make init-hw

Then, if you want to generate VexiiRiscv with BNRV extensions:

make gen-hw VEXIIARGS="<vexii args>" BNRV=< 0,4,8,16,32 >"

This step needs tools like JVM and SBT that VexiiRiscv requires, please refer VexiiRiscv document for details. You can change base VexiiRiscv configs by setting VEXIIARGS, such as --lsu-l1. And you can switch BNRV implementations by setting BRNV, where version 4 is for buffer-free BNRV and version 8, 16, 32 are for buffered BNRV.

Finally, if you want to simulate BitNet inference on RTL simulator of Vexii and BNRV:

make sim-hw MARCH=< rv32i[m][f][c] > VEXIIARGS="<vexii args>" BNRV=< 0,4,8,16,32 >

This step requires: 1. RISC-V GNU toolchain (as risc-v compiler), 2. Verilator (as RTL simulator). Just install them according to the VexiiRiscv documents and tutorials.

Note

In different RISC-V GNU versions, the MARCH flags have different naming standards. In newer versions, please check the march support with --print-multi-lib on gcc.

Notice that you always need to make clean before simulating with another MARCH or BNRV. The simulation usually takes 5~30 minutes (tested on my Intel i7-12700). It's recommended to use MARCH=rv32imfc for fastest simulation.

If you don't want to setup RISC-V GNU, we have prepared a few prebuilt ELFs in riscv/prebuilt. Set USE_PREBUILT=1 to use them in simulation.

The default model and tokenizer is a 3M BitNet LLaMa and a 512-word tokenizer. You can replace them by replacing riscv/bin/model.bin and riscv/bin/tokenizer.bin, and run make clean_bin && make under riscv.

Simulation Example

Baseline Result:

make clean
# Running BitNet Inference on RV32IMFC VexiiRiscv without BNRV
make sim-hw MARCH=rv32imfc VEXIIARGS="--with-late-alu --regfile-async"

Simulation Output:

[info] Total Time: 79405300
[info] Forward Time: 79077855
[info] BitMatmul Time: 62084022
[info] RMSNorm Time: 201818
[info] Quant Time: 1325638
[info] Attention Time: 106738
[info] GLU Time: 4423640
[info] RoPE Time: 9515329
[info] FMatmul Time: 1319462

Accelerated Result:

make clean
# Running BitNet Inference on RV32IMFC VexiiRiscv with BNRV8
make sim-hw MARCH=rv32imfc VEXIIARGS="--with-late-alu --regfile-async" BNRV=8

Simulation Output:

[info] Total Time: 28045398
[info] Forward Time: 27716955
[info] BitMatmul Time: 10704528
[info] RMSNorm Time: 202005
[info] Quant Time: 1346039
[info] Attention Time: 104220
[info] GLU Time: 4426154
[info] RoPE Time: 9515285
[info] FMatmul Time: 1319457

End-to-End Inference Speedup: 79405300 / 28045398 = 2.83x

BitNet MatMul Speedup: 62084022 / 10704528 = 5.80x

Under ASAP 7nm PDK, the VexiiRiscv (w./w.o. BNRV) can run @ 500 MHz, so the token generation speed is accelerated from 6.3 tokens/s to 17.8 tokens/s.

In general, using a better baseline setup (add caches, add branch predictors, dual-issue) will amplify the e2e speedup. The root cause of slow BitNet MatMul on baseline CPU is the heavy bit extraction/unpacking operations, while BNRV bypasses them with hardware support.

Training BitNet on TinyStories

An example training setup is given in train_mini.sh. For dataset preparation, please refer to llama2.c.

Acknowledgement

https://github.com/karpathy/llama2.c

https://github.com/kyegomez/BitNet

https://github.com/cpldcpu/BitNetMCU

https://github.com/microsoft/BitNet

https://github.com/pulp-platform/RVfplib

Cite Us

Paper Link

@inproceedings{bnrv_iccd2025,
  title     = {BNRV: A Lightweight SIMD Extension for Efficient BitNet Inference on RISC-V CPUs},
  author    = {Zijun Jiang and Yangdi Lyu},
  booktitle = {The 43nd IEEE International Conference on Computer Design (ICCD)},
  year      = {2025}
}

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
bit_model		bit_model
data		data
hw		hw
riscv		riscv
.gitignore		.gitignore
.gitmodules		.gitmodules
BNRV_ICCD25.pdf		BNRV_ICCD25.pdf
Makefile		Makefile
README.md		README.md
bit_model.bin		bit_model.bin
bitlinear.py		bitlinear.py
bitmodel.py		bitmodel.py
export_bitmodel.py		export_bitmodel.py
requirements.txt		requirements.txt
runbit.c		runbit.c
sample_bit.py		sample_bit.py
tinystories.py		tinystories.py
tokenizer.bin		tokenizer.bin
tokenizer.model		tokenizer.model
tokenizer.py		tokenizer.py
train_bit.py		train_bit.py
train_mini.sh		train_mini.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

BNRV: Lightweight BitNet Acceleration on RISC-V via Custom Insturctions

Notice

Run BitNet on your CPU

Run BitNet on RISC-V CPU RTL simulator

Simulation Example

Training BitNet on TinyStories

Acknowledgement

Cite Us

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

BNRV: Lightweight BitNet Acceleration on RISC-V via Custom Insturctions

Notice

Run BitNet on your CPU

Run BitNet on RISC-V CPU RTL simulator

Simulation Example

Training BitNet on TinyStories

Acknowledgement

Cite Us

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages