RKLLM_LLAMA_QWEN

Run Qwen small language models on the Rockchip RK3588S (e.g. Khadas Edge2, Orange Pi 5, Radxa Rock 5) in two flavours:

NPU — via Rockchip's RKLLM runtime, using the dedicated 6 TOPS neural processing unit.
CPU — via llama.cpp, pinned to the 4× Cortex-A76 performance cores.

Everything is driven from a handful of one-command scripts that cross-compile on your host, then deploy the binary, runtime libraries and quantized model to the board over SSH.

Supported models out of the box: Qwen3-0.6B and Qwen2.5-0.5B-Instruct.

Demo

Qwen2.5-0.5B-Instruct running live with the fix_frequencies script applied, showing real-time token throughput alongside CPU/NPU utilisation. Captured on a Khadas Edge2; the same setup runs on any RK3588S board (Orange Pi 5, Radxa Rock 5, etc.).

NPU (RKLLM) — 39 tok/s

CPU (llama.cpp) — 28 tok/s

Repository layout

.
├── build_and_load_npu.sh      # NPU: cross-compile llm_demo + deploy to board
├── build_and_load_cpu.sh      # CPU: cross-compile llama-cli + deploy to board
├── convert_to_gguf.sh         # CPU: convert HF weights → quantized GGUF
├── cmake/
│   └── aarch64-linux-gnu-gcc.cmake   # cross-compile toolchain file
├── pipeline/
│   ├── NPU/                    # how_to_export / how_to_build / how_to_run
│   └── CPU/                    # how_to_export / how_to_build / how_to_run
│
│   # Git submodules (see .gitmodules):
├── rknn-llm/                  # Rockchip RKLLM SDK + demos
├── llama.cpp/                 # llama.cpp
├── Qwen3-0.6B/                # HuggingFace model weights
└── Qwen2.5-0.5B-Instruct/     # HuggingFace model weights

Note: the model directories and the two SDKs are git submodules and are not stored in this repo. They are fetched when you clone with --recursive or run git submodule update (see below). They are also listed in .gitignore as a safety net so their contents never get committed here.

Getting started

1. Clone with submodules

git clone --recursive https://github.com/alebal123bal/RKLLM_LLAMA_QWEN.git
cd RKLLM_LLAMA_QWEN

Already cloned without --recursive? Pull the submodules in:

git submodule update --init --recursive

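The Qwen model submodules point at HuggingFace LFS repos and are several hundred MB each. Make sure the git-lfs package is installed and initialised (git lfs install) before fetching them; otherwise the weight files arrive as small pointer files instead of the actual weights.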
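To confirm that all four submodules were actually fetched, the output of `git submodule status` can be inspected: a leading `-` marks a submodule that has not been initialised. The following is a minimal sketch; the function name `uninitialised_submodules` is illustrative and not part of this repository.

```shell
# Print the paths of submodules that are not initialised yet.
# `git submodule status` prefixes such entries with "-".
uninitialised_submodules() {
    awk '/^-/ { print $2 }'
}

# Usage, from the repository root:
#   git submodule status | uninitialised_submodules
# Empty output means every submodule has been checked out.
```

Running `git lfs version` is a quick way to check that git-lfs itself is available on the host.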

2. Prerequisites

On the host (build machine):

aarch64-linux-gnu-gcc / g++ cross-compiler (sudo apt install gcc-aarch64-linux-gnu g++-aarch64-linux-gnu)
cmake and make
Python 3 with the RKLLM toolkit (NPU export) and llama.cpp's converter deps (CPU export)
SSH access to the board
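
A quick preflight check for the host tools listed above could look like the sketch below. The helper name `missing_tools` is hypothetical, and the binary names `python3` and `ssh` are assumptions about how these prerequisites are installed on your machine.

```shell
# Print each required tool that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tool names taken from the host prerequisites; python3 and ssh are assumed names.
missing="$(missing_tools aarch64-linux-gnu-gcc aarch64-linux-gnu-g++ cmake make python3 ssh)"
if [ -n "$missing" ]; then
    printf 'Missing host tools:\n%s\n' "$missing"
else
    echo "All host tools found."
fi
```

This only checks that the commands exist; it does not verify versions or that the RKLLM toolkit is installed in the active Python environment.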

On the board (RK3588S):

A 64-bit Linux distro
The relevant runtime is uploaded automatically by the deploy scripts

Python environment (`rkllm_qwen`)

The NPU export step uses a dedicated conda environment with the RKLLM toolkit:

conda create -n rkllm_qwen python=3.10 -y
conda activate rkllm_qwen
pip install -r rknn-llm/rkllm-toolkit/packages/requirements.txt
# Toolkit wheel — match cpXY to your Python version (cp310 = 3.10):
pip install rknn-llm/rkllm-toolkit/packages/rkllm_toolkit-1.2.3-cp310-cp310-linux_x86_64.whl
pip install "setuptools<71"

See pipeline/NPU/how_to_export.md for full details.
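
To work out which `cpXY` tag matches the interpreter in the active environment, a small helper can derive it from the Python version. This is a sketch only: `cp_tag` is a hypothetical name, and whether a wheel for your tag is actually shipped in rknn-llm/rkllm-toolkit/packages/ has to be checked in that directory.

```shell
# Turn a "major.minor" Python version into the wheel's CPython tag,
# e.g. 3.10 -> cp310.
cp_tag() {
    printf 'cp%s\n' "$(printf '%s' "$1" | tr -d '.')"
}

# Query the interpreter of the active environment, if python3 is available.
if command -v python3 >/dev/null 2>&1; then
    version="$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')"
    echo "Expected wheel tag: $(cp_tag "$version")"
fi
```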

3. Configure your board

The deploy scripts default to khadas@192.168.1.58. Override per-run with env vars:

BOARD=user@192.168.1.42 bash build_and_load_cpu.sh
BOARD=user@host BOARD_DIR=~/llm bash build_and_load_npu.sh

| Variable | Default | Description |
|---|---|---|
| `BOARD` | `khadas@192.168.1.58` | SSH `user@host` of the board |
| `BOARD_DIR` | `~/programs` | Destination directory on board |
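
The override behaviour can be pictured with ordinary shell parameter defaults. The sketch below is an assumption about how such defaults are typically written, not a copy of the deploy scripts; `resolve_target` is a hypothetical helper.

```shell
# Resolve the deploy target from the environment, falling back to the
# documented defaults when BOARD / BOARD_DIR are unset or empty.
resolve_target() {
    default_dir='~/programs'   # kept literal, not expanded on the host
    board="${BOARD:-khadas@192.168.1.58}"
    board_dir="${BOARD_DIR:-$default_dir}"
    printf '%s:%s\n' "$board" "$board_dir"
}

echo "Deploy target: $(resolve_target)"

# Optional reachability check before a long build (left commented out so
# the snippet has no side effects):
#   ssh -o BatchMode=yes -o ConnectTimeout=5 "${BOARD:-khadas@192.168.1.58}" true
```

Note that an unquoted `~` in `BOARD_DIR=~/llm` is expanded by the host shell before the script sees it, so it becomes the host user's home directory. Quoting it (`BOARD_DIR='~/llm'`) keeps it literal. Which form the deploy scripts expect is not stated here.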

NPU pipeline (RKLLM)

| Step | Command / doc |
|---|---|
| Export | pipeline/NPU/how_to_export.md (produces a `.rkllm` model) |
| Build & deploy | `bash build_and_load_npu.sh [model_name]` (see pipeline/NPU/how_to_build.md) |
| Run | pipeline/NPU/how_to_run.md |

bash build_and_load_npu.sh Qwen3-0.6B

CPU pipeline (llama.cpp)

| Step | Command / doc |
|---|---|
| Export | `bash convert_to_gguf.sh [model_name] [quant]` (see pipeline/CPU/how_to_export.md) |
| Build & deploy | `bash build_and_load_cpu.sh [model_name]` (see pipeline/CPU/how_to_build.md) |
| Run | pipeline/CPU/how_to_run.md |

bash convert_to_gguf.sh Qwen3-0.6B Q4_K_S
bash build_and_load_cpu.sh Qwen3-0.6B

On the board, pin inference to the A76 cluster (cores 4-7 on the RK3588S) and use one thread per core (-t 4) for best throughput:

taskset -c 4-7 ./llama-cli -m Qwen3-0.6B-Q4_K_S.gguf -p "Hello" -n 128 -t 4
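
If you are unsure which core indices belong to the A76 cluster on your board, they can be read from /proc/cpuinfo: on arm64 Linux, Cortex-A76 cores report `CPU part : 0xd0b` (the Cortex-A55 cores report `0xd05`). The helper below is a sketch with a hypothetical name, `a76_cores`; it reads cpuinfo-formatted text from stdin and prints a comma-separated core list suitable for `taskset -c`.

```shell
# List the logical CPU indices whose "CPU part" is 0xd0b (Cortex-A76),
# reading /proc/cpuinfo-formatted text from stdin.
a76_cores() {
    awk -F': *' '
        /^processor/ { cpu = $2 }
        /^CPU part/  { if ($2 == "0xd0b") cores = cores (cores == "" ? "" : ",") cpu }
        END          { print cores }
    '
}

# Usage on the board:
#   taskset -c "$(a76_cores < /proc/cpuinfo)" ./llama-cli -m Qwen3-0.6B-Q4_K_S.gguf -p "Hello" -n 128 -t 4
```

On an RK3588S this is expected to yield the same cores as the hard-coded `4-7` range; the helper is mainly useful as a sanity check.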

Licensing

The orchestration scripts and documentation in this repository are released under the MIT License.

The bundled SDKs and models are pulled in as submodules and remain under their own licenses:

rknn-llm — Rockchip
llama.cpp — MIT
Qwen3-0.6B / Qwen2.5-0.5B-Instruct — see each model card
