Run Qwen small language models on the Rockchip RK3588S (e.g. Khadas Edge2, Orange Pi 5, Radxa Rock 5) in two flavours:
- NPU — via Rockchip's RKLLM runtime, using the dedicated 6 TOPS neural processing unit.
- CPU — via llama.cpp, pinned to the 4× Cortex-A76 performance cores.
Everything is driven from a handful of one-command scripts that cross-compile on your host, then deploy the binary, runtime libraries and quantized model to the board over SSH.
Supported models out of the box: Qwen3-0.6B and Qwen2.5-0.5B-Instruct.
Qwen2.5-0.5B-Instruct running live with the fix_frequencies script applied,
showing real-time token throughput alongside CPU/NPU utilisation. Captured on a
Khadas Edge2 — but this runs on any RK3588S board (Orange Pi 5, Radxa Rock 5, etc.).
NPU (RKLLM) — 39 tok/s
CPU (llama.cpp) — 28 tok/s
.
├── build_and_load_npu.sh # NPU: cross-compile llm_demo + deploy to board
├── build_and_load_cpu.sh # CPU: cross-compile llama-cli + deploy to board
├── convert_to_gguf.sh # CPU: convert HF weights → quantized GGUF
├── cmake/
│ └── aarch64-linux-gnu-gcc.cmake # cross-compile toolchain file
├── pipeline/
│ ├── NPU/ # how_to_export / how_to_build / how_to_run
│ └── CPU/ # how_to_export / how_to_build / how_to_run
│
│ # Git submodules (see .gitmodules):
├── rknn-llm/ # Rockchip RKLLM SDK + demos
├── llama.cpp/ # llama.cpp
├── Qwen3-0.6B/ # HuggingFace model weights
└── Qwen2.5-0.5B-Instruct/ # HuggingFace model weights
Note: the model directories and the two SDKs are git submodules and are not stored in this repo; they are fetched on clone (see below). They are also listed in .gitignore as a safety net so their contents never get committed here.
git clone --recursive https://github.com/alebal123bal/RKLLM_LLAMA_QWEN.git
cd RKLLM_LLAMA_QWEN
Already cloned without --recursive? Pull the submodules in:
git submodule update --init --recursive
The Qwen model submodules point at HuggingFace LFS repos and are several hundred MB each, so make sure git-lfs is installed (git lfs install).
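To confirm the submodules actually arrived, a small check along these lines may help. It is a sketch, not part of the repository: the helper name `uninitialised_submodules` is hypothetical, and it relies only on the fact that `git submodule status` prefixes each uninitialised submodule with `-`.

```shell
# Sketch: report submodules that are still uninitialised.
# Reads `git submodule status` output on stdin and prints the path of every
# entry whose line starts with "-" (git's marker for "not initialised").
uninitialised_submodules() {
    awk '/^-/ { print $2 }'
}

# Usage, from the repository root:
#   git lfs version >/dev/null 2>&1 || echo "git-lfs is not installed" >&2
#   git submodule status | uninitialised_submodules
```

Empty output means every submodule has been initialised. It does not prove that the LFS weight files were downloaded rather than left as pointer files.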
On the host (build machine):
- aarch64-linux-gnu-gcc/g++ cross-compiler (sudo apt install gcc-aarch64-linux-gnu g++-aarch64-linux-gnu)
- cmake and make
- Python 3 with the RKLLM toolkit (NPU export) and llama.cpp's converter deps (CPU export)
- SSH access to the board
On the board (RK3588S):
- A 64-bit Linux distro
- The relevant runtime is uploaded automatically by the deploy scripts
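The host-side prerequisites can be sanity-checked before a build with something like the following. This is a minimal sketch: `missing_tools` is a hypothetical helper, and the command names `python3` and `ssh` are assumptions about how Python 3 and the SSH client are installed on your host.

```shell
# Sketch: print each of the given commands that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Host-side tools from the list above (python3 and ssh are assumed names).
missing="$(missing_tools aarch64-linux-gnu-gcc aarch64-linux-gnu-g++ cmake make python3 ssh)"
if [ -n "$missing" ]; then
    echo "Missing host tools:" $missing >&2
fi
```

This only checks that the executables exist. It does not verify that the RKLLM toolkit or llama.cpp's converter dependencies are importable from Python.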
The NPU export step uses a dedicated conda environment with the RKLLM toolkit:
conda create -n rkllm_qwen python=3.10 -y
conda activate rkllm_qwen
pip install -r rknn-llm/rkllm-toolkit/packages/requirements.txt
# Toolkit wheel — match cpXY to your Python version (cp310 = 3.10):
pip install rknn-llm/rkllm-toolkit/packages/rkllm_toolkit-1.2.3-cp310-cp310-linux_x86_64.whl
pip install "setuptools<71"
See pipeline/NPU/how_to_export.md for full details.
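The cpXY tag in the wheel filename is the CPython major and minor version with the dot removed. A sketch of that mapping, using a hypothetical helper `cp_tag`, is shown below. Which tags the SDK actually ships wheels for is not stated here, so check the packages directory before picking a Python version.

```shell
# Sketch: derive the CPython wheel tag (e.g. cp310) from a "major.minor" version.
cp_tag() {
    printf 'cp%s\n' "$(printf '%s' "$1" | tr -d '.')"
}

# Inside the activated environment, the version can come from Python itself:
#   cp_tag "$(python -c 'import sys; print("%d.%d" % sys.version_info[:2])')"
```

For the environment created above (python=3.10), `cp_tag 3.10` gives `cp310`, matching the wheel named in the install command.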
The deploy scripts default to khadas@192.168.1.58. Override per-run with env vars:
BOARD=user@192.168.1.42 bash build_and_load_cpu.sh
BOARD=user@host BOARD_DIR=~/llm bash build_and_load_npu.sh

| Variable | Default | Description |
|---|---|---|
| BOARD | khadas@192.168.1.58 | SSH user@host of the board |
| BOARD_DIR | ~/programs | Destination directory on board |
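The override behaviour described above is the usual shell default-value pattern. The sketch below illustrates it and is not the deploy scripts' actual code; `resolve_target` is a hypothetical name, and only the two variables and their defaults come from this README.

```shell
# Sketch: resolve the deploy target from the environment, falling back to the
# documented defaults when BOARD or BOARD_DIR is unset or empty.
resolve_target() {
    default_board='khadas@192.168.1.58'
    default_dir='~/programs'   # kept literal so the board's shell resolves ~
    printf '%s:%s\n' "${BOARD:-$default_board}" "${BOARD_DIR:-$default_dir}"
}

resolve_target
```

One detail to watch: in `BOARD_DIR=~/llm`, the unquoted `~` is expanded by the host shell before the script runs, so the path refers to the host user's home directory. If the board uses a different user name, an absolute path on the board is probably safer.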
| Step | Command / doc |
|---|---|
| Export | pipeline/NPU/how_to_export.md — produces a .rkllm model |
| Build & deploy | bash build_and_load_npu.sh [model_name] — see pipeline/NPU/how_to_build.md |
| Run | pipeline/NPU/how_to_run.md |
bash build_and_load_npu.sh Qwen3-0.6B

| Step | Command / doc |
|---|---|
| Export | bash convert_to_gguf.sh [model_name] [quant] — see pipeline/CPU/how_to_export.md |
| Build & deploy | bash build_and_load_cpu.sh [model_name] — see pipeline/CPU/how_to_build.md |
| Run | pipeline/CPU/how_to_run.md |
bash convert_to_gguf.sh Qwen3-0.6B Q4_K_S
bash build_and_load_cpu.sh Qwen3-0.6B
On the board, inference is pinned to the A76 cluster for best throughput:
taskset -c 4-7 ./llama-cli -m Qwen3-0.6B-Q4_K_S.gguf -p "Hello" -n 128 -t 4

The orchestration scripts and documentation in this repository are released under the MIT License.
The bundled SDKs and models are pulled in as submodules and remain under their own licenses:
- rknn-llm — Rockchip
- llama.cpp — MIT
- Qwen3-0.6B / Qwen2.5-0.5B-Instruct — see each model card

