RKLLM_LLAMA_QWEN

Run Qwen small language models on the Rockchip RK3588S (e.g. Khadas Edge2, Orange Pi 5, Radxa Rock 5) in two flavours:

  • NPU — via Rockchip's RKLLM runtime, using the dedicated 6 TOPS neural processing unit.
  • CPU — via llama.cpp, pinned to the 4× Cortex-A76 performance cores.

Everything is driven from a handful of one-command scripts that cross-compile on your host, then deploy the binary, runtime libraries and quantized model to the board over SSH.

Supported models out of the box: Qwen3-0.6B and Qwen2.5-0.5B-Instruct.


Demo

Qwen2.5-0.5B-Instruct running live with the fix_frequencies script applied, showing real-time token throughput alongside CPU/NPU utilisation. Captured on a Khadas Edge2 — but this runs on any RK3588S board (Orange Pi 5, Radxa Rock 5, etc.).

NPU (RKLLM) — 39 tok/s

[Image: NPU demo]

CPU (llama.cpp) — 28 tok/s

[Image: CPU demo]


Repository layout

.
├── build_and_load_npu.sh      # NPU: cross-compile llm_demo + deploy to board
├── build_and_load_cpu.sh      # CPU: cross-compile llama-cli + deploy to board
├── convert_to_gguf.sh         # CPU: convert HF weights → quantized GGUF
├── cmake/
│   └── aarch64-linux-gnu-gcc.cmake   # cross-compile toolchain file
├── pipeline/
│   ├── NPU/                    # how_to_export / how_to_build / how_to_run
│   └── CPU/                    # how_to_export / how_to_build / how_to_run
│
│   # Git submodules (see .gitmodules):
├── rknn-llm/                  # Rockchip RKLLM SDK + demos
├── llama.cpp/                 # llama.cpp
├── Qwen3-0.6B/                # HuggingFace model weights
└── Qwen2.5-0.5B-Instruct/     # HuggingFace model weights

Note: the model directories and the two SDKs are git submodules and are not stored in this repo; they are fetched when you clone with --recursive (see below). They are also listed in .gitignore as a safety net so their contents never get committed here.


Getting started

1. Clone with submodules

git clone --recursive https://github.com/alebal123bal/RKLLM_LLAMA_QWEN.git
cd RKLLM_LLAMA_QWEN

Already cloned without --recursive? Pull the submodules in:

git submodule update --init --recursive

The Qwen model submodules point at HuggingFace LFS repos and are several hundred MB each — make sure git-lfs is installed (git lfs install).
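If a later step fails because a model or SDK directory is empty, it can help to check the submodule state first. The snippet below is a minimal sketch and not part of this repository: `count_uninitialised` is a hypothetical helper that counts the entries `git submodule status` marks with a leading `-`, which is how git flags a submodule that has not been initialised.

```shell
# Hypothetical helper, not shipped with this repo.
# Reads `git submodule status` output on stdin and prints how many
# submodules are still uninitialised (lines that start with "-").
count_uninitialised() {
  awk '/^-/ { n++ } END { print n + 0 }'
}
```

From the repository root, `git submodule status --recursive | count_uninitialised` should then print 0 once everything has been fetched, and `git lfs version` confirms that git-lfs is available.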

2. Prerequisites

On the host (build machine):

  • aarch64-linux-gnu-gcc / g++ cross-compiler (sudo apt install gcc-aarch64-linux-gnu g++-aarch64-linux-gnu)
  • cmake and make
  • Python 3 with the RKLLM toolkit (NPU export) and llama.cpp's converter deps (CPU export)
  • SSH access to the board

On the board (RK3588S):

  • A 64-bit Linux distro
  • No runtime to install by hand: the deploy scripts upload the relevant runtime libraries automatically
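Before running any of the scripts, a quick preflight on the host can catch a missing tool early. This is only a sketch: `missing_tools` is a hypothetical helper, and the tool list is taken from the host prerequisites above (it does not check the Python packages).

```shell
# Hypothetical preflight helper, not part of the repository.
# Prints the name of every command in its argument list that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Host tools named in the prerequisites above.
missing="$(missing_tools aarch64-linux-gnu-gcc aarch64-linux-gnu-g++ cmake make python3 ssh)"
if [ -n "$missing" ]; then
  printf 'Missing host tools:\n%s\n' "$missing" >&2
fi
```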

Python environment (rkllm_qwen)

The NPU export step uses a dedicated conda environment with the RKLLM toolkit:

conda create -n rkllm_qwen python=3.10 -y
conda activate rkllm_qwen
pip install -r rknn-llm/rkllm-toolkit/packages/requirements.txt
# Toolkit wheel — match cpXY to your Python version (cp310 = 3.10):
pip install rknn-llm/rkllm-toolkit/packages/rkllm_toolkit-1.2.3-cp310-cp310-linux_x86_64.whl
pip install "setuptools<71"

See pipeline/NPU/how_to_export.md for full details.
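The cpXY part of the wheel name is the CPython tag for your interpreter. As a small illustration, the hypothetical helper below (not part of the RKLLM toolkit or this repo) derives that tag from a version string; which wheels actually exist still has to be checked in rknn-llm/rkllm-toolkit/packages/.

```shell
# Hypothetical helper: turn a Python version such as "3.10" or "3.10.14"
# into the matching CPython wheel tag ("cp310").
py_wheel_tag() {
  # Keep only "major.minor", then drop the dot.
  major_minor="$(printf '%s\n' "$1" | cut -d. -f1,2)"
  printf 'cp%s\n' "$(printf '%s\n' "$major_minor" | tr -d '.')"
}
```

Inside the activated environment, `py_wheel_tag "$(python3 -c 'import platform; print(platform.python_version())')"` prints the tag to look for in the wheel filename.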

3. Configure your board

The deploy scripts default to khadas@192.168.1.58. Override per-run with env vars:

BOARD=user@192.168.1.42 bash build_and_load_cpu.sh
BOARD=user@host BOARD_DIR=~/llm bash build_and_load_npu.sh

| Variable | Default | Description |
| --- | --- | --- |
| BOARD | khadas@192.168.1.58 | SSH user@host of the board |
| BOARD_DIR | ~/programs | Destination directory on board |
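Per-run overrides like these are usually implemented with shell default expansion. The sketch below shows the idea using the defaults listed above; `resolve_target` is a hypothetical function, and the real scripts may resolve their settings differently.

```shell
# Hypothetical sketch of env-var defaults; not copied from the deploy scripts.
DEFAULT_BOARD='khadas@192.168.1.58'
DEFAULT_BOARD_DIR='~/programs'   # kept literal so the board's shell can expand "~"

# Prints the deploy target as user@host:directory, preferring BOARD and
# BOARD_DIR from the environment and falling back to the defaults.
resolve_target() {
  printf '%s:%s\n' "${BOARD:-$DEFAULT_BOARD}" "${BOARD_DIR:-$DEFAULT_BOARD_DIR}"
}
```

Because `${VAR:-default}` treats an empty value like an unset one, `BOARD= bash build_and_load_cpu.sh` would behave the same as leaving BOARD unset in a script written this way.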

NPU pipeline (RKLLM)

| Step | Command / doc |
| --- | --- |
| Export | pipeline/NPU/how_to_export.md (produces a .rkllm model) |
| Build & deploy | bash build_and_load_npu.sh [model_name] (see pipeline/NPU/how_to_build.md) |
| Run | pipeline/NPU/how_to_run.md |

bash build_and_load_npu.sh Qwen3-0.6B

CPU pipeline (llama.cpp)

| Step | Command / doc |
| --- | --- |
| Export | bash convert_to_gguf.sh [model_name] [quant] (see pipeline/CPU/how_to_export.md) |
| Build & deploy | bash build_and_load_cpu.sh [model_name] (see pipeline/CPU/how_to_build.md) |
| Run | pipeline/CPU/how_to_run.md |

bash convert_to_gguf.sh Qwen3-0.6B Q4_K_S
bash build_and_load_cpu.sh Qwen3-0.6B
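For orientation, a cross-compile of llama-cli with the toolchain file from cmake/ typically comes down to two CMake invocations. The dry-run sketch below only prints them; it is an assumption about the general shape of such a build, not a copy of build_and_load_cpu.sh, and BUILD_DIR with its default build-aarch64 is a hypothetical name.

```shell
# Dry-run sketch: prints the commands instead of running them.
# BUILD_DIR is a hypothetical variable; the real script may use other
# directories, targets and options.
print_cpu_build_plan() {
  # Absolute path, so CMake finds the toolchain file regardless of
  # which directory it resolves relative paths against.
  toolchain="$(pwd)/cmake/aarch64-linux-gnu-gcc.cmake"
  build_dir="${BUILD_DIR:-build-aarch64}"
  printf 'cmake -S llama.cpp -B %s -DCMAKE_TOOLCHAIN_FILE=%s -DCMAKE_BUILD_TYPE=Release\n' \
    "$build_dir" "$toolchain"
  printf 'cmake --build %s --target llama-cli --parallel\n' "$build_dir"
}

print_cpu_build_plan
```

Printing the plan first makes it easy to compare against what pipeline/CPU/how_to_build.md documents before anything is built.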

On the board, inference is pinned to the A76 cluster for best throughput:

taskset -c 4-7 ./llama-cli -m Qwen3-0.6B-Q4_K_S.gguf -p "Hello" -n 128 -t 4
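The range 4-7 assumes the A76 cores carry those indices on your kernel. To confirm, you can read the core types from /proc/cpuinfo, where Cortex-A76 cores report CPU part 0xd0b and Cortex-A55 cores report 0xd05. The helper below is a hypothetical sketch, not part of this repository.

```shell
# Hypothetical helper: reads /proc/cpuinfo-style text on stdin and prints
# a comma-separated list of the processor indices whose "CPU part" is
# 0xd0b (Cortex-A76), in a form that `taskset -c` accepts.
a76_cores() {
  awk -F: '
    /^processor/ { gsub(/[ \t]/, "", $2); cpu = $2 }
    /^CPU part/  {
      gsub(/[ \t]/, "", $2)
      if ($2 == "0xd0b") list = list (list == "" ? "" : ",") cpu
    }
    END { print list }
  '
}
```

On the board, `a76_cores < /proc/cpuinfo` would be expected to print 4,5,6,7 if the layout matches the command above, and the result can be passed straight to taskset: `taskset -c "$(a76_cores < /proc/cpuinfo)" ./llama-cli ...`.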

Licensing

The orchestration scripts and documentation in this repository are released under the MIT License.

The bundled SDKs and models are pulled in as submodules and remain under their own licenses; see each submodule for its terms.
