LiteRT CLI (Preview)

A command-line toolkit that streamlines LiteRT development workflows: converting, quantizing, compiling, managing, running, and benchmarking LiteRT (TFLite) models on various hardware (CPU / GPU / NPU) across platforms (desktop, mobile, or cloud).

Note

This is still an early preview release under active development, so platform and feature support is limited and bugs are possible. We appreciate your patience and feedback as we improve it.


🚀 Installation

You can install litert-cli-nightly from PyPI or from a local clone. To keep the initial installation fast, LiteRT CLI installs dependencies on demand, based on the commands you run.

We support installation using either uv (recommended for ultra-fast dependency resolution) or standard pip within a Python virtual environment.

Option 1: Use UV (Recommended)

uv is an extremely fast Python package manager written in Rust.

# 1. Create a virtual environment with Python 3.13.
# TIP: If you hit a dependency resolution error, try setting this environment variable:
#    export UV_INDEX_URL=https://pypi.org/simple
uv venv --clear --python=3.13 --seed
source .venv/bin/activate

# 2. Install the package into the active virtual environment
uv pip install litert-cli-nightly

# 3. Run help command
litert --help

Option 2: Use Standard Pip

python3 -m venv .venv
source .venv/bin/activate
pip install -q litert-cli-nightly
litert --help

Option 3: Install from Local Clone (for development)

uv venv --clear --python=3.13 --seed
source .venv/bin/activate
git clone git@github.com:google-ai-edge/LiteRT-CLI.git
cd LiteRT-CLI
uv pip install -e .

Quick Start

Try Colab

Try LiteRT CLI Colab to explore different features quickly.

Follow command help

You can always run litert --help or litert {command} --help to learn how to use the CLI. Detailed instructions for each command follow below.

# Run help command
litert --help

# Download a LiteRT model
litert download --help
litert download litert-community/efficientnet_b1 --file "*.tflite" --output efficientnet

# Run and benchmark a LiteRT model on your devices
litert run --help
litert run efficientnet/efficientnet_b1.tflite --desktop --cpu
litert benchmark --help
litert benchmark efficientnet/efficientnet_b1.tflite --android --gpu

Quick Demos

See the examples/ directory for comprehensive usage examples, including per-command demos and model-specific demos.

If you have cloned the repo, you can run the following commands to see the demos:

# Run all command demos
./examples/run_commands.sh

# Run all model demos
./examples/run_models.sh

# Run a specific model demo
./examples/run_models.sh efficientnet

🤖 Use in Coding Agent

Add the LiteRT CLI skill SKILL.md to your AI coding agent (such as Google Antigravity) and try prompts such as:

  • "Download LiteRT model litert-community/efficientnet_b1 and run it on CPU"
  • "Benchmark LiteRT model litert-community/efficientnet_b1 on my Android GPU"
  • "Compile LiteRT model litert-community/efficientnet_b1 for NPU target sm8750"
  • "Visualize LiteRT model litert-community/efficientnet_b1"
  • "Download the FP32 EfficientNet model litert-community/efficientnet_b1 from Hugging Face. Quantize it to INT8 dynamic range (--recipe dynamic_wi8_afp32), then benchmark both the original FP32 model and the newly quantized INT8 model on the GPU of my connected Android device. Compare the average latency and report the throughput speedup."
  • "Convert the model Qwen/Qwen1.5-0.5B-Chat from Hugging Face Hub to LiteRT format, and run it locally using the prompt 'Explain edge machine learning in one sentence'."
  • "Download EfficientNet from the Hugging Face repo litert-community/efficientnet_b1. Offline compile (AOT) the model for the sm8750 target NPU, and output the compiled model into ./models/compiled. Then, run an on-device inference and benchmark using this newly compiled AOT model on the connected Android device's NPU (--npu). Confirm that the graph loads directly without dynamic JIT compilation warmup latency."

The agent will automatically install the necessary tools, including Python virtual environments, litert-cli-nightly, and all required dependencies.
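
When an agent reports a speedup for the quantization prompt above, the figure can be checked by hand. A minimal sketch, assuming throughput is the reciprocal of average latency, so the speedup is the baseline latency divided by the new latency. The helper name speedup is hypothetical and the latencies are made-up numbers, not measurements.

```shell
# Hypothetical helper: speedup = baseline latency / new latency.
# $1 is the baseline (e.g. FP32) latency, $2 the new (e.g. INT8) latency,
# both in the same unit.
speedup() {
  awk -v base="$1" -v cur="$2" 'BEGIN { printf "%.2f\n", base / cur }'
}

# Made-up latencies in milliseconds, for illustration only.
speedup 10 4
```

A value above 1 means the quantized model is faster than the baseline.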


Verified Platforms

Verified with Python 3.13.

  • Host Machines:
    • Linux (Ubuntu)
    • macOS (Apple Silicon): litert compile is not supported
    • Windows: partially supported
  • Android:
    • CPU, GPU
    • NPU: Qualcomm, MediaTek (soon), Google Tensor (soon)

Troubleshooting & Tips

  • Always activate the virtual environment before running litert commands, to avoid conflicts.
  • If uv fails to resolve dependencies, try setting the environment variable export UV_INDEX_URL=https://pypi.org/simple before running the uv command.
  • litert compile currently runs only on Linux and requires Clang 18.x.x or above. Try sudo apt install clang libc++-dev libc++abi-dev
  • If litert run fails on GPU with the --gpu flag, try passing both --gpu and --cpu. The CLI will then try the GPU first and fall back to the CPU if the GPU fails.
  • If litert run fails on an Android device because the device is not detected, try running adb kill-server && adb start-server first.
  • When benchmarking with the --gcp flag, you need to:
    1. Join the EAP program in Google AI Edge Portal.
    2. Log in to GCP using gcloud auth login.
    3. Set your GCP project using --gcp=<Your-GCP-Project>.
  • If litert visualize fails to launch Model Explorer, try running litert visualize --stop-all first.
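
The virtual environment and Clang tips above can be checked before running any command. A minimal preflight sketch, assuming a POSIX shell. The function names venv_active, clang_major, and clang_ok are hypothetical helpers, not part of LiteRT CLI. The sketch relies only on the VIRTUAL_ENV variable exported by standard activation scripts and on the usual "clang version X.Y.Z" line printed by clang --version.

```shell
# Hypothetical helpers for illustration; none of these are LiteRT CLI commands.

# Succeeds when a Python virtual environment is active
# (activation scripts export VIRTUAL_ENV).
venv_active() {
  [ -n "${VIRTUAL_ENV:-}" ]
}

# Prints the major version found in the `clang --version` text passed as $1.
clang_major() {
  printf '%s\n' "$1" | sed -n 's/.*clang version \([0-9][0-9]*\)\..*/\1/p' | head -n 1
}

# Succeeds when the reported major version is 18 or newer.
clang_ok() {
  major="$(clang_major "$1")"
  [ -n "$major" ] && [ "$major" -ge 18 ]
}

venv_active || echo "No virtual environment is active; run: source .venv/bin/activate"

if command -v clang >/dev/null 2>&1; then
  clang_ok "$(clang --version)" || echo "clang is older than 18; litert compile may fail"
else
  echo "clang not found; litert compile needs Clang 18 or above"
fi
```

The parsing functions take the version text as an argument, so they can be tried on sample strings without a compiler installed.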

💡 Common Commands

1. Download a model from HuggingFace Hub

# Download only .tflite files
litert download litert-community/MobileNet-v3-large \
  --file "*.tflite" \
  --output mobilenet

# Download full repository
litert download litert-community/MobileNet-v3-large \
  --output mobilenet_full

# Download models using Hugging Face ID (uses HF ID as model reference too)
litert download litert-community/MobileNet-v3-large

# Download models with custom model reference
litert download litert-community/MobileNet-v3-large --model-ref my_model_ref

2. Convert a PyTorch model into a LiteRT model

# Automated HF Conversion
litert convert Qwen/Qwen1.5-0.5B-Chat --output /tmp/qwen

# Automated HF Conversion with INT4 Weight-Only Quantization
litert convert Qwen/Qwen1.5-0.5B-Chat --quantize-recipe weight_only_wi4_afp32 --output /tmp/qwen_w4

# Generic Script Injection with INT8 Dynamic Quantization
litert convert my_model.py --quantize-recipe dynamic_wi8_afp32 --output /tmp/mymodel

3. Quantize a LiteRT model

# Dynamic INT8 Quantization (Default)
litert quantize model.tflite \
  --recipe dynamic_wi8_afp32 \
  --output dynamic.tflite

# Weight-Only Quantization
litert quantize model.tflite \
  --recipe weight_only_wi8_afp32 \
  --output weight_only.tflite

# Static Range Quantization (requires calibration data)
litert quantize model.tflite \
  --recipe static_wi8_ai8 \
  --calibration-data calib_data.py \
  --output static.tflite

# Custom JSON Recipe
litert quantize model.tflite \
  --custom-recipe recipe.json \
  --output recipe.tflite
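
To compare recipes side by side, the commands above can be generated in a loop. A minimal sketch that only prints the commands (a dry run) so they can be reviewed before executing. The helper name quantize_cmds and the output file naming scheme are assumptions for illustration. It covers only the two recipes above that need no extra inputs.

```shell
# Hypothetical helper: prints one `litert quantize` command per recipe.
# Recipe names are the ones shown above; static_wi8_ai8 is left out because
# it also needs --calibration-data.
quantize_cmds() {
  model="$1"
  base="${model%.tflite}"
  for recipe in dynamic_wi8_afp32 weight_only_wi8_afp32; do
    printf 'litert quantize %s --recipe %s --output %s_%s.tflite\n' \
      "$model" "$recipe" "$base" "$recipe"
  done
}

# Review the printed commands first; pipe the output to `sh` to run them.
quantize_cmds model.tflite
```

Printing instead of executing keeps the loop safe to run on a machine where the model file is not yet present.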

4. Compile a LiteRT model for NPU AOT

Note

Currently supported only on Linux hosts and Qualcomm NPUs; support for other NPUs is coming soon.

# Basic compilation for specific Qualcomm NPU (e.g., sm8750 in Xiaomi 15 Pro)
litert compile model.tflite --target sm8750

# Compile for multiple targets and export an AI Pack for Android
litert compile model.tflite --target sm8750 --target mt6989 --export-aipack my_npu_models

5. Run a LiteRT model on Desktop or Android

# Run locally on desktop (CPU)
litert run model.tflite --desktop --cpu
litert run my_model_ref --desktop --cpu

# Run with GPU acceleration and CPU fallback (multi-accelerator)
litert run model.tflite --gpu --cpu
litert run model.tflite --accelerator gpu,cpu

# Run on connected Android device
litert run model.tflite --android

# Run on connected Android device with NPU acceleration and CPU fallback
litert run model.tflite --android --npu --cpu
litert run model.tflite --android --accelerator npu,cpu

# Run on connected Android device with NPU AOT-compiled model
litert run model_sm8450.tflite --android --npu

# Run multiple iterations and print output tensors
litert run model.tflite \
  --iterations 5 \
  --print-tensors

# Run with custom input formats (supports image, raw binary, numpy array)
litert run model.tflite \
  --input "image.png" \
  --print-tensors
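
Before using --android, it can help to confirm that adb sees a device in the ready state. A minimal sketch, assuming adb is installed and that adb devices prints one "serial, tab, state" line per device. The helper name ready_devices is hypothetical and not part of LiteRT CLI.

```shell
# Hypothetical helper: counts devices in the "device" (ready) state
# in the `adb devices` output passed as $1. Devices listed as
# "unauthorized" or "offline" are not counted.
ready_devices() {
  printf '%s\n' "$1" | awk '$2 == "device" { n++ } END { print n + 0 }'
}

if command -v adb >/dev/null 2>&1; then
  if [ "$(ready_devices "$(adb devices)")" -eq 0 ]; then
    echo "No ready Android device; try: adb kill-server && adb start-server"
  fi
else
  echo "adb not found on PATH"
fi
```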

6. Benchmark a model's performance

# Benchmark on Android (CPU side)
litert benchmark my_model_ref --android --cpu
litert benchmark model.tflite --android --cpu

# Benchmark on Android NPU (JIT mode)
litert benchmark model.tflite --android --npu

# Benchmark AOT compiled model on Android NPU
litert benchmark model_sm8450.tflite --android --npu

# Benchmark on Android GPU
litert benchmark model.tflite --android --gpu

# Benchmark on macOS (CPU)
litert benchmark my_model_ref --desktop --cpu

# Benchmark on Google AI Edge Portal in Google Cloud. Prerequisites:
# - Set up your Google AI Edge Portal account by following the instructions at:
#   https://ai.google.dev/edge/ai-edge-portal
# - Set up authentication by running: gcloud auth login
# - Set the default GCP project with the environment variable LITERT_GCP_PROJECT, or pass the --gcp-project option.
# - Specify your GCP bucket with --gcp-bucket; otherwise, a default one
#   is created.
litert benchmark model.tflite --gcp --device "pixel 7" --gcp-project "your-gcp-project-id" --gcp-bucket "your-gcp-bucket"
litert benchmark model.tflite --gcp --devices "pixel 7, sm-s931u1" --gpu
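
The GCP prerequisites listed in the comments above can be partly checked from the shell. A minimal sketch: the helper name gcp_missing is hypothetical, and it only checks that gcloud is on PATH and that a project is available through either LITERT_GCP_PROJECT or the value you intend to pass to --gcp-project. It does not verify portal membership or login state.

```shell
# Hypothetical helper: prints one line per missing prerequisite.
# $1 is the value you plan to pass to --gcp-project (may be empty).
gcp_missing() {
  command -v gcloud >/dev/null 2>&1 || echo "gcloud is not on PATH"
  if [ -z "${1:-}" ] && [ -z "${LITERT_GCP_PROJECT:-}" ]; then
    echo "no project: set LITERT_GCP_PROJECT or pass --gcp-project"
  fi
}

# Prints nothing when both checks pass.
gcp_missing "your-gcp-project-id"
```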

7. Visualize a model's architecture

# Open in Model Explorer graph
litert visualize model.tflite

# Clean up and stop visualizer background servers
litert visualize --stop-all

8. Import a local model

# Import a local file into the centralized cache
litert import my_model.tflite --model-ref my_model

# Import a directory and associate with a Hugging Face ID
litert import ./my_model_dir --model-ref my_model --hf-id my_org_name/my_model

9. List managed models

# List all managed models
litert list

# Show detailed contents of a specific model
litert list my_model

10. Delete a managed model

# Delete a model from cache
litert delete my_model

11. Run and benchmark a generative LLM model using LiteRT-LM CLI

# Run a generative LLM, loading it from Hugging Face
litert lm run \
  --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
  --prompt="What is the capital of France?"

# Or load from local LLM model file
litert lm run gemma-4-E2B-it.litertlm

# Example with a custom prompt
litert lm run gemma-4-E2B-it.litertlm --prompt "Hello, how are you?"

# Benchmark a generative LLM model
litert lm benchmark gemma-4-E2B-it.litertlm

12. Clean up all caches

# Clean up local caches, such as model files and binaries.
litert clean