A command-line toolkit that streamlines LiteRT-related development workflows: converting, quantizing, compiling, managing, running, and benchmarking LiteRT (TFLite) models on various hardware (CPU / GPU / NPU) across platforms (desktop, mobile, or cloud).
Note
This is still an early preview release under active development, so platform and feature support is limited and bugs are possible. We appreciate your patience and your feedback to help us improve it.
You can install litert-cli-nightly from PyPI or from a local clone. LiteRT CLI
installs dependencies on demand, based on which commands you run, to
speed up the initial installation.
We support installation using either uv (recommended for ultra-fast dependency resolution) or standard pip within a Python virtual environment.
uv is an extremely fast Python package manager written in Rust.
# 1. Create a virtual environment with Python 3.13.
# TIP: If dependency resolution fails, try setting this environment variable:
# export UV_INDEX_URL=https://pypi.org/simple
uv venv --clear --python=3.13 --seed
source .venv/bin/activate
# 2. Install the package into the active virtual environment
uv pip install litert-cli-nightly
# 3. Run help command
litert --help

# Alternative: install with standard pip in a Python virtual environment
python3 -m venv .venv
source .venv/bin/activate
pip install -q litert-cli-nightly
litert --help

# Alternative: install from a local clone
uv venv --clear --python=3.13 --seed
source .venv/bin/activate
git clone git@github.com:google-ai-edge/LiteRT-CLI.git
cd LiteRT-CLI
uv pip install -e .

Try the LiteRT CLI Colab to explore different features quickly.
You can always run litert --help or litert {command} --help to learn how
to use the CLI tool. Detailed instructions for each command follow below.
# Run help command
litert --help
# Download a LiteRT model
litert download --help
litert download litert-community/efficientnet_b1 --file "*.tflite" --output efficientnet
# Run and benchmark a LiteRT model on your devices
litert run --help
litert run efficientnet/efficientnet_b1.tflite --desktop --cpu
litert benchmark --help
litert benchmark efficientnet/efficientnet_b1.tflite --android --gpu

Check the comprehensive usage examples under the examples/ directory, which contains per-command demos and model-specific demos.
If you have cloned the repo, you can run the following commands to see the demos:
# Run all command demos
./examples/run_commands.sh
# Run all model demos
./examples/run_models.sh
# Run a specific model demo
./examples/run_models.sh efficientnet

Add the LiteRT CLI skill SKILL.md to your AI coding agent (like Google Antigravity) and try prompts such as:
- "Download LiteRT model litert-community/efficientnet_b1 and run it on CPU"
- "Benchmark LiteRT model litert-community/efficientnet_b1 on my Android GPU"
- "Compile LiteRT model litert-community/efficientnet_b1 for NPU target sm8750"
- "Visualize LiteRT model litert-community/efficientnet_b1"
- "Download the FP32 EfficientNet model litert-community/efficientnet_b1 from HuggingFace. Quantize it to INT8 dynamic range (--recipe dynamic_wi8_afp32), then benchmark both the original FP32 model and the newly quantized INT8 model on the GPU of my connected Android device. Compare the average latency and report the throughput speedup."
- "Convert the model Qwen/Qwen1.5-0.5B-Chat from HuggingFace Hub to LiteRT format, and run it locally using the prompt 'Explain edge machine learning in one sentence'."
- "Download EfficientNet from huggingface repo litert-community/efficientnet_b1. Offline compile (AOT) the model for the sm8750 target NPU, and output the compiled model into ./models/compiled. Then, run an on-device inference and benchmark using this newly compiled AOT model on the connected Android device's NPU (--npu). Confirm that the graph loads directly without dynamic JIT compilation warmup latency."
The agent will automatically install the necessary tools, including Python
virtual environments, litert-cli-nightly, and all required dependencies.
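
The quantization prompt above asks for a comparison of average latency and a throughput speedup. As a rough sketch of that arithmetic, assuming the speedup is taken as the baseline latency divided by the new latency: the function name speedup and the sample latencies below are hypothetical and are not measured results or part of the CLI.

```shell
# Sketch: derive a throughput speedup from two average latencies.
# The values passed in below are placeholders, not benchmark output.

# speedup BASELINE_MS CANDIDATE_MS
# Prints baseline / candidate with two decimals; fails if candidate <= 0.
speedup() {
  awk -v base="$1" -v cand="$2" 'BEGIN {
    if (cand <= 0) { exit 1 }
    printf "%.2f\n", base / cand
  }'
}

# Made-up example: 12.0 ms average for FP32 vs 4.0 ms average for INT8.
speedup 12.0 4.0   # prints 3.00
```

A value above 1.00 means the second model is faster on average; the latencies themselves would come from the benchmark runs on your own device.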
Verified in Python 3.13.
- Host Machines:
  - Linux (Ubuntu)
  - macOS (Apple Silicon): litert compile is not supported
  - Windows: partially supported
- Android:
  - CPU, GPU
  - NPU: Qualcomm, MediaTek (soon), Google Tensor (soon)
- Always activate the virtual environment before running the litert command, to avoid conflicts.
- When uv fails to resolve dependencies, try setting the environment variable export UV_INDEX_URL=https://pypi.org/simple before running the uv command.
- litert compile currently runs only on Linux and requires Clang version 18.x.x or above. Try sudo apt install clang libc++-dev libc++abi-dev
- When a run fails on GPU with the --gpu flag, try passing both the --cpu and --gpu flags; the CLI will then try GPU first and fall back to CPU when GPU fails.
- When litert run fails on an Android device because the device is not detected, try running adb kill-server && adb start-server first.
- When benchmarking with the --gcp flag, you need to:
  - Join the EAP program in Google AI Edge Portal;
  - Log in to GCP using gcloud auth login;
  - Set your GCP project using --gcp=<Your-GCP-Project>;
- When litert visualize fails to launch Model Explorer, try running litert visualize --stop-all first.
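
Some of the tips above can be checked before running a command. The following is a minimal sketch, not part of the CLI: the helper names in_venv, clang_ok, host_can_compile, and preflight are hypothetical, and the check assumes that clang --version prints a line containing "version X.Y.Z" and that the standard venv activate script sets VIRTUAL_ENV.

```shell
# in_venv: succeeds when a virtual environment is active.
in_venv() {
  [ -n "${VIRTUAL_ENV:-}" ]
}

# clang_ok VERSION: succeeds when the major version is 18 or above.
clang_ok() {
  major="${1%%.*}"
  case "$major" in
    ''|*[!0-9]*) return 1 ;;   # empty or non-numeric: treat as unsupported
  esac
  [ "$major" -ge 18 ]
}

# host_can_compile OS VERSION: compile needs a Linux host and Clang >= 18.
host_can_compile() {
  [ "$1" = "Linux" ] && clang_ok "$2"
}

# preflight: print warnings for the current machine without changing anything.
preflight() {
  in_venv || echo "warning: no active virtual environment"
  os="$(uname -s)"
  ver="$(clang --version 2>/dev/null \
    | sed -n 's/.*version \([0-9][0-9.]*\).*/\1/p' | head -n 1)"
  if host_can_compile "$os" "$ver"; then
    echo "litert compile: prerequisites look OK"
  else
    echo "warning: litert compile needs Linux and Clang 18 or above"
  fi
}
```

Calling preflight before litert compile only prints warnings; it does not install or modify anything.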
# Download only .tflite files
litert download litert-community/MobileNet-v3-large \
--file "*.tflite" \
--output mobilenet
# Download full repository
litert download litert-community/MobileNet-v3-large \
--output mobilenet_full
# Download models using Hugging Face ID (uses HF ID as model reference too)
litert download litert-community/MobileNet-v3-large
# Download models with custom model reference
litert download litert-community/MobileNet-v3-large --model-ref my_model_ref

# Automated HF Conversion
litert convert Qwen/Qwen1.5-0.5B-Chat --output /tmp/qwen
# Automated HF Conversion with INT4 Weight-Only Quantization
litert convert Qwen/Qwen1.5-0.5B-Chat --quantize-recipe weight_only_wi4_afp32 --output /tmp/qwen_w4
# Generic Script Injection with INT8 Dynamic Quantization
litert convert my_model.py --quantize-recipe dynamic_wi8_afp32 --output /tmp/mymodel

# Dynamic INT8 Quantization (Default)
litert quantize model.tflite \
--recipe dynamic_wi8_afp32 \
--output dynamic.tflite
# Weight-Only Quantization
litert quantize model.tflite \
--recipe weight_only_wi8_afp32 \
--output weight_only.tflite
# Static Range Quantization (requires calibration data)
litert quantize model.tflite \
--recipe static_wi8_ai8 \
--calibration-data calib_data.py \
--output static.tflite
# Custom JSON Recipe
litert quantize model.tflite \
--custom-recipe recipe.json \
--output recipe.tflite

Note
litert compile is currently supported only on Linux hosts and Qualcomm NPUs; support for other NPUs is coming soon!
# Basic compilation for specific Qualcomm NPU (e.g., sm8750 in Xiaomi 15 Pro)
litert compile model.tflite --target sm8750
# Compile for multiple targets and export an AI Pack for Android
litert compile model.tflite --target sm8750 --target mt6989 --export-aipack my_npu_models

# Run locally on desktop (CPU)
litert run model.tflite --desktop --cpu
litert run my_model_ref --desktop --cpu
# Run with GPU acceleration and CPU fallback (multi-accelerator)
litert run model.tflite --gpu --cpu
litert run model.tflite --accelerator gpu,cpu
# Run on connected Android device
litert run model.tflite --android
# Run on connected Android device with NPU acceleration and CPU fallback
litert run model.tflite --android --npu --cpu
litert run model.tflite --android --accelerator npu,cpu
# Run on connected Android device with NPU AOT-compiled model
litert run model_sm8450.tflite --android --npu
# Run multiple iterations and print output tensors
litert run model.tflite \
--iterations 5 \
--print-tensors
# Run with custom input formats (supports image, raw binary, numpy array)
litert run model.tflite \
--input "image.png" \
--print-tensors

# Benchmark on Android (CPU side)
litert benchmark my_model_ref --android --cpu
litert benchmark model.tflite --android --cpu
# Benchmark on Android NPU (JIT mode)
litert benchmark model.tflite --android --npu
# Benchmark AOT compiled model on Android NPU
litert benchmark model_sm8450.tflite --android --npu
# Benchmark on Android GPU
litert benchmark model.tflite --android --gpu
# Benchmark on macOS (CPU)
litert benchmark my_model_ref --desktop --cpu
# Benchmark on Google AI Edge Portal in Google Cloud. Prerequisites:
# - Set up your Google AI Edge Portal account by following the instructions at:
#   https://ai.google.dev/edge/ai-edge-portal
# - Set up authentication by running: gcloud auth login
# - Set the default GCP project through the environment variable LITERT_GCP_PROJECT, or pass the --gcp-project option.
# - Specify your GCP bucket with --gcp-bucket; otherwise, a default one is created.
litert benchmark model.tflite --gcp --device "pixel 7" --gcp-project "your-gcp-project-id" --gcp-bucket "your-gcp-bucket"
litert benchmark model.tflite --gcp --devices "pixel 7, sm-s931u1" --gpu

# Open in Model Explorer graph
litert visualize model.tflite
# Clean up and stop visualizer background servers
litert visualize --stop-all

# Import a local file into the centralized cache
litert import my_model.tflite --model-ref my_model
# Import a directory and associate with a Hugging Face ID
litert import ./my_model_dir --model-ref my_model --hf-id my_org_name/my_model

# List all managed models
litert list
# Show detailed contents of a specific model
litert list my_model

# Delete a model from cache
litert delete my_model

# Run a generative LLM model, loaded from Hugging Face
litert lm run \
--from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
--prompt="What is the capital of France?"
# Or load from local LLM model file
litert lm run gemma-4-E2B-it.litertlm
# Example with a custom prompt
litert lm run gemma-4-E2B-it.litertlm --prompt "Hello, how are you?"
# Benchmark a generative LLM model
litert lm benchmark gemma-4-E2B-it.litertlm

# Clean up the local cache, such as model files and binaries.
litert clean
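
The per-command examples above can be chained into a single script. The sketch below reuses only commands and flags that appear in this document; the run_step and pipeline helpers, the DRY_RUN switch, and the INT8 output file name are assumptions made for illustration. By default it only prints the commands, so it can be reviewed before anything is executed.

```shell
# DRY_RUN=1 (the default in this sketch) prints each command instead of
# running it. Set DRY_RUN=0 to execute; that requires litert to be installed
# and an Android device to be connected.
DRY_RUN="${DRY_RUN:-1}"

# run_step CMD ARGS...: print the command in dry-run mode, otherwise run it.
run_step() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"   # note: printed form drops the original quoting
  else
    "$@"
  fi
}

# pipeline: download, quantize, then benchmark both models on Android GPU.
pipeline() {
  model_dir="efficientnet"
  fp32="$model_dir/efficientnet_b1.tflite"
  int8="$model_dir/efficientnet_b1_int8.tflite"   # hypothetical output name

  run_step litert download litert-community/efficientnet_b1 \
    --file "*.tflite" --output "$model_dir"
  run_step litert quantize "$fp32" --recipe dynamic_wi8_afp32 --output "$int8"
  run_step litert benchmark "$fp32" --android --gpu
  run_step litert benchmark "$int8" --android --gpu
}
```

Run pipeline to preview the four commands, then DRY_RUN=0 pipeline once the preview looks right.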