| name | litert-cli |
|---|---|
| description | LiteRT CLI tool to convert, download, quantize, run, benchmark, and visualize LiteRT models. |

This skill lets the agent download, convert, quantize, run, benchmark, and visualize LiteRT models using the `litert` command on desktop, on a device, or on Google Cloud.

Before running any `litert` command, the agent should ensure it is in a Python virtual environment and that `litert-cli` is installed.

We recommend `uv` (written in Rust) for fast environment management and package installs.

**Option A: Use UV (Recommended - Super Fast):**

```bash
# Create a virtual environment with Python 3.13.
# --seed pre-installs pip, setuptools, and wheel inside the venv.
# This is required for the CLI's dynamic dependency auto-installers (deps.py) to work.
# If you hit a dependency resolution error, try setting the environment variable:
#   UV_INDEX_URL=https://pypi.org/simple
uv venv --clear --python=3.13 --seed
source .venv/bin/activate
```

**Option B: Use Standard Pip/Venv:**

```bash
# Create and activate a standard Python virtual environment.
python3 -m venv litert-cli-venv
source litert-cli-venv/bin/activate
# Upgrade pip, setuptools, and wheel immediately.
# This is needed to meet the project's PEP 517 requirements (setuptools>=61.0).
pip install --upgrade pip setuptools wheel
```

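
A pre-flight check along these lines can confirm both preconditions before any `litert` command is issued. This is a minimal sketch, not part of `litert-cli`: the function name `check_litert_env` is hypothetical, and it assumes that an activated environment exports `VIRTUAL_ENV` and that the installed package places a `litert` executable on `PATH`.

```bash
# Hypothetical helper (not shipped with litert-cli): verify the shell is ready.
check_litert_env() {
  # Activation scripts generated by `python3 -m venv` and `uv venv` export VIRTUAL_ENV.
  if [ -z "${VIRTUAL_ENV:-}" ]; then
    echo "no active virtual environment" >&2
    return 1
  fi
  # ASSUMPTION: installing litert-cli puts a `litert` executable on PATH.
  if ! command -v litert >/dev/null 2>&1; then
    echo "litert not found on PATH" >&2
    return 2
  fi
  return 0
}
```

A script could call `check_litert_env || exit 1` as its first step and only then proceed to the install or run commands.
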
### 2. Check/Install LiteRT CLI
Ensure `litert-cli` and any required optional extensions (extras) are installed:

**Using UV:**

```bash
# Install in editable mode from local source
uv pip install -e .
# Or install from local source with extras (e.g., convert, lm, compile)
uv pip install -e ".[convert,lm,compile]"
```

**Using standard Pip:**

```bash
pip install -e .
pip install -e ".[convert,lm,compile]"
```

## Core Commands
### 💡 The Model Reference (`model-ref`) System
To avoid handling complex and fragile absolute filesystem paths, the LiteRT CLI uses a centralized **Model Reference (`model-ref`)** catalog.

When you download or import a model to the centralized cache, you can assign it a reference alias (and optional sub-references):

* **Format**: `<alias_name>` or `<alias_name>:<sub_reference>` (e.g., `mobilenet`, `resnet18:gpu`, `efficientnet:int8`).
* **Default alias**: For HuggingFace downloads, if `--model-ref` is omitted, the CLI automatically assigns a flattened repository ID (e.g., `litert-community__MobileNet-v3-large`) as the default alias.

Once a model is registered, **all CLI commands** (including `run`, `benchmark`, `compile`, `delete`, and `list`) accept this `<model_ref>` directly instead of a file path. The CLI resolves it to the correct absolute cache file path on the fly.

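
The reference format described above can be illustrated with a few shell helpers. This is a sketch for scripts that need to inspect a reference themselves, not the CLI's own resolver: all three function names are hypothetical, and the flattening rule (each `/` replaced by `__`) is an assumption inferred from the single `litert-community__MobileNet-v3-large` example.

```bash
# Hypothetical helpers; the CLI resolves model references internally.

# Print the alias part of "<alias_name>" or "<alias_name>:<sub_reference>".
model_ref_alias() {
  printf '%s\n' "${1%%:*}"
}

# Print the sub-reference part, or an empty line when there is none.
model_ref_sub() {
  case "$1" in
    *:*) printf '%s\n' "${1#*:}" ;;
    *)   printf '\n' ;;
  esac
}

# ASSUMPTION: the default alias is the repository ID with "/" replaced by "__".
default_alias() {
  printf '%s\n' "$1" | sed 's|/|__|g'
}
```

For example, `model_ref_alias resnet18:gpu` prints `resnet18` and `model_ref_sub resnet18:gpu` prints `gpu`.
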
**Examples:**

```bash
# Run inference using the central alias directly
litert run mobilenet --android --cpu
# Benchmark using a specific sub-reference GPU file
litert benchmark resnet18:gpu --android --gpu
# Compile for NPU directly using the reference alias
litert compile efficientnet --target sm8750
# Delete from the central cache
litert delete mobilenet
```

Run a `.tflite` model locally on desktop or on an ADB-connected Android device.

**Desktop Execution (CPU and Local GPU):**

```bash
litert run <path_to_model> --desktop --cpu
```

- Output logs are clean by default.
- To enable verbose C++ debug and setup logs, set the environment variable: `export LITERT_VERBOSE=1`.
- `--gpu`: Use the desktop GPU if available.
- Accelerator Fallback: If running on GPU (`--gpu`) fails, you can pass both `--gpu --cpu` (or `--accelerator gpu,cpu`). The CLI attempts GPU first and falls back to CPU on failure.

**Android Execution (CPU, GPU, or NPU):**

```bash
litert run <path_to_model> --android --cpu
```

- `--gpu`: Run on the Android GPU using OpenCL/WebGPU.
- Accelerator Fallback: Similarly, pass both `--gpu --cpu` (or `--accelerator gpu,cpu`) on Android to use CPU as a fallback if GPU delegate creation fails.
- `--npu`: Run on the Android device NPU. Supports two execution paradigms based on the input model:

  1. **JIT (Just-In-Time) compilation mode**: Pass a standard, non-compiled `.tflite` model. The on-device LiteRT runtime automatically downloads/invokes the vendor-specific compiler plugin to compile operators dynamically at graph initialization time.

     ```bash
     litert run standard_model.tflite --android --npu
     ```

  2. **AOT (Ahead-Of-Time) execution mode**: Pass an already NPU-compiled `.tflite` model (compiled offline via `litert compile`). The on-device runtime loads the compiled binary block directly on the NPU. This avoids graph-compilation warmup overhead, leading to sub-millisecond initialization latency.

     ```bash
     litert run resnet18_compiled_sm8750.tflite --android --npu
     ```

**Multi-Input Formats (Literals or Arrays):**

```bash
litert run model.tflite --desktop --input inputs="[0.5, 0.5, 0.5]" --print-tensors
```

**Multi-Input Formats (Files - .npy, .raw, .png):**

```bash
litert run model.tflite --desktop --input inputs="test_input.npy" --print-tensors
```

**Standard Selection:**

```bash
litert quantize <path_to_model> --output <output_path>
```

**Dynamic Quantization (`dynamic_wi8_afp32`):**

```bash
litert quantize model.tflite --recipe dynamic_wi8_afp32 --output dynamic.tflite
```

**Static Quantization with Calibration Data:**

```bash
litert quantize model.tflite --recipe static_wi8_ai8 --calibration-data "calib_data.py" --output static.tflite
```

**Custom JSON Recipe:**

```bash
litert quantize model.tflite --custom-recipe "recipe.json" --output recipe.tflite
```

Launch the Model Explorer to visualize the model structure.

```bash
litert visualize <path_to_model>
```

**Advanced Visualization Options:**

- `--no-reuse-server`: Force creation of a new server port.
- `--stop-all`: Kill all existing background visualization servers.

```bash
litert visualize --stop-all
```

Download public LiteRT models from HuggingFace Hub or a direct URL.

```bash
litert download <repo_id_or_url> --output <output_dir>
```

**Key Behavioral Nuance (Custom Output vs Centralized Cache):**

- **Direct URL Downloads**: Metadata is never saved.
- **HuggingFace Downloads (Default Central Cache)**: If `--output` is omitted, the model downloads to `~/.cache/litert-cli/models/` and the CLI automatically creates `metadata.json` to catalog the model for CLI commands (like `litert list`).
- **HuggingFace Downloads (Custom Folder)**: If `--output` is provided, the command acts as a pure, clean download of only the model files. It does not generate a `metadata.json` file in the output folder.

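
Because only default-cache HuggingFace downloads produce a `metadata.json`, a script may want to check for that file before relying on catalog commands such as `litert list`. The helper below is a rough sketch: the name `has_model_metadata` is hypothetical, and since the directory layout beneath the cache root is not documented here, it simply searches the whole tree.

```bash
# Hypothetical helper: succeed if a metadata.json exists anywhere under $1.
# ASSUMPTION: the layout below the cache root is unknown, so search recursively.
has_model_metadata() {
  dir="$1"
  [ -d "$dir" ] || return 1
  find "$dir" -type f -name metadata.json | grep -q .
}
```

For instance, `has_model_metadata "$HOME/.cache/litert-cli/models"` should succeed after a default-cache download and fail for a folder populated with `--output`.
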
**Filter by File Type:**

```bash
litert download litert-community/MobileNet-v3-large --file "*.tflite" --output ./models
```

**With Custom Model Reference:**

```bash
litert download litert-community/MobileNet-v3-large --model-ref my_model_ref
```

Import a local file or directory into the centralized cache.

```bash
litert import my_model.tflite --model-ref my_model
```

List all managed models or detailed contents of a specific model.

```bash
litert list
litert list my_model
```

Convert a PyTorch or HuggingFace model into a LiteRT model.

**From HuggingFace Model Hub:**

```bash
litert convert Qwen/Qwen1.5-0.5B-Chat --output /tmp/qwen
# With INT4 Weight-Only Quantization (Recommended for LLM)
litert convert Qwen/Qwen1.5-0.5B-Chat --quantize-recipe weight_only_wi4_afp32 --output /tmp/qwen_w4
```

**From Generic Python Script:**

```bash
litert convert my_model.py --output /tmp/mymodel
litert convert my_model.py --quantize-recipe dynamic_wi8_afp32 --output /tmp/mymodel_quant
```

* `--model-func`: Name of the function that returns the model (`torch.nn.Module`). Default: `get_model`.
* `--input-func`: Name of the function that returns sample inputs. Default: `get_args`.
* `--quantize-recipe` (alias `--quantize`): Quantization recipe to apply (e.g., `dynamic_wi8_afp32`, `weight_only_wi4_afp32`).

Interact with LLM generative models (like Qwen 1.5 or Gemma 4) using the native `litert-lm` CLI.

> [!TIP]
> **Non-interactive / Background Execution best practice:** When running generative LLM inference with the `litert lm run` command in scripts or in the background, the process hangs indefinitely waiting for the next chat prompt on standard input (stdin). To prevent this and ensure it prints the prompt response and exits immediately, always redirect stdin from `/dev/null` (i.e., append `< /dev/null` to the command).

```bash
litert lm run <model_path_or_reference_id> < /dev/null
```

**Run with model file path:**

```bash
# Generative LLM models require the path to the compiled .litertlm model file or directory.
# Append < /dev/null to exit immediately after printing the answer.
litert lm run <model_dir>/model.litertlm --prompt "What is edge AI?" < /dev/null
```

**Download and run with HuggingFace repo:**

```bash
litert lm run \
  --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
  gemma-4-E2B-it.litertlm \
  --prompt="What is the capital of France?" \
  < /dev/null
```

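
To make the stdin redirection hard to forget in automation, it can be wrapped once and reused. The wrapper below is a minimal sketch; `run_noninteractive` is a hypothetical name, and it works for any command, not only `litert lm run`.

```bash
# Hypothetical wrapper: run a command with stdin attached to /dev/null so it
# reads end-of-file immediately instead of blocking on interactive input.
# The wrapped command's exit status is passed through unchanged.
run_noninteractive() {
  "$@" < /dev/null
}
```

A script would then call, for example, `run_noninteractive litert lm run <model_path_or_reference_id> --prompt "What is edge AI?"`.
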
### 9. Benchmark
Benchmark LiteRT models on different platforms (Android, Google Cloud, or
Desktop).
**On connected Android device via ADB (CPU, GPU, or NPU):**

```bash
# Benchmark on CPU (Default)
litert benchmark model.tflite --android --cpu
# Benchmark on NPU (Requires compiling for NPU first)
litert benchmark model.tflite --android --npu
# Benchmark on GPU (using OpenCL/OpenGL delegates)
litert benchmark model.tflite --android --gpu
```

**On MacBook (CPU):**

```bash
litert benchmark my_model_ref --desktop --cpu
```

**On Google AI Edge Portal in Google Cloud (GCP):**

> [!IMPORTANT]
> **Prerequisites for GCP Benchmarking:**
>
> 1. Join the Google AI Edge Portal early access program at: https://ai.google.dev/edge/ai-edge-portal
> 2. Authenticate your terminal session by running: `gcloud auth login`
> 3. Configure the GCP Project ID. You can either:
>    * Set the environment variable: `export LITERT_GCP_PROJECT="your-gcp-project-id"`
>    * Or explicitly pass the `--gcp-project` option in the command.
> 4. Configure the Google Cloud Storage (GCS) bucket for model uploading. The CLI resolves it via:
>    * Explicit `--gcp-bucket` CLI option.
>    * `LITERT_GCP_BUCKET` environment variable.
>    * Default fallback: Automatically creates and uses `gs://{gcp_project}-litert-models`.

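
The bucket lookup listed in the prerequisites can be sketched as a small function, which is handy for predicting which bucket a given invocation will use. This illustrates the documented options only and is not the CLI's implementation: `resolve_gcp_bucket` is a hypothetical name, and treating the listed order as strict precedence is an assumption.

```bash
# Hypothetical illustration of the documented bucket lookup.
# ASSUMPTION: the options are tried in the order they are listed.
#   $1: value given to --gcp-bucket (empty string if the option was not passed)
#   $2: GCP project ID
resolve_gcp_bucket() {
  flag_bucket="$1"
  project="$2"
  if [ -n "$flag_bucket" ]; then
    printf '%s\n' "$flag_bucket"
  elif [ -n "${LITERT_GCP_BUCKET:-}" ]; then
    printf '%s\n' "$LITERT_GCP_BUCKET"
  else
    # Default fallback name pattern from the prerequisites.
    printf 'gs://%s-litert-models\n' "$project"
  fi
}
```

With no option and no environment variable set, `resolve_gcp_bucket "" your-gcp-project-id` prints `gs://your-gcp-project-id-litert-models`.
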
```bash
# Benchmark on GCP Pixel 7 CPU (using default auto-created project bucket)
litert benchmark model.tflite --gcp --device "pixel 7" --gcp-project "your-gcp-project-id"
# Benchmark on GCP Pixel 7 CPU (specifying custom GCS bucket explicitly)
litert benchmark model.tflite --gcp --device "pixel 7" --gcp-project "your-gcp-project-id" --gcp-bucket "your-custom-bucket"
# Benchmark on multiple devices at once on GPU
litert benchmark model.tflite --gcp --devices "pixel 7, sm-s931u1" --gpu --gcp-project "your-gcp-project-id"
```

Apply Ahead-of-Time (AOT) offline compilation to a standard LiteRT (`.tflite`) model for specific edge SoC target NPUs (e.g., Qualcomm sm8550, MediaTek mt6989).

**Basic target NPU compilation:**

```bash
litert compile my_model.tflite --target sm8750
```

**Compile for multiple NPU targets and export an Android AI Pack (for PODAI deployment):**

```bash
litert compile my_model.tflite --target sm8550 --target mt6989 --export-aipack my_npu_models
```

**Compile and specify a custom output directory:**

```bash
litert compile my_model.tflite --target sm8750 --output-dir ./compiled
```

**Update target SoC metadata configurations from GitHub repository:**

```bash
litert compile --update-targets main
```

Delete a managed model from the centralized cache.

```bash
litert delete my_model
```

Clean up model cache, etc.

```bash
litert clean
```

Agents should run tests after modifying code to ensure no regressions.

To run unit tests locally:

```bash
python litert_cli/litert_test.py
python litert_cli/litert_help_test.py
```

To run comprehensive end-to-end regression tests:

```bash
./examples/run_smoke_tests.sh
./examples/run_commands.sh
./examples/run_models.sh
```

- Pipe outputs to text files or `grep` them if you are looking for specific tensor shapes or runtime metrics.
- **Avoid hanging background processes**: When executing the `litert lm run` command in a script or in the background, always append `< /dev/null` to redirect standard input. Otherwise, the process blocks indefinitely waiting on stdin.

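
Both tips can be combined in one helper that saves the full output and prints only the lines of interest. This is a sketch under stated assumptions: `log_and_grep` is a hypothetical name, and any search pattern you pass is a guess about what the CLI prints, since the output format is not specified here.

```bash
# Hypothetical helper: run a command non-interactively, keep its full output
# in a log file, then print only the lines matching a pattern.
#   $1: log file   $2: grep pattern (case-insensitive)   rest: the command
log_and_grep() {
  log_file="$1"
  pattern="$2"
  shift 2
  status=0
  # Capture stdout and stderr; stdin comes from /dev/null so nothing blocks.
  "$@" > "$log_file" 2>&1 < /dev/null || status=$?
  # A missing match is not an error, so ignore grep's exit status.
  grep -i -- "$pattern" "$log_file" || true
  return "$status"
}
```

For example, `log_and_grep bench.log "latency" litert benchmark model.tflite --android --cpu` would keep the complete log in `bench.log`; whether the word "latency" appears in the output is an assumption to adjust after inspecting a real run.
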
These complex prompts showcase how to combine and leverage this skill. You can use them directly in agent queries:

- "Download the FP32 EfficientNet model `litert-community/efficientnet_b1` from HuggingFace. Quantize it to INT8 dynamic range (`--recipe dynamic_wi8_afp32`), then benchmark both the original FP32 model and the newly quantized INT8 model on the GPU of my connected Android device. Compare the average latency and report the throughput speedup."
- "Create a Python environment with UV, install `litert-cli`, convert the model `Qwen/Qwen1.5-0.5B-Chat` from HuggingFace Hub to LiteRT format, and run it locally using the prompt 'Explain edge machine learning in one sentence'."
- "Download EfficientNet from the HuggingFace repo `litert-community/efficientnet_b1`. On a Linux host machine, offline compile (AOT) the model for the `sm8750` target NPU, and output the compiled model inside `./models/compiled`. Then, run an on-device inference and benchmark using this newly compiled AOT model on the connected Android device's NPU (`--npu`). Confirm that the compiled graph loads directly without dynamic JIT compilation warmup latency."
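
The first prompt asks for a speedup figure, which is simple arithmetic once the two average latencies are known: at a fixed batch size throughput is inversely proportional to latency, so the speedup is the baseline latency divided by the new latency. The function below is a sketch; `latency_speedup` is a hypothetical name, and the latencies must be read from the benchmark output by hand, because its format is not specified here.

```bash
# Hypothetical helper: speedup of a candidate model over a baseline.
#   $1: baseline average latency   $2: candidate average latency (same unit)
# Prints baseline / candidate with two decimals; fails if candidate <= 0.
latency_speedup() {
  awk -v base="$1" -v cand="$2" 'BEGIN {
    if (cand <= 0) { exit 1 }
    printf "%.2f\n", base / cand
  }'
}
```

With made-up figures of 10 ms for FP32 and 5 ms for INT8, `latency_speedup 10 5` prints `2.00`.
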