Merged
Changes from 4 commits
2 changes: 2 additions & 0 deletions .agents/skills/litert_cli/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -112,9 +112,11 @@ Run a tflite model locally on desktop or on a adb connected Android device.
* Output logs are **clean by default**.
* To enable C++ verbose debug setup logs, set the environment variable: `export LITERT_VERBOSE=1`.
* `--gpu`: Use desktop GPU if available.
* **Accelerator Fallback**: If running on GPU (`--gpu`) fails, you can pass both **`--gpu --cpu`** (or `--accelerator gpu,cpu`). The CLI will attempt GPU first and gracefully fall back to CPU on failure.

**Android Execution (CPU, GPU, or NPU):** `litert run <path_to_model> --android --cpu`
* `--gpu`: Run on Android GPU using OpenCL/WebGPU.
* **Accelerator Fallback**: Similarly, pass both **`--gpu --cpu`** (or `--accelerator gpu,cpu`) on Android to use CPU as a fallback if GPU delegate creation fails.
* `--npu`: Run on Android device NPU. Supports **two execution paradigms** based on the input model:

**1. JIT (Just-In-Time) compilation mode:** Pass a standard, non-compiled
69 changes: 56 additions & 13 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,8 @@ including converting, quantizing, compiling, managing, running, and benchmarking
LiteRT (TFLite) models on various hardware (CPU / GPU / NPU) across platforms
(desktop, mobile, or cloud).

> [!NOTE] It's a still early preview release under active development, thus has
> [!NOTE]
> This is still an early preview release under active development, so it has
> limited platform and feature support and may contain bugs. We appreciate your
> patience and feedback to help us improve it.

Expand All @@ -23,7 +24,7 @@ We support installation using either
for ultra-fast dependency resolution) or standard
**[pip](https://pip.pypa.io/)** within a Python virtual environment.

#### Option 1: Use UV (Recommended)
### Option 1: Use UV (Recommended)

`uv` is an extremely fast Python package manager written in Rust.

Expand All @@ -50,7 +51,7 @@ pip install -q litert-cli-nightly
litert --help
```

#### Option 3. Install from Local Clone (for development)
### Option 3: Install from Local Clone (for development)

```bash
uv venv --clear --python=3.13 --seed
Expand Down Expand Up @@ -92,8 +93,8 @@ litert benchmark efficientnet/efficientnet_b1.tflite --android --gpu

### Quick Demos

Check comprehensive usage examples under the `examples/` directory, which
contains per-command demos and model-specific demos.
Check comprehensive usage examples under the [examples/](https://github.com/google-ai-edge/LiteRT-CLI/tree/main/examples)
directory, which contains per-command demos and model-specific demos.

If you have cloned the repo, you can run the following commands to see the
demos:
Expand All @@ -109,7 +110,7 @@ demos:
./examples/run_models.sh efficientnet
```

## 🤖 Use in Coding Agent
### 🤖 Use in Coding Agent

Add the LiteRT CLI skill
[`SKILL.md`](https://github.com/google-ai-edge/LiteRT-CLI/blob/main/.agents/skills/litert_cli/SKILL.md)
Expand All @@ -121,6 +122,19 @@ into your AI coding agent (like Google Antigravity) and try prompts such as:
* "Compile LiteRT model `litert-community/efficientnet_b1` for NPU target
`sm8750`"
* "Visualize LiteRT model `litert-community/efficientnet_b1`"
* "Download the FP32 EfficientNet model `litert-community/efficientnet_b1` from
HuggingFace. Quantize it to INT8 dynamic range (`--recipe dynamic_wi8_afp32`),
then benchmark both the original FP32 model and the newly quantized INT8 model
on the GPU of my connected Android device. Compare the average latency and
report the throughput speedup."
* "Convert the model `Qwen/Qwen1.5-0.5B-Chat` from Hugging Face Hub to LiteRT format,
and run it locally using the prompt 'Explain edge machine learning in one sentence'."
* "Download EfficientNet from the Hugging Face repo `litert-community/efficientnet_b1`.
Offline compile (AOT) the model for the `sm8750` target NPU, and output
the compiled model into `./models/compiled`. Then, run an on-device inference
and benchmark using this newly compiled AOT model on the connected Android
device's NPU (`--npu`). Confirm that the graph loads directly without
dynamic JIT compilation warmup latency."
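
The "throughput speedup" requested in the quantization prompt above is just the ratio of the two average latencies. A minimal sketch of that arithmetic, where the function name and the latency numbers are illustrative assumptions rather than CLI output:

```python
def throughput_speedup(baseline_latency_ms: float, candidate_latency_ms: float) -> float:
    """Return how many times faster the candidate is than the baseline.

    Throughput is inferences per second (1000 / latency_ms), so the
    throughput ratio reduces to baseline latency / candidate latency.
    """
    if baseline_latency_ms <= 0 or candidate_latency_ms <= 0:
        raise ValueError("latencies must be positive")
    return baseline_latency_ms / candidate_latency_ms


# Hypothetical average latencies, NOT measured results.
fp32_ms = 12.0
int8_ms = 8.0
print(f"INT8 speedup over FP32: {throughput_speedup(fp32_ms, int8_ms):.2f}x")
```

A value above 1.0 means the quantized model is faster; a value below 1.0 means quantization made it slower on that accelerator.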

The agent will automatically install the necessary tools, including Python
virtual environments, `litert-cli-nightly`, and all required dependencies.
Expand All @@ -137,7 +151,27 @@ Verified in Python 3.13.
* Windows: partially supported
* **Android**:
* CPU, GPU
* NPU: Qualcomm (supported), MediaTek (soon), Google Tensor (soon)
* NPU: Qualcomm, MediaTek (soon), Google Tensor (soon)

--------------------------------------------------------------------------------

### Troubleshooting & Tips

* Always activate the virtual environment before running the `litert` command, to avoid conflicts.
* When `uv` fails to resolve dependencies, try setting the environment variable
`export UV_INDEX_URL=https://pypi.org/simple` before running the `uv` command.
* `litert compile` currently runs only on Linux, and it requires
Clang version `18.x.x` or above. Try
`sudo apt install clang libc++-dev libc++abi-dev`.
* When a run fails on GPU with the `--gpu` flag, try passing both `--gpu --cpu` flags
in the command; the CLI will then try GPU first and fall back to CPU if GPU fails.
* When `litert run` fails on an Android device because the device is not detected, try
running `adb kill-server && adb start-server` first.
* When benchmarking with the `--gcp` flag, you need to:
1) [Join the EAP program in Google AI Edge Portal](https://ai.google.dev/edge/ai-edge-portal);
2) Log in to GCP using `gcloud auth login`;
3) Set your GCP project using `--gcp=<Your-GCP-Project>`.
* When `litert visualize` fails to launch Model Explorer, try running `litert visualize --stop-all` first.
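
The GPU-then-CPU fallback mentioned in the tips can be pictured as an ordered "first accelerator that works" loop. The sketch below illustrates that idea only and is not the CLI's actual implementation: `run_with_fallback` and `fake_run` are hypothetical names, and the simulated GPU failure stands in for a real delegate error.

```python
from typing import Callable, Sequence, Tuple


def run_with_fallback(
    accelerators: Sequence[str], run_on: Callable[[str], str]
) -> Tuple[str, str]:
    """Try each accelerator in order; return (accelerator, result) for the first success."""
    errors = []
    for accelerator in accelerators:
        try:
            return accelerator, run_on(accelerator)
        except RuntimeError as exc:
            # Remember why this accelerator failed, then try the next one.
            errors.append(f"{accelerator}: {exc}")
    raise RuntimeError("all accelerators failed: " + "; ".join(errors))


def fake_run(accelerator: str) -> str:
    # Stand-in for real inference: pretend GPU delegate creation fails.
    if accelerator == "gpu":
        raise RuntimeError("GPU delegate creation failed")
    return f"ran on {accelerator}"


used, result = run_with_fallback(["gpu", "cpu"], fake_run)
print(used, result)
```

Collecting the per-accelerator errors keeps the GPU failure visible even when the CPU fallback succeeds or everything fails.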

--------------------------------------------------------------------------------

Expand Down Expand Up @@ -202,7 +236,8 @@ litert quantize model.tflite \

### 4. Compile a LiteRT model for NPU AOT

> [!NOTE] Currently only supported on Linux hosts and Qualcomm NPUs.
> [!NOTE]
> Currently only supported on Linux hosts and Qualcomm NPUs; support for other NPUs is coming soon.

```bash
# Basic compilation for specific Qualcomm NPU (e.g., sm8750 in Xiaomi 15 Pro)
Expand Down Expand Up @@ -236,12 +271,12 @@ litert run model_sm8450.tflite --android --npu
# Run multiple iterations and print output tensors
litert run model.tflite \
--iterations 5 \
--print_tensors
--print-tensors

# Run with custom input formats (supports image, raw binary, numpy array)
litert run model.tflite \
--input "image.png" \
--print_tensors
--print-tensors
```

### 6. Benchmark a model's performance
Expand Down Expand Up @@ -311,19 +346,27 @@ litert list my_model
litert delete my_model
```

### 11. Run a generative LLM model using LiteRT-LM CLI
### 11. Run and benchmark a generative LLM model using LiteRT-LM CLI

```bash
# Run a generative LLM model
# Run a generative LLM model loaded from Hugging Face
litert lm run \
--from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
--prompt="What is the capital of France?"

# Or load from local LLM model file
litert lm run gemma-4-E2B-it.litertlm

# Example with a custom prompt
litert lm run gemma-4-E2B-it.litertlm --prompt "Hello, how are you?"

# Benchmark a generative LLM model
litert lm benchmark gemma-4-E2B-it.litertlm
```

### 12. Clean up all caches

```bash
# Clean up model cache, etc.
# Clean up local caches, such as model files and binaries.
litert clean
```
14 changes: 3 additions & 11 deletions examples/commands/convert_test.sh
Original file line number Diff line number Diff line change
Expand Up @@ -29,23 +29,15 @@ echo -e "\n${BLUE}${BOLD}--- 1. Generic Script Mode (resnet18.py) ---${NC}"
run_case "Convert: PyTorch ResNet18 Base" \
litert convert resnet18.py --output models/resnet18_base

# 1.2 Conversion with Quantization (pt2e_dynamic)
run_case "Convert: PyTorch ResNet18 with PT2E Dynamic Quantization" \
litert convert resnet18.py --output models/resnet18_pt2e --quantize pt2e_dynamic

# 1.3 Conversion with Model Args (e.g. batch_size=4)
# 1.2 Conversion with Model Args (e.g. batch_size=4)
run_case "Convert: PyTorch ResNet18 with Model Args (batch_size=4)" \
litert convert resnet18.py --output models/resnet18_b4 --model-args "batch_size=4"

# 1.4 Conversion with Quantization (pt2e_per_channel)
run_case "Convert: PyTorch ResNet18 with PT2E Per-Channel Quantization" \
litert convert resnet18.py --output models/resnet18_pt2e_pc --quantize-recipe pt2e_per_channel

# 1.5 Conversion with Quantization (dynamic_wi8_afp32)
# 1.3 Conversion with Quantization (dynamic_wi8_afp32)
run_case "Convert: PyTorch ResNet18 with Dynamic INT8 Recipe" \
litert convert resnet18.py --output models/resnet18_dyn_wi8 --quantize-recipe dynamic_wi8_afp32

# 1.6 Conversion with Quantization (weight_only_wi8_afp32)
# 1.4 Conversion with Quantization (weight_only_wi8_afp32)
run_case "Convert: PyTorch ResNet18 with Weight-Only INT8 Recipe" \
litert convert resnet18.py --output models/resnet18_wo_wi8 --quantize-recipe weight_only_wi8_afp32

4 changes: 2 additions & 2 deletions examples/models/gemma4.sh
Original file line number Diff line number Diff line change
Expand Up @@ -24,9 +24,9 @@ source "$(dirname "${BASH_SOURCE[0]}")/../utils.sh"
setup_test_env "gemma4" "Gemma4 LLM Demo Script"

# --- 1. Convert HuggingFace Model google/gemma-4-E2B-it ---
# TODO: Bring this back when we add support for --externalize_embedder in CLI convert command.
# Wait for LiteRT Torch release.
# run_case "Convert: HuggingFace google/gemma-4-E2B-it" \
# litert convert google/gemma-4-E2B-it --output "models/gemma4"
# litert convert google/gemma-4-E2B-it --output "models/gemma4"

# --- 2. Run Gemma4 Generative LLM Model ---
run_case "Run Gemma4: Generative inference with custom prompt" \
2 changes: 2 additions & 0 deletions examples/models/resnet.sh
Original file line number Diff line number Diff line change
Expand Up @@ -65,6 +65,8 @@ if has_android_device; then
litert run "$RESNET_TFLITE" --android --cpu --iterations 1

# ResNet18 PADV2 op is currently not fully supported by Android OpenCL GPU delegate.
run_case "Run: ResNet18 FP32 on Android GPU with CPU fallback" \
litert run "$RESNET_TFLITE" --android --cpu --gpu --iterations 1

run_case "Run: ResNet18 Dynamic INT8 on Android (CPU)" \
litert run "models/resnet18/resnet18_int8_dynamic.tflite" --android --cpu --iterations 1
30 changes: 24 additions & 6 deletions litert_cli/commands/convert/huggingface.py
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,6 @@
from ai_edge_litert.aot import aot_compile as aot_lib
from ai_edge_litert.aot.ai_pack import export_lib as ai_pack_export


def convert_huggingface(
model: str,
output: str,
Expand Down Expand Up @@ -69,10 +68,16 @@ def convert_huggingface(
model, trust_remote_code=True
)
architectures = getattr(config, "architectures", [])
if not any("CausalLM" in arch for arch in architectures):
is_causal_lm = any("CausalLM" in arch for arch in architectures)
is_gemma3 = any("Gemma3ForConditionalGeneration" in arch for arch in architectures)
is_gemma3n = any("Gemma3nForConditionalGeneration" in arch for arch in architectures)
is_gemma4 = any("Gemma4ForConditionalGeneration" in arch for arch in architectures)
is_gemma_vlm = is_gemma3 or is_gemma3n or is_gemma4

if not (is_causal_lm or is_gemma_vlm):
raise ValueError(
f"Currently only AutoModelForCausalLM is supported. Model '{model}'"
f" has architectures {architectures}."
f"Currently only AutoModelForCausalLM or Gemma VLM architectures (Gemma3, Gemma3n, Gemma4) are supported. "
f"Model '{model}' has architectures {architectures}."
)
except Exception as e:
if isinstance(e, ValueError):
Expand All @@ -84,15 +89,28 @@ def convert_huggingface(

# Call the auto-export function from litert_torch.
# It automatically saves to the output.
task = "text_generation"
export_kwargs = {}
use_jinja_template = is_gemma4
if is_gemma_vlm:
task = "image_text_to_text"
export_kwargs["export_vision_encoder"] = True
export_kwargs["externalize_embedder"] = True
if is_gemma4:
export_kwargs["jinja_chat_template_override"] = "litert-community/gemma-4-E2B-it-litert-lm"


hf_export.export(
model=model,
output_dir=output,
task="text_generation",
task=task,
quantization_recipe=quantize,
prefill_lengths=parsed_prefill,
cache_length=cache_length,
bundle_litert_lm=bundle_litert_lm,
use_jinja_template=False,
trust_remote_code=False,
use_jinja_template=use_jinja_template,
**export_kwargs,
)

if target:
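
The architecture check and export-option selection added in this hunk can be read as one pure function from `architectures` to keyword arguments. The sketch below restates that logic in isolation so it can be exercised without downloading a model. `select_export_options` and `GEMMA_VLM_ARCHITECTURES` are hypothetical names introduced here; the option keys and values are copied from the diff and are not verified against the `litert_torch` API.

```python
from typing import Any, Dict, List

# Architecture names taken from the diff above.
GEMMA_VLM_ARCHITECTURES = (
    "Gemma3ForConditionalGeneration",
    "Gemma3nForConditionalGeneration",
    "Gemma4ForConditionalGeneration",
)


def select_export_options(architectures: List[str]) -> Dict[str, Any]:
    """Map a Hugging Face config's `architectures` list to export options."""
    is_causal_lm = any("CausalLM" in arch for arch in architectures)
    is_gemma4 = any("Gemma4ForConditionalGeneration" in arch for arch in architectures)
    is_gemma_vlm = any(
        name in arch for arch in architectures for name in GEMMA_VLM_ARCHITECTURES
    )
    if not (is_causal_lm or is_gemma_vlm):
        raise ValueError(f"Unsupported architectures: {architectures}")

    # Defaults for plain causal LMs.
    options: Dict[str, Any] = {
        "task": "text_generation",
        "use_jinja_template": is_gemma4,
    }
    if is_gemma_vlm:
        # Gemma vision-language models also export the vision encoder.
        options["task"] = "image_text_to_text"
        options["export_vision_encoder"] = True
        options["externalize_embedder"] = True
    if is_gemma4:
        options["jinja_chat_template_override"] = (
            "litert-community/gemma-4-E2B-it-litert-lm"
        )
    return options


print(select_export_options(["Qwen2ForCausalLM"]))
print(select_export_options(["Gemma4ForConditionalGeneration"]))
```

Keeping the decision in a function like this would also ensure the flags are always defined before the export call, whichever path the surrounding `try`/`except` takes.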