google-ai-edge · copybara-service · May 19, 2026
diff --git a/.agents/skills/litert_cli/SKILL.md b/.agents/skills/litert_cli/SKILL.md
@@ -112,9 +112,11 @@ Run a tflite model locally on desktop or on a adb connected Android device.
 * Output logs are **clean by default**.
 * To enable C++ verbose debug setup logs, set the environment variable: `export LITERT_VERBOSE=1`.
 * `--gpu`: Use desktop GPU if available.
+* **Accelerator Fallback**: If running on GPU (`--gpu`) fails, you can pass both **`--gpu --cpu`** (or `--accelerator gpu,cpu`). The CLI will attempt GPU first and gracefully fall back to CPU on failure.
 
 **Android Execution (CPU, GPU, or NPU):** `litert run <path_to_model> --android --cpu`
 * `--gpu`: Run on Android GPU using OpenCL/WebGPU.
+* **Accelerator Fallback**: Similarly, pass both **`--gpu --cpu`** (or `--accelerator gpu,cpu`) on Android to use CPU as a fallback if GPU delegate creation fails.
 * `--npu`: Run on Android device NPU. Supports **two execution paradigms** based on the input model:
 
 **1. JIT (Just-In-Time) compilation mode:** Pass a standard, non-compiled

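Review note: the fallback behaviour described in the SKILL.md hunk above (try GPU first, then CPU) can be pictured with a small sketch. This is not the CLI's implementation; `run_with_fallback` and `fake_run` are hypothetical names, and the use of `RuntimeError` to signal an accelerator failure is an assumption made only for illustration.

```python
from typing import Callable, Sequence


def run_with_fallback(
    accelerators: Sequence[str],
    run_on: Callable[[str], str],
) -> tuple[str, str]:
    """Try each accelerator in order; return (accelerator, result) for the first success."""
    errors = []
    for accelerator in accelerators:
        try:
            return accelerator, run_on(accelerator)
        except RuntimeError as err:
            # Record the failure and move on to the next accelerator in the list.
            errors.append(f"{accelerator}: {err}")
    raise RuntimeError("all accelerators failed: " + "; ".join(errors))


def fake_run(accelerator: str) -> str:
    # Hypothetical stand-in for inference: pretend GPU delegate creation fails.
    if accelerator == "gpu":
        raise RuntimeError("delegate creation failed")
    return f"ran on {accelerator}"


used, result = run_with_fallback(["gpu", "cpu"], fake_run)
print(used, result)
```

The order of the list is what encodes "GPU first, CPU as fallback", mirroring `--accelerator gpu,cpu`.
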
diff --git a/README.md b/README.md
@@ -6,7 +6,8 @@ including converting, quantizing, compiling, managing, running, and benchmarking
 LiteRT (TFLite) models on various hardware (CPU / GPU / NPU) across platforms
 (desktop, mobile, or cloud).
 
-> [!NOTE] It's a still early preview release under active development, thus has
+> [!NOTE]
+> It's still an early preview release under active development, thus has
 > limited platform and feature support, plus possible bugs. We appreciate your
 > patience and feedback to help us improve it.
 
@@ -23,7 +24,7 @@ We support installation using either
 for ultra-fast dependency resolution) or standard
 **[pip](https://pip.pypa.io/)** within a Python virtual environment.
 
-#### Option 1: Use UV (Recommended)
+### Option 1: Use UV (Recommended)
 
 `uv` is an extremely fast Python package manager written in Rust.
 
@@ -50,7 +51,7 @@ pip install -q litert-cli-nightly
 litert --help
 ```
 
-#### Option 3. Install from Local Clone (for development)
+### Option 3: Install from Local Clone (for development)
 
 ```bash
 uv venv --clear --python=3.13 --seed
@@ -92,8 +93,8 @@ litert benchmark efficientnet/efficientnet_b1.tflite --android --gpu
 
 ### Quick Demos
 
-Check comprehensive usage examples under the `examples/` directory, which
-contains per-command demos and model-specific demos.
+Check comprehensive usage examples under the [examples/](https://github.com/google-ai-edge/LiteRT-CLI/tree/main/examples)
+directory, which contains per-command demos and model-specific demos.
 
 If you have cloned the repo, you can run the following commands to see the
 demos:
@@ -109,7 +110,7 @@ demos:
 ./examples/run_models.sh efficientnet
 ```
 
-## 🤖 Use in Coding Agent
+### 🤖 Use in Coding Agent
 
 Add the LiteRT CLI skill
 [`SKILL.md`]([file:///.agents/skills/litert_cli/SKILL.md]\(https://github.com/google-ai-edge/LiteRT-CLI/blob/main/.agents/skills/litert_cli/SKILL.md\))
@@ -121,6 +122,19 @@ into your AI coding agent (like Google Antigravity) and try prompts such as:
 *   "Compile LiteRT model `litert-community/efficientnet_b1` for NPU target
     `sm8750`"
 *   "Visualize LiteRT model `litert-community/efficientnet_b1`"
+*   "Download the FP32 EfficientNet model `litert-community/efficientnet_b1` from
+    HuggingFace. Quantize it to INT8 dynamic range (`--recipe dynamic_wi8_afp32`),
+    then benchmark both the original FP32 model and the newly quantized INT8 model
+    on the GPU of my connected Android device. Compare the average latency and
+    report the throughput speedup."
+*   "Convert the model `Qwen/Qwen1.5-0.5B-Chat` from HuggingFace Hub to LiteRT format,
+    and run it locally using the prompt 'Explain edge machine learning in one sentence'."
+*   "Download EfficientNet from the HuggingFace repo `litert-community/efficientnet_b1`.
+    Offline compile (AOT) the model for the `sm8750` target NPU, and output
+    the compiled model into `./models/compiled`. Then, run an on-device inference
+    and benchmark using this newly compiled AOT model on the connected Android
+    device's NPU (`--npu`). Confirm that the graph loads directly without
+    dynamic JIT compilation warmup latency."
 
 The agent will automatically install the necessary tools, including Python
 virtual environments, `litert-cli-nightly`, and all required dependencies.
@@ -137,7 +151,27 @@ Verified in Python 3.13.
     *   Windows: partially supported
 *   **Android**:
     *   CPU, GPU
-    *   NPU: Qualcomm (supported), MediaTek (soon), Google Tensor (soon)
+    *   NPU: Qualcomm, MediaTek (soon), Google Tensor (soon)
+
+--------------------------------------------------------------------------------
+
+### Troubleshooting & Tips
+
+* Always activate the virtual environment before running the `litert` command, to avoid conflicts.
+* When `uv` fails to resolve dependencies, try setting the environment variable
+  `export UV_INDEX_URL=https://pypi.org/simple` before running the `uv` command.
+* `litert compile` currently only runs on Linux, and it requires Clang
+  version `18.x.x` or above. Try
+  `sudo apt install clang libc++-dev libc++abi-dev`.
+* When a run fails on GPU with the `--gpu` flag, pass both `--gpu --cpu` flags
+  in the command; the CLI will then try GPU first and fall back to CPU if GPU fails.
+* When `litert run` fails on an Android device because the device is not detected, try
+  running `adb kill-server && adb start-server` first.
+* When benchmarking with the `--gcp` flag, you need to:
+  1) [Join the EAP program in Google AI Edge Portal](https://ai.google.dev/edge/ai-edge-portal);
+  2) Log in to GCP using `gcloud auth login`;
+  3) Set your GCP project using `--gcp=<Your-GCP-Project>`.
+* When `litert visualize` fails to launch Model Explorer, try running `litert visualize --stop-all` first.
 
 --------------------------------------------------------------------------------
 
@@ -202,7 +236,8 @@ litert quantize model.tflite \
 
 ### 4. Compile a LiteRT model for NPU AOT
 
-> [!NOTE] Currently only supported on Linux hosts and Qualcomm NPUs.
+> [!NOTE]
+> Currently only supported on Linux hosts and Qualcomm NPUs; support for other NPUs is coming soon.
 
 ```bash
 # Basic compilation for specific Qualcomm NPU (e.g., sm8750 in Xiaomi 15 Pro)
@@ -236,12 +271,12 @@ litert run model_sm8450.tflite --android --npu
 # Run multiple iterations and print output tensors
 litert run model.tflite \
   --iterations 5 \
-  --print_tensors
+  --print-tensors
 
 # Run with custom input formats (supports image, raw binary, numpy array)
 litert run model.tflite \
   --input "image.png" \
-  --print_tensors
+  --print-tensors
 ```
 
 ### 6. Benchmark a model's performance
@@ -311,19 +346,27 @@ litert list my_model
 litert delete my_model
 ```
 
-### 11. Run a generative LLM model using LiteRT-LM CLI
+### 11. Run and benchmark a generative LLM model using LiteRT-LM CLI
 
 ```bash
-# Run a generative LLM model
+# Run a generative LLM model loaded from Hugging Face
+litert lm run \
+  --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
+  --prompt="What is the capital of France?"
+
+# Or load from a local LLM model file
 litert lm run gemma-4-E2B-it.litertlm
 
 # Example with a custom prompt
 litert lm run gemma-4-E2B-it.litertlm --prompt "Hello, how are you?"
+
+# Benchmark a generative LLM model
+litert lm benchmark gemma-4-E2B-it.litertlm
 ```
 
 ### 12. Clean up all caches
 
 ```bash
-# Clean up model cache, etc.
+# Clean up local caches, such as model files and binaries.
 litert clean
 ```
diff --git a/examples/commands/convert_test.sh b/examples/commands/convert_test.sh
@@ -29,23 +29,15 @@ echo -e "\n${BLUE}${BOLD}--- 1. Generic Script Mode (resnet18.py) ---${NC}"
 run_case "Convert: PyTorch ResNet18 Base" \
     litert convert resnet18.py --output models/resnet18_base
 
-# 1.2 Conversion with Quantization (pt2e_dynamic)
-run_case "Convert: PyTorch ResNet18 with PT2E Dynamic Quantization" \
-    litert convert resnet18.py --output models/resnet18_pt2e --quantize pt2e_dynamic
-
-# 1.3 Conversion with Model Args (e.g. batch_size=4)
+# 1.2 Conversion with Model Args (e.g. batch_size=4)
 run_case "Convert: PyTorch ResNet18 with Model Args (batch_size=4)" \
     litert convert resnet18.py --output models/resnet18_b4 --model-args "batch_size=4"
 
-# 1.4 Conversion with Quantization (pt2e_per_channel)
-run_case "Convert: PyTorch ResNet18 with PT2E Per-Channel Quantization" \
-    litert convert resnet18.py --output models/resnet18_pt2e_pc --quantize-recipe pt2e_per_channel
-
-# 1.5 Conversion with Quantization (dynamic_wi8_afp32)
+# 1.3 Conversion with Quantization (dynamic_wi8_afp32)
 run_case "Convert: PyTorch ResNet18 with Dynamic INT8 Recipe" \
     litert convert resnet18.py --output models/resnet18_dyn_wi8 --quantize-recipe dynamic_wi8_afp32
 
-# 1.6 Conversion with Quantization (weight_only_wi8_afp32)
+# 1.4 Conversion with Quantization (weight_only_wi8_afp32)
 run_case "Convert: PyTorch ResNet18 with Weight-Only INT8 Recipe" \
     litert convert resnet18.py --output models/resnet18_wo_wi8 --quantize-recipe weight_only_wi8_afp32
 

diff --git a/examples/models/gemma4.sh b/examples/models/gemma4.sh
@@ -24,9 +24,9 @@ source "$(dirname "${BASH_SOURCE[0]}")/../utils.sh"
 setup_test_env "gemma4" "Gemma4 LLM Demo Script"
 
 # --- 1. Convert HuggingFace Model google/gemma-4-E2B-it ---
-# TODO: Bring this back when we add support for --externalize_embedder in CLI convert command.
+# TODO: Re-enable after the LiteRT Torch release.
 # run_case "Convert: HuggingFace google/gemma-4-E2B-it" \
-#     litert convert google/gemma-4-E2B-it --output "models/gemma4"
+#    litert convert google/gemma-4-E2B-it --output "models/gemma4"
 
 # --- 2. Run Gemma4 Generative LLM Model ---
 run_case "Run Gemma4: Generative inference with custom prompt" \

diff --git a/examples/models/resnet.sh b/examples/models/resnet.sh
@@ -65,6 +65,8 @@ if has_android_device; then
         litert run "$RESNET_TFLITE" --android --cpu --iterations 1
 
     # ResNet18 PADV2 op is currently not fully supported by Android OpenCL GPU delegate.
+    run_case "Run: ResNet18 FP32 on Android GPU with CPU fallback" \
+        litert run "$RESNET_TFLITE" --android --cpu --gpu --iterations 1
 
     run_case "Run: ResNet18 Dynamic INT8 on Android (CPU)" \
         litert run "models/resnet18/resnet18_int8_dynamic.tflite" --android --cpu --iterations 1

diff --git a/litert_cli/commands/convert/huggingface.py b/litert_cli/commands/convert/huggingface.py
@@ -31,7 +31,6 @@
 from ai_edge_litert.aot import aot_compile as aot_lib
 from ai_edge_litert.aot.ai_pack import export_lib as ai_pack_export
 
-
 def convert_huggingface(
     model: str,
     output: str,
@@ -69,10 +68,16 @@ def convert_huggingface(
           model, trust_remote_code=True
       )
       architectures = getattr(config, "architectures", [])
-      if not any("CausalLM" in arch for arch in architectures):
+      is_causal_lm = any("CausalLM" in arch for arch in architectures)
+      is_gemma3 = any("Gemma3ForConditionalGeneration" in arch for arch in architectures)
+      is_gemma3n = any("Gemma3nForConditionalGeneration" in arch for arch in architectures)
+      is_gemma4 = any("Gemma4ForConditionalGeneration" in arch for arch in architectures)
+      is_gemma_vlm = is_gemma3 or is_gemma3n or is_gemma4
+
+      if not (is_causal_lm or is_gemma_vlm):
         raise ValueError(
-            f"Currently only AutoModelForCausalLM is supported. Model '{model}'"
-            f" has architectures {architectures}."
+            f"Currently only AutoModelForCausalLM or Gemma VLM architectures (Gemma3, Gemma3n, Gemma4) are supported. "
+            f"Model '{model}' has architectures {architectures}."
         )
     except Exception as e:
       if isinstance(e, ValueError):
@@ -84,15 +89,28 @@ def convert_huggingface(
 
     # Call the auto-export function from litert_torch.
     # It automatically saves to the output.
+    task = "text_generation"
+    export_kwargs = {}
+    use_jinja_template = is_gemma4
+    if is_gemma_vlm:
+      task = "image_text_to_text"
+      export_kwargs["export_vision_encoder"] = True
+      export_kwargs["externalize_embedder"] = True
+      if is_gemma4:
+        export_kwargs["jinja_chat_template_override"] = "litert-community/gemma-4-E2B-it-litert-lm"
+
+
     hf_export.export(
         model=model,
         output_dir=output,
-        task="text_generation",
+        task=task,
         quantization_recipe=quantize,
         prefill_lengths=parsed_prefill,
         cache_length=cache_length,
         bundle_litert_lm=bundle_litert_lm,
-        use_jinja_template=False,
+        trust_remote_code=False,
+        use_jinja_template=use_jinja_template,
+        **export_kwargs,
     )
 
     if target:
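
Review note: the architecture checks and export-argument selection added in `convert_huggingface` could be isolated in a pure helper, which would make them unit-testable and would keep `is_gemma4` and `is_gemma_vlm` from being referenced outside the `try` block that defines them. The sketch below is a suggestion, not the code in this change: `select_export_options` and `GEMMA_VLM_ARCHITECTURES` are hypothetical names, and treating a `None` architectures value as an empty list is an added assumption. The keyword names and values are copied from the hunk above.

```python
GEMMA_VLM_ARCHITECTURES = (
    "Gemma3ForConditionalGeneration",
    "Gemma3nForConditionalGeneration",
    "Gemma4ForConditionalGeneration",
)


def select_export_options(architectures):
    """Return (task, use_jinja_template, export_kwargs) for a model's architectures."""
    archs = architectures or []  # Assumption: tolerate a config whose field is None.
    is_causal_lm = any("CausalLM" in arch for arch in archs)
    is_gemma4 = any("Gemma4ForConditionalGeneration" in arch for arch in archs)
    is_gemma_vlm = any(
        name in arch for arch in archs for name in GEMMA_VLM_ARCHITECTURES
    )

    if not (is_causal_lm or is_gemma_vlm):
        raise ValueError(
            "Currently only AutoModelForCausalLM or Gemma VLM architectures "
            f"(Gemma3, Gemma3n, Gemma4) are supported. Got {archs}."
        )

    # Defaults match the text-only path in the diff.
    task = "text_generation"
    export_kwargs = {}
    if is_gemma_vlm:
        task = "image_text_to_text"
        export_kwargs["export_vision_encoder"] = True
        export_kwargs["externalize_embedder"] = True
        if is_gemma4:
            export_kwargs["jinja_chat_template_override"] = (
                "litert-community/gemma-4-E2B-it-litert-lm"
            )
    return task, is_gemma4, export_kwargs


print(select_export_options(["Gemma4ForConditionalGeneration"]))
```

With such a helper, the call site would unpack the tuple and pass `task`, `use_jinja_template`, and `**export_kwargs` to `hf_export.export` as the diff already does.
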