
feat(npu): restore NPU acceleration on upstream ORT via raw OpenVINO bypass #7

Open: Smiie-2 wants to merge 23 commits into AssafWoo:main from Smiie-2:feat/npu-on-ort


Smiie-2 commented May 2, 2026


Summary

Adds opt-in Intel NPU acceleration for the embedding model via a direct OpenVINO C-API path. Default behavior (CPU via ORT) is unchanged.

What this adds

  • New openvino cargo feature on panda-core (forwarded by panda), off by default
  • ccr-core/src/ov_embed.rs — raw OpenVINO embedder (loads ONNX from HF-hub, reshapes to static shapes, compiles via OpenVINO, runs a warm InferRequest pool)
  • OV_EMBEDDER static + get/preload/is_active accessors in summarizer
  • OV dispatch wired ahead of ORT-CPU in embed_and_normalize and embed_direct
  • Daemon eagerly preloads OV at startup so the multi-second NPU compile happens once
  • New execution_provider = "auto" | "cpu" | "npu" field in [global] (also PANDA_NPU env var)
  • PANDA_NPU_STRICT=1 makes OV init failure fatal in the daemon (default is graceful CPU fallback)
  • README "NPU support (opt-in, Intel only)" section: build flag, library discovery order, config knobs, env vars, OV cache location, warm-pool sizing

How it dispatches

  1. execution_provider != "npu" → unchanged ORT-CPU path
  2. execution_provider == "npu" and openvino feature enabled → OV embedder
  3. OV init or inference failure → degrade to ORT-CPU (or hard-fail under PANDA_NPU_STRICT=1)

Without the cargo feature flag this is a no-op for existing CPU users.
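
The three dispatch rules above can be sketched as a pure function. This is a minimal illustration, not the code in this PR: `Backend`, `choose_backend`, and the `ov_ok` and `strict` parameters are hypothetical names, and the resolved provider is assumed to arrive as a plain string.

```rust
/// Which backend ends up serving an embed call (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Backend {
    OvNpu,
    OrtCpu,
}

/// Hypothetical pure function mirroring the three dispatch rules.
/// `openvino_feature` stands for "built with --features openvino",
/// `ov_ok` for "OV init and inference succeeded",
/// `strict` for "PANDA_NPU_STRICT=1 is set".
pub fn choose_backend(
    execution_provider: &str,
    openvino_feature: bool,
    ov_ok: bool,
    strict: bool,
) -> Result<Backend, String> {
    // Rule 1: anything other than "npu", or a build without the feature,
    // keeps the unchanged ORT-CPU path.
    if execution_provider != "npu" || !openvino_feature {
        return Ok(Backend::OrtCpu);
    }
    // Rule 2: NPU requested, feature compiled in, and OV is healthy.
    if ov_ok {
        return Ok(Backend::OvNpu);
    }
    // Rule 3: OV failed. Degrade to CPU unless strict mode asks for a hard failure.
    if strict {
        Err("OpenVINO failed and PANDA_NPU_STRICT=1 is set".to_string())
    } else {
        Ok(Backend::OrtCpu)
    }
}
```

Keeping the decision in a pure function like this makes each rule unit-testable without touching OpenVINO or NPU hardware.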

Tests

  • ccr-core/tests/npu_smoke.rs — asserts ov_embedder_is_active() after embedding (catches silent CPU fallback)
  • ccr-core/tests/npu_fallback.rs — verifies CPU fallback when libopenvino_c.so is hidden via OPENVINO_LIB_PATH=/dev/null
  • Unit tests for ov_embed pure functions
  • Both integration tests live in their own binaries to isolate the OV_EMBEDDER OnceCell across processes
  • Tests are gated behind OPENVINO_NPU_AVAILABLE=1 so they only run on hardware with an NPU
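
The `OPENVINO_NPU_AVAILABLE=1` gate on the last bullet could look roughly like the sketch below. The helper names are hypothetical, and treating only the exact value `1` as enabled is an assumption; the real tests may accept other values.

```rust
/// Hypothetical helper: true only when the variable is set to exactly "1".
/// Taking the value as a parameter keeps the check testable without
/// mutating the process environment.
pub fn npu_tests_enabled(value: Option<&str>) -> bool {
    value == Some("1")
}

/// Reads the real environment variable named in this PR.
pub fn npu_tests_enabled_from_env() -> bool {
    npu_tests_enabled(std::env::var("OPENVINO_NPU_AVAILABLE").ok().as_deref())
}
```

An integration test would then start with `if !npu_tests_enabled_from_env() { return; }`, so machines without an NPU report the test as passed rather than failed.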

Verified on

Intel Meteor Lake NPU 3720, OpenVINO 2024 runtime. Cold init ~760ms, warm embed ~540ms, InferRequest pool size 4. Smoke + fallback tests pass; panda run cargo build logs [panda] embedder: AllMiniLML6V2 on NPU (raw OpenVINO).

Supported models on the NPU path

AllMiniLML6V2 (default) and AllMiniLML12V2. Other models in the broader enum still use the ORT-CPU path. Adding more is a one-line entry in model_seq_len.

Smiie-2 added 23 commits May 1, 2026 19:56
Design for re-grafting NPU acceleration as an opt-in ort OpenVINO EP feature on top of upstream/main (post-ort-migration), with the embedding daemon owning the warm NPU session.
…/openvino so that --features openvino enables the OpenVINO execution provider for Intel NPU/GPU/CPU. No default-build change; existing builds are byte-identical.
New global field defaults to 'auto'; values 'cpu' | 'npu' force a specific ORT execution provider. PANDA_NPU env var will override at runtime (wired in a follow-up commit).
Resolves config string + PANDA_NPU env override into a concrete EP name. Pure function — no ORT integration yet. Unit-tested for the four relevant cases: feature-off auto, explicit cpu, unknown value, env override.
MiniLmEmbedder::new now builds [OpenVINO(NPU), CPU] when the openvino feature is on and the resolved EP is npu; falls back to [CPU] only on session-creation failure (unless PANDA_NPU_STRICT=1). CPU is always the final EP so ORT per-op fallthrough handles unsupported nodes.

Logs which EP actually loaded on every session creation.
Both the embedding daemon and the foreground panda process now call set_execution_provider after loading config, mirroring the existing set_model_name / set_ort_threads pattern.
Verifies that with --features openvino, the embedder produces the expected shape and L2-normalised vectors via OpenVINO NPU. Also verifies that hiding libopenvino_c.so triggers the CPU fallback path without surfacing the error.

Skipped silently unless OPENVINO_NPU_AVAILABLE=1, so CI machines without an NPU don't fail.
Without this, cargo build -p panda --features openvino fails because the panda binary crate has no such feature. Mirrors the panda-core/Cargo.toml addition from the same series.
Successor to the ORT-EP design after empirical verification showed ort 2.0.0-rc.12 cannot engage OpenVINO EP on Meteor Lake. Restores the archived ov_embed.rs as a near-verbatim port on top of Approach A scaffolding.
Approach A's closure-based EP-list with CPU fallback retry was scaffolding for the ort OpenVINO EP path, which empirical verification showed cannot engage on Meteor Lake with ort 2.0.0-rc.12. The raw OpenVINO bypass (next commits) makes the closure dead code. Reverting to upstream's exact builder makes that obvious and shrinks the diff.

MiniLmEmbedder is now strictly the CPU fallback path.
Replaces the broken ort/openvino EP wiring with the openvino + openvino-sys git deps from intel/openvino-rs (rev e25f1f848edc, runtime-linking). Same rev that worked on this hardware before the upstream merge — proven path.

No code consumes the new deps yet; that's added in the next commits.
Near-verbatim port from archive/pre-upstream-merge. Two surgical edits: 1) model_onnx_info replaced with a 2-arm model_seq_len matching upstream's slim model_registry (full table follows in a separate restore-models commit); 2) find_fastembed_onnx removed because hf_hub-based summarizer::resolve_model_files supersedes it.

Public surface: try_new, embed, is_degraded, mark_degraded, model_seq_len, ov_lib_path. Module is gated on --features openvino.
Lazy and eager construction APIs for the raw OpenVINO embedder, mirroring the existing MODEL_CACHE pattern. ov_embedder_is_active is exposed for the smoke test so NPU engagement can be asserted (not just configured).

Not yet wired into embed_and_normalize — that's the next commit.
embed_and_normalize and embed_direct both try the OV embedder first when --features openvino is on AND current_ep() == 'npu'. Falls through to ORT-CPU on any failure (including the first call after mark_degraded).

Also moves the CPU 'embedder loaded' log into get_model's init closure so it only fires when ORT actually constructs a session — fixes Approach A's false-NPU log bug. current_ep is promoted from pub(crate) to pub for the daemon's eager-preload call (next commit).
When the binary is built with --features openvino and the resolver picks NPU, the daemon now triggers OvEmbedder::try_new during start (after set_*, before bind). Multi-second NPU compile happens once at daemon-start, not on the first client embed call.

Init failure caches None in OV_EMBEDDER and embeds transparently fall to CPU.
Five tests covering ov_lib_path resolution (env unset, env-file form, env-dir form), model_seq_len lookup, and the DEGRADED flag's idempotence. None touch openvino-sys or NPU.

ENV_LOCK Mutex serialises tests that mutate process env, so cargo's default parallel test runner doesn't race them.
Approach A's assertions (3 vectors, 384 dim, L2-normalised) all hold whether NPU or CPU ran the inference, so the test passed even when OpenVINO EP silently failed to engage. Now we additionally assert summarizer::ov_embedder_is_active(), which is only true when OvEmbedder::try_new succeeded. Catches false-NPU regressions.

Skipped silently unless OPENVINO_NPU_AVAILABLE=1, same as before.
Approach A's section described the ort/openvino EP path. The bypass uses openvino-rs directly with a different runtime story: runtime-linking (no link-time deps), library finder, ~/.cache/panda/openvino blob cache, async pool. Adds notes on PANDA_NPU_STRICT and PANDA_NPU_PRECISION envs.
…ell cross-pollution

Both tests touch the OV_EMBEDDER OnceCell. When run together in one process the fall-back test (which sets OPENVINO_LIB_PATH=/dev/null) caches None first; the actually-uses-NPU test then can't recover.

Each integration test file is its own crate, so moving the fall-back test into npu_fallback.rs gives them separate processes and isolates OnceCell state.
preload_ov_embedder now returns Err when OPENVINO_NPU_AVAILABLE-style init fails AND PANDA_NPU_STRICT=1 is set. Daemon exits 1 on that Err so silent CPU fallback can't mask a misconfigured NPU during diagnosis.
Smiie-2 requested a review from AssafWoo as a code owner May 2, 2026 16:37

AssafWoo left a comment

Owner


Really glad to see this — the raw OpenVINO bypass is the right call given the ORT rc.12 issues you documented. The design doc is thorough, and the cfg(feature = "openvino") gating means CPU users see exactly zero change. Full test suite passes locally (474 tests, 0 failed).

A few things to look at before merging:

ov_embed.rs lines 107-112 — unsafe impl Send/Sync
The safety comment claims openvino-rs marks CompiledModel and InferRequest as Send/Sync, but since the library doesn't derive those traits the compiler isn't enforcing that claim. If a future openvino-rs update changes the internal threading guarantees this becomes silent UB with no compile error. Worth adding a comment that cites the specific openvino-rs commit that was verified, so anyone who bumps the rev knows what to re-check.
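
One possible shape for such a comment is sketched below. `RawHandle` is a hypothetical stand-in for the real wrapper types, and the rev placeholder is left unfilled on purpose, since only whoever did the audit can name the commit that was checked.

```rust
use std::ffi::c_void;

/// Hypothetical stand-in for a type that wraps a raw OpenVINO handle.
/// Raw pointers are neither Send nor Sync, so the compiler will not
/// derive those traits for this struct automatically.
pub struct RawHandle(*mut c_void);

impl RawHandle {
    pub fn null() -> Self {
        RawHandle(std::ptr::null_mut())
    }

    pub fn as_ptr(&self) -> *mut c_void {
        self.0
    }
}

// SAFETY: this is a manual claim the compiler cannot check.
// Audited against openvino-rs rev: <fill in the verified commit>.
// Re-audit the upstream threading guarantees before bumping that rev.
unsafe impl Send for RawHandle {}
unsafe impl Sync for RawHandle {}

/// Compile-time check: only instantiates for types that are Send + Sync.
pub fn assert_send_sync<T: Send + Sync>() {}
```

Calling `assert_send_sync::<RawHandle>()` in a unit test documents the intent, although it only proves the `unsafe impl` lines exist, not that the claim behind them is true.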

ov_embed.rs lines 37-44 — ARM lib discovery
OV_C_LIB_CANDIDATES only has x86_64 Linux paths. Auto-discovery silently fails on aarch64-linux-gnu, leaving OPENVINO_LIB_PATH as the only option. Adding /usr/lib/aarch64-linux-gnu/libopenvino_c.so would cover ARM servers with a standard distro install.
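
A rough sketch of per-architecture discovery follows. Only the aarch64 path comes from this comment; the x86_64 path is an assumed multiarch location rather than a copy of the real candidate list, and `first_existing` is a hypothetical helper that takes the existence check as a parameter so it can be tested without touching the filesystem.

```rust
use std::path::{Path, PathBuf};

/// Candidate library paths for the current target architecture.
/// The x86_64 entry is an assumption for illustration.
pub fn ov_c_lib_candidates() -> &'static [&'static str] {
    if cfg!(target_arch = "aarch64") {
        &["/usr/lib/aarch64-linux-gnu/libopenvino_c.so"]
    } else {
        &["/usr/lib/x86_64-linux-gnu/libopenvino_c.so"]
    }
}

/// Returns the first candidate for which `exists` reports true.
pub fn first_existing(
    candidates: &[&str],
    exists: impl Fn(&Path) -> bool,
) -> Option<PathBuf> {
    for candidate in candidates {
        let path = Path::new(candidate);
        if exists(path) {
            return Some(path.to_path_buf());
        }
    }
    None
}
```

In production the predicate would be something like `|p| p.is_file()`, with `OPENVINO_LIB_PATH` still taking priority over auto-discovery.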

ov_embed.rs lines 201-211 — empty pool race
embed() drains the entire request pool with std::mem::take before spawning workers. If two callers hit embed() concurrently the second one sees an empty pool, returns Err("NPU request pool empty"), and triggers mark_degraded, which permanently disables NPU for the rest of the process. In this PR alone the daemon is still single-threaded so it won't happen there, but once thread-per-connection lands (the daemon threading PR) concurrent embed calls from the daemon become realistic. Worth thinking about a Mutex or channel-based pool that serializes callers rather than failing the second one.
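
As one way to picture the Mutex-based option, here is a minimal blocking pool built on `Mutex` and `Condvar` from the standard library. `BlockingPool` is a hypothetical type, and handing out one item at a time is an assumption; the real `embed()` takes the whole pool at once, so an actual fix may look different.

```rust
use std::sync::{Condvar, Mutex};

/// Hypothetical pool in which a second caller waits instead of failing.
pub struct BlockingPool<T> {
    items: Mutex<Vec<T>>,
    available: Condvar,
}

impl<T> BlockingPool<T> {
    pub fn new(items: Vec<T>) -> Self {
        Self {
            items: Mutex::new(items),
            available: Condvar::new(),
        }
    }

    /// Blocks until an item is free, then hands it out.
    /// Blocks forever if the pool was created empty and nothing is released.
    pub fn acquire(&self) -> T {
        let mut guard = self.items.lock().unwrap();
        loop {
            if let Some(item) = guard.pop() {
                return item;
            }
            // Releases the lock while waiting, reacquires it on wake-up.
            guard = self.available.wait(guard).unwrap();
        }
    }

    /// Puts an item back and wakes one waiting caller.
    pub fn release(&self, item: T) {
        self.items.lock().unwrap().push(item);
        self.available.notify_one();
    }
}
```

With this shape an empty pool means "busy" rather than "broken", so a concurrent caller never reaches the `mark_degraded` path just because another embed is in flight.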

config.rs line 113 — free-form string field
execution_provider is a String, so a typo like "nup" or "GPU" silently falls back to CPU with no error. An enum with #[serde(rename_all = "lowercase")] would catch bad config values at parse time instead of at runtime when the user wonders why NPU isn't activating.
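
To illustrate the idea with the standard library only, the sketch below rejects unknown values through `FromStr`. In the real config a serde derive with `rename_all = "lowercase"` would presumably give the same behavior at deserialization time; the enum name and error text here are assumptions.

```rust
use std::str::FromStr;

/// Hypothetical typed replacement for the free-form string field.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExecutionProvider {
    Auto,
    Cpu,
    Npu,
}

impl FromStr for ExecutionProvider {
    type Err = String;

    /// Accepts exactly the three lowercase values; everything else is an error.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "auto" => Ok(Self::Auto),
            "cpu" => Ok(Self::Cpu),
            "npu" => Ok(Self::Npu),
            other => Err(format!(
                "unknown execution_provider {other:?}; expected \"auto\", \"cpu\", or \"npu\""
            )),
        }
    }
}
```

With this, `"nup"` and `"GPU"` fail when the config is loaded instead of silently resolving to CPU.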
