Engram
A KG-augmented RAG library with iterative retrieve-and-reason for multi-hop questions.
Status: pre-alpha. Not on PyPI yet — install from source (Install). API will change before v0.1.0. Full evolution in CHANGELOG.md.
Engram is a Python library for production RAG with three opt-in capabilities most other libraries don't ship:
- IRCoT (iterative retrieve-and-reason). Round 1 retrieves; reader emits a CoT thought; round 2 retrieves with the thought as augmented query. +0.09 F1 over single-pass on MuSiQue at gpt-4o-mini reader.
- A knowledge-graph layer. Entity + fact extraction at ingest; two-stage Personalized PageRank + multi-hop beam search + triple-vector ANN match + RRF fusion at query time. Opt-in via --kg-retrieval. Adds graph capabilities; F1 lift over baseline+IRCoT is neutral on MuSiQue (use it for the capabilities, not for the benchmark).
- A strategic router (adaptive per-query orchestration). One token-minimal LLM call per query decides which capabilities to enable (IRCoT, KG traversal, decomposition, MQE, retrieval planner) based on the question's structure. Statistical parity with the benchmark-tuned static config at ~40% lower median latency. Opt-in via --adaptive. See Adaptive strategic router.
Production-quality components — BM25 + dense + RRF, Cohere Rerank 3.5, Jaccard dedup — are wired into the default pipeline.
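To make the IRCoT bullet above concrete, here is a minimal sketch of a two-round retrieve-and-reason loop. It is not Engram's implementation: `ircot_two_round`, the toy corpus, and the `toy_*` stand-ins for the retriever and reader are all hypothetical names invented for illustration, and the real pipeline uses an LLM reader and hybrid retrieval rather than keyword overlap.

```python
import re
from typing import Callable, List, Tuple

Retrieve = Callable[[str, int], List[str]]
Read = Callable[[str, List[str]], str]


def ircot_two_round(
    question: str, retrieve: Retrieve, think: Read, answer: Read, k: int = 2
) -> Tuple[str, List[str]]:
    """Two-round retrieve-and-reason; returns (answer, evidence passages)."""
    first = retrieve(question, k)                   # round 1: raw question
    thought = think(question, first)                # reader emits a CoT thought
    second = retrieve(f"{question} {thought}", k)   # round 2: thought-augmented query
    evidence = list(dict.fromkeys(first + second))  # merge, drop exact repeats
    return answer(question, evidence), evidence


# --- Toy stand-ins (assumptions) so the sketch runs without any API calls ---
STOPWORDS = {"what", "is", "the", "of", "s", "was", "in"}
CORPUS = [
    "Paris is the capital of France.",
    "Marie Curie was born in Warsaw.",
    "Warsaw is the capital of Poland.",
]
QUESTION = "What country is Marie Curie's birthplace the capital of?"


def _tokens(text: str) -> set:
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}


def toy_retrieve(query: str, k: int) -> List[str]:
    # Rank passages by keyword overlap with the query (ties keep corpus order).
    q = _tokens(query)
    return sorted(CORPUS, key=lambda p: len(q & _tokens(p)), reverse=True)[:k]


def toy_think(question: str, passages: List[str]) -> str:
    # Pretend the reader's thought restates its best passage.
    return passages[0]


def toy_answer(question: str, passages: List[str]) -> str:
    # Pretend the reader can only answer once the bridge passage is present.
    for p in passages:
        if p.startswith("Warsaw is the capital of"):
            return p.rstrip(".").split()[-1]
    return "unknown"
```

In this toy setup, single-pass retrieval with `k=2` never surfaces the Warsaw passage, while the second round does, because the thought introduces the bridge entity into the query.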
Reference on MuSiQue dev (benchmarks/fixtures/musique_n200_seed1_ids.json, gpt-4o-mini reader, text-embedding-3-small embedder, Cohere Rerank 3.5 via AWS Bedrock):
| Config | F1 | EM | Notes |
|---|---|---|---|
| Plain hybrid (no rerank) | ~0.40 | — | Field reference floor |
| + Cohere Rerank ("fast mode") | 0.46 | 0.32 | Engram's no-IRCoT default |
| + IRCoT (production v1) | 0.54 | 0.40 | Headline number |
| + KG-hybrid retrieval | 0.51-0.53 | 0.36-0.39 | Adds graph capabilities; no F1 lift over IRCoT |
| + Graph-aware retrieval planner (opt-in) | within noise | within noise | One LLM call up-front emits a typed plan that biases beam search + rerank — see below |
| Adaptive strategic router (--adaptive) | 0.52 | 0.38 | Within run variance of production v1 at ~40% lower median latency (p50 8.9s vs 14.5s); skips KG traversal on 48% of queries and the second reader round on 24% |
| Field SOTA at this reader (G-reasoner) | 0.525 | 0.385 | Trained 8M-param GNN |
Engram baseline + IRCoT lands at field SOTA for gpt-4o-mini on n=200. Sample variance is ±0.02-0.03 F1 across reruns. Reproduce with benchmarks/musique.py; methodology in docs/benchmarks.md.
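For readers unfamiliar with the F1 and EM columns, the sketch below shows the common SQuAD-style definitions: answers are normalized, EM checks string equality, and F1 is the harmonic mean of token precision and recall. Whether benchmarks/musique.py normalizes in exactly this way is an assumption; docs/benchmarks.md remains the authority on methodology. The function names are illustrative.

```python
import re
import string
from collections import Counter

_PUNCT = set(string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))


def token_f1(pred: str, gold: str) -> float:
    p, g = normalize(pred).split(), normalize(gold).split()
    if not p or not g:
        # Both empty counts as a match; one empty counts as a miss.
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

When a question has several acceptable gold answers, a typical convention is to take the maximum score over them.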

    Document → Chunker → [optional: Engram.aenrich] → Embedder → Vector DB
                                      ↑
             builds KG (entity graph, bi-temporal supersession)
                        via cold-path LLM extraction
                                      │
                                      ▼
              Retriever → [Engram retrieval modes] → Reader
                                      ↑
                           hybrid + Cohere rerank
                        + IRCoT + optional KG fusion

Engram is two things at once:
- An ingestion enricher (Engram.aenrich, opt-in via build_graph=True) that builds a knowledge graph alongside your chunks.
- A retrieval lift layer (hybrid + Cohere rerank + IRCoT + optional KG fusion at query time). Today this lives in benchmarks/; it will be promoted to a stable library API in v0.1.0.
Not on PyPI yet. Package name reserved for v0.1.0.

    git clone https://github.com/Vrin-cloud/engram
    cd engram
    uv venv && uv pip install -e ".[memory,llm,benchmarks,observability]"

| Extra | Brings in |
|---|---|
| memory | lmdb, hnswlib, numpy, networkx, scipy (the default MemoryBackend) |
| llm | litellm, instructor, tenacity (the LLMProvider stack) |
| benchmarks | datasets, rank-bm25 (to run the MuSiQue benchmark) |
| observability | opentelemetry (tracing spans for ingest + query) |
| all | every extra |

| Var | Used by |
|---|---|
| OPENAI_API_KEY | text-embedding-3-small (embedder), gpt-4o-mini (reader) defaults |
| AWS credentials (default chain) | Cohere Rerank 3.5 on Bedrock; region defaults to us-east-1 |
| ANTHROPIC_API_KEY | only if you pass --reader-model anthropic/claude-haiku-4-5 |

    python -m benchmarks.musique \
      --question-ids-file benchmarks/fixtures/musique_n200_seed1_ids.json \
      --mode baseline \
      --rerank \
      --ircot \
      --output predictions.jsonl

Expected: F1 ~0.54, EM ~0.40, ~5 min wall time, ~$1.50 in API costs. This is the production v1 default (hybrid + Cohere rerank + Jaccard dedup + IRCoT) on the canonical 200-question MuSiQue fixture.

    python -m benchmarks.musique \
      --question-ids-file benchmarks/fixtures/musique_n200_seed1_ids.json \
      --mode enriched \
      --build-graph --kg-retrieval --rerank --ircot \
      --disable-derivation --disable-bridging

--build-graph runs Engram's cold-path ingest (entity extraction → canonical resolution with alias persistence → fact extraction with literal-value sink → fact-triple embedding into a second hnswlib → bi-temporal supersession). --kg-retrieval fuses hybrid retrieval with triple-vector ANN + hub-weighted two-stage PPR + multi-hop confidence-decayed beam search via RRF.
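The RRF step mentioned above can be illustrated with a minimal Reciprocal Rank Fusion sketch: each ranked list contributes 1 / (k + rank) to an item's score, and items are sorted by the summed score. This is the generic technique, not Engram's merge_fact_strategies; the function name `rrf_fuse` is hypothetical, and k=60 is the commonly used constant, assumed here rather than taken from the codebase.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of item ids into one list, best first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based; items missing from a list simply add nothing.
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by id for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Example: three retrievers (say BM25, dense, and a graph walk) rank chunk ids.
fused = rrf_fuse([["a", "b", "c"], ["b", "c", "a"], ["b", "a"]])
```

Because only ranks are used, the fusion needs no score calibration between retrievers, which is why it suits mixing lexical, dense, and graph signals.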
Add --retrieval-planner to layer the graph-aware planner on top: one extra LLM call per query produces a structured RetrievalPlan (expected answer type, priority predicates, optional hop sequence) that biases beam search, post-fusion fact filtering, and the Cohere Rerank query. Default off; opt in when you want explainable retrieval traces. --trace-retrieval-plan PATH dumps every plan to JSONL for inspection.
Costs ~$0.40 per 1K chunks of cold-path + ~12 min/1K of ingest latency. Use it when you need graph queries, contradiction surfacing, or bi-temporal awareness — not for F1 lift on standard QA.

    python -m benchmarks.musique \
      --question-ids-file benchmarks/fixtures/musique_n200_seed1_ids.json \
      --mode enriched \
      --build-graph --rerank --adaptive

--adaptive puts capability selection under a per-query strategic router: one token-minimal LLM call (output is a bare list of capability tags, single-digit tokens) that reads the question's structure and decides what this specific query needs. The router picks from {ircot, kg_retrieval, decomposition, mqe, retrieval_planner}; hybrid retrieval, Cohere Rerank, and entity extraction are always on underneath.
The routing rests on one core distinction the prompt teaches explicitly:
- Depth (sequential chains: "the X of the Y", bridge entities that must be discovered before the next hop) → ircot. Decomposition measurably hurts depth chains (parallel sub-questions can't phrase later hops before earlier answers exist).
- Breadth (open-ended, multi-aspect: evaluations, memos, bear/bull cases) → decomposition, usually with kg_retrieval + ircot.
One hard safety rule: questions that don't explicitly name their subject ("this round", "the company") must enable kg_retrieval so the graph anchors the query on the corpus's central entities instead of guessing — this eliminates a measured subject-hallucination failure mode.
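As a rough illustration of how a router's bare tag list and the safety rule above might be handled downstream, consider the sketch below. It assumes the tags arrive comma- or whitespace-separated and that the "subject is explicitly named" judgment is made elsewhere and passed in as a boolean; both are assumptions, and `parse_router_tags` and `apply_safety_rule` are hypothetical helpers, not functions from strategic_router.py.

```python
import re
from typing import Set

# The capability set is taken from the text above.
CAPABILITIES = frozenset(
    {"ircot", "kg_retrieval", "decomposition", "mqe", "retrieval_planner"}
)


def parse_router_tags(raw: str) -> Set[str]:
    """Parse a bare tag list, silently dropping anything not in CAPABILITIES."""
    tags = {t.lower() for t in re.split(r"[,\s]+", raw.strip()) if t}
    return tags & CAPABILITIES


def apply_safety_rule(tags: Set[str], subject_is_named: bool) -> Set[str]:
    """Force kg_retrieval on when the question does not name its subject."""
    if not subject_is_named:
        return set(tags) | {"kg_retrieval"}
    return set(tags)


# "Evaluate this round" names no subject, so the graph must anchor the query.
plan = apply_safety_rule(parse_router_tags("decomposition, ircot"), subject_is_named=False)
```

Filtering against a fixed capability set keeps a malformed or over-eager model output from enabling anything the pipeline does not support.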
Measured on the n=200 MuSiQue fixture: EM 0.38 / F1 0.52 (within run variance of the static production v1 config) at p50 latency 8.9s vs 14.5s — the router skips graph traversal on 48% of queries and the second reader round on 24%. On an open-ended evaluation fixture (benchmarks/fixtures/pebble_corpus.jsonl) the same router with no per-domain tuning shifts its plan profile to kg 100% / decomposition 50% and outscores both the static config and plain hybrid. Routing triggers must reference observable question structure — an earlier prompt that asked the router to predict retrieval insufficiency (unobservable from question text) fired IRCoT on only 21% of multi-hop questions and cost 5 EM points.
Compare configurations on your own corpus with benchmarks/custom_kb.py: it runs standard (hybrid + rerank), vrin (static full stack), and vrin_adaptive (router) side by side over a JSONL corpus + question set and reports per-question deltas plus each routing decision.
Today's stable surface is the ingest API:

    import asyncio

    from engram import Engram
    from engram.backends.memory import MemoryBackend
    from engram.llm.embedders import LiteLLMEmbedder
    from engram.llm.litellm_provider import LiteLLMProvider
    from engram.llm.rate_control import AdaptiveConcurrency, TokenBucket

    embedder = LiteLLMEmbedder(model="openai/text-embedding-3-small")
    backend = MemoryBackend(embedder=embedder, path="./engram-data")
    llm = LiteLLMProvider(
        bucket=TokenBucket(rate=20.0, burst=25),
        adaptive=AdaptiveConcurrency(initial_limit=4, max_limit=12),
        default_model="openai/gpt-4o-mini",
    )
    engram = Engram(
        corpus_backend=backend,
        llm=llm,
        build_graph=True,
        enable_synthesis=False,
        enable_derivation=False,
        enable_bridging=False,
        model="openai/gpt-4o-mini",
    )

    async def main(your_chunks):
        # your_chunks: the chunks produced by your own chunker
        enriched = await engram.aenrich(your_chunks)
        # Each EnrichedChunk has: id, text, source_id, enrichment_summary, metadata
        # The backend's fact_graph (networkx.MultiDiGraph) holds entity ↔ fact structure
        # Stored facts queryable via backend.find_facts(...) and backend.neighbors_facts(...)
        return enriched

    # From synchronous code: enriched = asyncio.run(main(your_chunks))

The query-time lift layer (hybrid_neighbors, kg_hybrid_neighbors, IRCoT 2-round flow) currently lives in benchmarks/runner.py; it will be promoted to engram.retrieve / engram.iterative_query in v0.1.0.
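The ingest example passes rate and burst to TokenBucket. For intuition about what those two numbers usually mean, here is a generic token-bucket sketch: the bucket holds at most `burst` tokens and refills at `rate` tokens per second. This is not the class in engram.llm.rate_control; `SimpleTokenBucket`, its `try_acquire` method, and the per-second interpretation of `rate` are assumptions made for illustration.

```python
import time
from typing import Callable


class SimpleTokenBucket:
    """Generic token bucket: refill at `rate` tokens/sec, capped at `burst`."""

    def __init__(
        self, rate: float, burst: int, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self.rate = rate
        self.capacity = float(burst)
        self.tokens = float(burst)  # start full so an initial burst is allowed
        self.clock = clock
        self.updated = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; never blocks."""
        now = self.clock()
        # Refill for the elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With rate=20.0 and burst=25, such a bucket would admit up to 25 requests at once and then sustain roughly 20 per second; the injectable clock makes the behavior easy to test deterministically.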
| Doc | What's in it |
|---|---|
| docs/architecture.md | High-level architecture, data model, storage layout, ingest + query lifecycles |
| docs/benchmarks.md | MuSiQue methodology, fixtures, replay against existing indices, cost / latency profiles |
| docs/configuration.md | Every CLI flag and constructor parameter, with defaults and effects |
| docs/kg-internals.md | LMDB sub-db layout, fact graph schema, PPR / beam search internals |
| docs/llm-provider.md | LiteLLM routing, Instructor strict-mode, prompt caching, rate control |
| docs/concepts/ircot.md | IRCoT pattern explainer + why it works at our scale |
| docs/concepts/kg-retrieval.md | Triple match, two-stage PPR (PropRAG), beam search, RRF fusion |
| docs/concepts/synthesis-and-extraction.md | Hot path vs cold path, entity / fact extraction, pronoun resolution |
| docs/concepts/bi-temporal.md | Fact supersession, valid_from / valid_to / recorded_at semantics, Noisy-OR fusion |
python -m benchmarks.musique --help is the source of truth. Common flags:
| Flag | Default | Effect |
|---|---|---|
| --mode {baseline,enriched,both} | both | baseline skips cold path; enriched runs Engram.aenrich first |
| --rerank / --no-rerank | on | Cohere Rerank 3.5 via Bedrock |
| --ircot | off | 2-round retrieve-then-reason. +0.09 F1. |
| --build-graph | off | Cold-path KG ingest |
| --kg-retrieval | off | Triple match + 2-stage PPR + beam at query time |
| --disable-synthesis | off | Skip per-chunk synthesis (saves ~$0.30/1K + ~4 min/1K) |
| --disable-derivation | off | Skip cold-path derivation pass |
| --disable-bridging | off | Skip cold-path bridging pass |
| --adaptive | off | Strategic router decides capabilities per query. Requires --build-graph |
| --question-ids-file PATH | (none) | Pin to a question ID set for reproducibility |
| --data-dir PATH | ~/.engram-bench/... | Where LMDB indexes live |
Full reference: docs/configuration.md.
These reflect measurements, not theory. Full ablation log in docs/benchmarks.md.
- Jaccard dedup post-rerank is always on. Cheap quality win.
- IRCoT is opt-in but recommended as the default lift layer (+0.09 F1).
- Synthesis contributes ~+0.04 F1 only when KG retrieval is also on. Otherwise it's pure latency cost.
- Derivation and bridging were deferred in the v0 KG-hybrid plan; benchmark defaults are --disable-derivation --disable-bridging.
- MQE, decomposition, sufficiency-judge, and CRAG-style filter all regressed when stacked on top of baseline + IRCoT. Not in the production config. The adaptive router resolves the decomposition finding: it's a depth-vs-breadth mismatch. Decomposition hurts sequential factoid chains (where these ablations ran) but helps open-ended multi-aspect questions; the router applies it only to the latter.
- Graph-aware retrieval planner is opt-in via --retrieval-planner (default OFF). One LLM call per query reads a compressed view of the relevant fact-graph slice and emits a typed RetrievalPlan (expected answer type, priority predicates, optional hop sequence). The plan biases, but does not replace, beam search edge weighting, post-fusion fact filtering, and the Cohere Rerank query. Metric impact is within run-to-run variance at n=100; ship it for the capability (explainable retrieval, structured plan traces) rather than the EM/F1 number. Plumbing in src/engram/core/graph_view.py and src/engram/dialogue/retrieval_planner.py; prompt in src/engram/dialogue/prompts/retrieval_plan.py.
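The Jaccard dedup bullet in the list above refers to dropping near-duplicate chunks after reranking. A minimal sketch of the idea follows, assuming word-level token sets and a greedy pass that keeps the higher-ranked chunk; the 0.70 threshold matches the value listed for deduplicate_chunks, but the tokenization, the function names, and the choice of ">=" at the threshold are assumptions rather than Engram's actual code.

```python
import re
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """|A ∩ B| / |A ∪ B|, with two empty sets treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedup_ranked(chunks: List[str], threshold: float = 0.70) -> List[str]:
    """Walk chunks in rank order; keep one only if it is not too similar to any kept chunk."""
    kept: List[str] = []
    kept_tokens: List[Set[str]] = []
    for text in chunks:
        tokens = set(re.findall(r"\w+", text.lower()))
        if all(jaccard(tokens, seen) < threshold for seen in kept_tokens):
            kept.append(text)
            kept_tokens.append(tokens)
    return kept


ranked = [
    "Warsaw is the capital of Poland",
    "Warsaw is the capital of Poland.",  # near-duplicate of the first chunk
    "Paris is the capital of France",
]
deduped = dedup_ranked(ranked)
```

Running the pass after reranking means the surviving copy of each near-duplicate group is the one the reranker scored highest.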
Per 1K chunks at ~400 tokens each, gpt-4o-mini + text-embedding-3-small + Cohere Rerank 3.5:
| Stage | Default mode (IRCoT, no KG) | KG mode (--build-graph --kg-retrieval) |
|---|---|---|
| Ingest cost | ~$0.008 | ~$0.46 (with synthesis) / ~$0.38 (without) |
| Ingest latency | ~15 sec | ~12 min (with synthesis) / ~8.5 min (without) |
| Query cost | ~$0.003 | ~$0.004 |
| Query latency | ~3-5 sec | ~4-6 sec |
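A back-of-the-envelope budget can be derived from the table above. The sketch below hard-codes the approximate figures from the table and assumes that ingest cost scales linearly per 1K chunks and that the query cost row is per query; both are interpretations of the table, not documented guarantees, and `estimate_usd` is a hypothetical helper.

```python
# Approximate figures copied from the cost table (USD).
INGEST_USD_PER_1K_CHUNKS = {
    "default": 0.008,       # IRCoT, no KG
    "kg": 0.38,             # KG mode without synthesis
    "kg+synthesis": 0.46,   # KG mode with synthesis
}
QUERY_USD = {"default": 0.003, "kg": 0.004, "kg+synthesis": 0.004}


def estimate_usd(mode: str, n_chunks: int, n_queries: int) -> float:
    """Rough total cost: linear ingest cost plus per-query cost (assumed)."""
    ingest = INGEST_USD_PER_1K_CHUNKS[mode] * n_chunks / 1000
    queries = QUERY_USD[mode] * n_queries
    return ingest + queries


# Example: 50K chunks ingested once, then 10K queries served.
budget_default = estimate_usd("default", 50_000, 10_000)
budget_kg = estimate_usd("kg+synthesis", 50_000, 10_000)
```

Under these assumptions the one-time ingest dominates the difference between modes, while the per-query costs stay close.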
| System | Mechanism | Where they fit |
|---|---|---|
| Cohere Rerank, Voyage Rerank | Cross-encoder rerank over the retriever's top-K | Engram uses Cohere Rerank — a building block, not a competitor |
| GBrain (github.com/garrytan/gbrain) | Production hybrid retrieval: BM25 + dense + RRF + cross-encoder rerank, intent classification, mode bundles, structural code-edge walk | The production-grade hybrid retrieval reference. Engram extends it with iterative retrieve-then-reason (IRCoT) and an entity/fact KG |
| HippoRAG 2 | OpenIE triples + synonym edges + PPR + recognition gate | Closest conceptually to Engram's KG mode |
| PropRAG | n-ary propositions + two-stage PPR + beam | Two-stage PPR is ported (engram.core.kg_retrieval.two_stage_ppr_facts). Propositions are deferred |
| LangChain, LlamaIndex | RAG framework + integrations | Engram is a focused library; could be wrapped by either |
| Path | What's there |
|---|---|
| src/engram/core/models.py | Pydantic v2: Chunk, EnrichedChunk, Fact, EntityRecord, Contradiction, CrossReference |
| src/engram/core/protocol.py | CorpusBackend, LLMProvider, Embedder (async-first protocols) |
| src/engram/core/scoring.py | deduplicate_chunks (Jaccard 0.70) |
| src/engram/core/entities.py | normalize_entity_name, entities_match_fuzzy, case_variants, is_literal_value |
| src/engram/core/retrieval.py | TraversalConfig, merge_fact_strategies (RRF), dynamic_chunk_cutoff |
| src/engram/core/kg_retrieval.py | triple_match, ppr_facts, two_stage_ppr_facts, beam_search_facts, facts_to_chunk_ids |
| src/engram/backends/memory.py | MemoryBackend: LMDB + hnswlib + in-memory networkx graph |
| src/engram/dialogue/strategic_router.py | Adaptive router: decide_strategy + dependency resolution |
| src/engram/dialogue/prompts/strategic_plan.py | StrategicPlan schema + the depth/breadth routing prompt |
| src/engram/dialogue/orchestrator.py | Engram orchestrator: hot path + cold path |
| src/engram/dialogue/extraction.py | Batched entity + fact extraction (Instructor strict-mode) |
| src/engram/dialogue/prompts/extraction.py | Prompts with explicit pronoun/coref resolution |
| src/engram/dialogue/temporal.py | Bi-temporal conflict detection + supersession |
| src/engram/dialogue/contradiction.py | Noisy-OR confidence fusion |
| src/engram/llm/litellm_provider.py | LLMProvider over LiteLLM + Instructor |
| src/engram/llm/embedders.py | LiteLLMEmbedder, OllamaEmbedder |
| src/engram/llm/rate_control.py | TokenBucket + AdaptiveConcurrency |
| benchmarks/musique.py | CLI for the MuSiQue ablation benchmark (--adaptive for the router) |
| benchmarks/custom_kb.py | Bring-your-own-corpus comparison: standard vs static vs adaptive |
| benchmarks/runner.py | Query pipeline incl. IRCoT 2-round + kg_hybrid_neighbors |
| benchmarks/retrieval.py | hybrid_neighbors, kg_hybrid_neighbors, RRF, BM25, dedup wiring |
| benchmarks/fixtures/ | Pinned MuSiQue question-id JSON files (n=100, n=200) |
| Surface | Module | Status |
|---|---|---|
| Engram class (.enrich, .aenrich) | engram | Stable |
| MemoryBackend | engram.backends.memory | Stable; LMDB schema versions tracked in CHANGELOG.md |
| CorpusBackend, LLMProvider, Embedder protocols | engram.core.protocol | Stable contracts |
| Chunk, EnrichedChunk, Fact, EntityRecord | engram.core.models | Stable Pydantic v2 |
| LiteLLMProvider, LiteLLMEmbedder | engram.llm.* | Stable |
| kg_hybrid_neighbors, IRCoT loop | benchmarks.runner | Promoted to engram.retrieve in v0.1.0; currently in benchmarks namespace |
| Track | State |
|---|---|
| KG-hybrid foundation (Phases 1-6) | shipped to main |
| IRCoT 2-round retrieval | shipped (opt-in via --ircot) |
| Two-stage PPR (PropRAG) + multi-hop beam search | shipped |
| Entity resolution + alias persistence + pronoun-aware extraction | shipped |
| Adaptive strategic router | shipped on strategic-router (opt-in via --adaptive) |
| Reactive escalation (round-2 from round-1 evidence signals, not upfront prediction) | next |
| Promote query-side API to library namespace | scheduled for v0.1.0 |
| Public PyPI release | scheduled for v0.1.0 |
| HopRAG pseudo-question edges, PropRAG propositions, doc-level preamble | deferred |
| Hosted Vrin API + SDK + MCP server | planning phase |
See CONTRIBUTING.md. All commits must carry a Signed-off-by trailer per the Developer Certificate of Origin.
Report vulnerabilities to vedant@vrin.cloud. See SECURITY.md.