Local LLM Guide

Use Engram's local LLM path when you want extraction, reranking, and selected helper flows to stay on an OpenAI-compatible endpoint you control.

This guide applies when modelSource is plugin (the default). If you switch to modelSource: "gateway", Engram sends extraction/consolidation/rerank calls to the configured gateway agent chain instead, and localLlm* settings no longer control the primary extraction path.

Fast Start

If you want the preset first:

{
  "memoryOsPreset": "local-llm-heavy",
  "localLlmEnabled": true,
  "localLlmUrl": "http://localhost:1234/v1",
  "localLlmModel": "qwen2.5-32b-instruct",
  "localLlmFastEnabled": true,
  "localLlmFastModel": "qwen2.5-7b-instruct"
}

The preset seeds the broader advanced surface, but your explicit localLlm* values still win.

Recommended Split

Use the primary local model for slower tasks:

extraction
consolidation
profile / identity consolidation

Use the fast local tier for short-turn helpers:

rerank
entity summary
temporal-memory summaries
compression-guideline refinement

Key Settings

Setting	Why it matters
`localLlmEnabled`	Master switch for Engram's local inference path while `modelSource=plugin`
`localLlmUrl`	Base URL for the OpenAI-compatible endpoint
`localLlmModel`	Main local model ID
`localLlmFastEnabled`	Enables the smaller/faster local tier
`localLlmFastModel`	Model ID for the fast tier
`localLlmFallback`	If `true`, fail open to the gateway/cloud chain when the local endpoint is unavailable
`localLlmTimeoutMs`	Upper bound for heavier local requests
`localLlmFastTimeoutMs`	Lower bound for fast helper requests
`embeddingFallbackProvider`	Use `local` if you also want embedding fallback to stay local

Operational Notes

Keep localLlmFallback=true during bring-up unless you explicitly want hard failures.
If the local server reports a smaller context window than the model supports, set localLlmMaxContext.
If reranking feels slow, lower rerankTimeoutMs before changing the main extraction timeout.
The local path is fail-open by default. If the endpoint disappears, Engram should degrade rather than block the gateway.

Custom OpenAI-Compatible Endpoint (Self-Hosted)

If you run a single OpenAI-compatible server (vLLM, Ollama, LM Studio, etc.) that serves both chat completions and embeddings, you can point everything at it with just openaiBaseUrl:

{
  "openaiBaseUrl": "http://localhost:8005/v1",
  "openaiApiKey": "dummy",
  "embeddingFallbackEnabled": true,
  "embeddingFallbackProvider": "openai"
}

Engram routes extraction, consolidation, and embedding requests to your endpoint. The embedding path appends /embeddings to openaiBaseUrl automatically.

Separate Chat and Embedding Models

If your chat model and embedding model live on different servers, use openaiBaseUrl for chat and the localLlm* settings for embeddings:

{
  "openaiBaseUrl": "http://localhost:8005/v1",
  "openaiApiKey": "dummy",
  "localLlmEnabled": true,
  "localLlmUrl": "http://localhost:8006/v1",
  "localLlmModel": "bge-m3",
  "localLlmApiKey": "dummy",
  "embeddingFallbackEnabled": true,
  "embeddingFallbackProvider": "local"
}

Docker Networking

When running OpenClaw in Docker and the LLM server on the host, use host.docker.internal instead of localhost:

{
  "openaiBaseUrl": "http://host.docker.internal:8005/v1",
  "openaiApiKey": "dummy"
}

When Not To Use It

If you do not have a stable local endpoint yet, start with memoryOsPreset: "balanced".
If you want the lowest moving-part count, start with conservative and leave local inference disabled until the rest of the install is stable.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Local LLM Guide

Fast Start

Recommended Split

Key Settings

Operational Notes

Custom OpenAI-Compatible Endpoint (Self-Hosted)

Separate Chat and Embedding Models

Docker Networking

When Not To Use It

Uh oh!

FilesExpand file tree

local-llm.md

Latest commit

History

local-llm.md

File metadata and controls

Local LLM Guide

Fast Start

Recommended Split

Key Settings

Operational Notes

Custom OpenAI-Compatible Endpoint (Self-Hosted)

Separate Chat and Embedding Models

Docker Networking

When Not To Use It