An independent LangChain integration package connecting PyMuPDF4LLM to LangChain as a document loader.
langchain-pymupdf4llm integrates PyMuPDF4LLM with LangChain as a document
loader. It extracts PDF content into Markdown for LLM and retrieval-augmented
generation workflows.
PyMuPDF4LLM provides Markdown extraction for standard text, tables, headers, lists, code blocks, multi-column pages, images, and vector graphics.
This integration adds LangChain loader and parser APIs, including optional image description replacement when an image parser is provided.
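As a rough illustration of handling the loader's output downstream, the sketch below splits one Markdown string back into per-page chunks. It assumes that `mode="single"` joins pages with the `pages_delimiter` value (`"\n\f"` in the usage example in this README); `split_pages` is a hypothetical helper and not part of this package.

```python
# Hypothetical helper, not part of langchain-pymupdf4llm.
# Assumption: in single mode the loader joins pages with pages_delimiter.
PAGES_DELIMITER = "\n\f"  # same value as in the usage example


def split_pages(markdown: str, delimiter: str = PAGES_DELIMITER) -> list[str]:
    """Split single-mode Markdown into per-page strings, dropping empty pages."""
    if not delimiter:
        raise ValueError("delimiter must be a non-empty string")
    # Strip surrounding whitespace and skip chunks that end up empty.
    return [page.strip() for page in markdown.split(delimiter) if page.strip()]


# Stand-in for docs[0].page_content; no PDF is needed to try the helper.
sample = "# Title\n\nFirst page text.\n\fSecond page text.\n\f"
for number, page in enumerate(split_pages(sample), start=1):
    print(number, repr(page))
```

If per-page documents are what you need, loading with `mode="page"` is likely simpler than splitting afterwards.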
- Python 3.10 or higher
- LangChain Core v1.0.0 or higher
- PyMuPDF4LLM v1.27.2.1 or higher
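The minimums above can be checked against the current environment with a short script. This is a minimal sketch using only the standard library. The helper names are hypothetical, it assumes the PyPI distribution names `langchain-core` and `pymupdf4llm`, and it ignores pre-release suffixes such as `rc1`.

```python
import re
import sys
from importlib import metadata

# Minimum versions copied from the requirements list above.
# Assumption: these are the PyPI distribution names.
MINIMUMS = {"langchain-core": "1.0.0", "pymupdf4llm": "1.27.2.1"}


def version_tuple(version: str) -> tuple[int, ...]:
    """Turn '1.27.2.1' into (1, 27, 2, 1); stop at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare numerically, padding the shorter version with zeros."""
    have, need = version_tuple(installed), version_tuple(minimum)
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


def check_environment() -> dict[str, str]:
    """Return a short status string for Python and each required package."""
    report = {"python": "ok" if sys.version_info >= (3, 10) else "too old"}
    for dist, minimum in MINIMUMS.items():
        try:
            installed = metadata.version(dist)
        except metadata.PackageNotFoundError:
            report[dist] = "not installed"
            continue
        ok = meets_minimum(installed, minimum)
        report[dist] = "ok" if ok else f"too old ({installed})"
    return report


for name, status in check_environment().items():
    print(f"{name}: {status}")
```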
Install the package using pip:
pip install -U langchain-pymupdf4llm
Before installing, make sure the AGPL/commercial licensing model of the PyMuPDF stack works for your use case.
For optional image parsing capabilities, you may also want to install:
pip install langchain-community
from langchain_pymupdf4llm import PyMuPDF4LLMLoader
loader = PyMuPDF4LLMLoader(
    file_path="/path/to/input.pdf",
    mode="single",
    pages_delimiter="\n\f"
)
docs = loader.load()
print(docs[0].page_content[:100])
print(docs[0].metadata)
Use lazy_load() to stream documents:
for doc in loader.lazy_load():
    print(doc.metadata)
Use the parser with LangChain blob loaders:
from langchain_community.document_loaders import FileSystemBlobLoader
from langchain_community.document_loaders.generic import GenericLoader
from langchain_pymupdf4llm import PyMuPDF4LLMParser
loader = GenericLoader(
    blob_loader=FileSystemBlobLoader(path="path/to/docs/", glob="*.pdf"),
    blob_parser=PyMuPDF4LLMParser(),
)
You can use the LangChain Community LLMImageBlobParser together with a model to describe extracted images instead of referencing them by filename.
For example:
from langchain_pymupdf4llm import PyMuPDF4LLMLoader
from langchain_community.document_loaders.parsers import LLMImageBlobParser
from langchain_openai import ChatOpenAI
loader = PyMuPDF4LLMLoader(
    "test.pdf",
    mode="page",
    extract_images=True,
    images_parser=LLMImageBlobParser(
        model=ChatOpenAI(model="gpt-5.5", max_tokens=1024),
        prompt="Describe the content of each image in a few sentences."
    ),
)
docs = loader.load()
print(docs[0].page_content[0:])
Open the workspace in the devcontainer, then install dependencies manually:
uv sync --group dev --group test --group lint --group typing
Install lightweight pre-commit hooks for formatting and hygiene checks:
uv run pre-commit install
Common commands are available as Cursor/VS Code tasks:
- uv sync
- test
- coverage
- lint
- format
- typecheck
- jupyter
JupyterLab is configured as a foreground task on port 8888. It does not start automatically when the container starts.
Run checks locally:
uv run --group test python -m pytest
uv run pytest --cov=src/langchain_pymupdf4llm --cov-report=term-missing --cov-fail-under=90
uv run black --check .
uv run ruff check .
uv run mypy .
uv run pre-commit run --all-files
The default pytest run disables sockets and skips tests marked network. To run
network tests explicitly:
uv run --group test python -m pytest --force-enable-socket -m network
To recreate the example PDF documents from LaTeX with deterministic PDF metadata:
cd ./tests/examples
SOURCE_DATE_EPOCH=1704067200 FORCE_SOURCE_DATE=1 pdflatex -interaction=nonstopmode sample_1.tex
Start JupyterLab from the devcontainer:
uv run jupyter lab --ip 0.0.0.0 --port 8888 --no-browser
This package depends directly on pymupdf4llm / pymupdf, which are published
by Artifex under AGPL/commercial terms. Because this integration wraps that
stack directly, this repository is distributed under AGPL-3.0-only.
PyMuPDF4LLM and PyMuPDF are maintained by Artifex Software, Inc.
- Open source — GNU AGPL v3. Free for open-source projects.
- Commercial — separate commercial licences available from Artifex for proprietary applications.
Contributions are welcome. Please open an issue before submitting large pull requests.
If you find this useful, please consider giving it a star — it helps others discover it!