langchain-pymupdf4llm

An independent LangChain integration package connecting PyMuPDF4LLM to LangChain as a document loader.

Introduction

langchain-pymupdf4llm integrates PyMuPDF4LLM with LangChain as a document loader. It extracts PDF content into Markdown for LLM and retrieval-augmented generation workflows.

Features

PyMuPDF4LLM provides Markdown extraction for standard text, tables, headers, lists, code blocks, multi-column pages, images, and vector graphics.

This integration adds LangChain loader and parser APIs, including optional image description replacement when an image parser is provided.

Requirements

  • Python 3.10 or higher
  • LangChain Core v1.0.0 or higher
  • PyMuPDF4LLM v1.27.2.1 or higher

Installation

Install the package using pip:

pip install -U langchain-pymupdf4llm

Before installing, make sure the AGPL/commercial licensing model of the PyMuPDF stack works for your use case.

For optional image parsing capabilities, you may also want to install:

pip install langchain-community

Usage

from langchain_pymupdf4llm import PyMuPDF4LLMLoader

loader = PyMuPDF4LLMLoader(
    file_path="/path/to/input.pdf",
    mode="single",
    pages_delimiter="\n\f"
)

docs = loader.load()
print(docs[0].page_content[:100])
print(docs[0].metadata)
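
With mode="single" the loader returns one document for the whole file, and pages_delimiter is presumably the string placed between consecutive pages. Assuming that behaviour, the sketch below recovers per-page text from the combined content. The split_pages helper is hypothetical, and the sample string stands in for docs[0].page_content.

```python
PAGES_DELIMITER = "\n\f"  # same value passed to the loader above


def split_pages(content: str, delimiter: str = PAGES_DELIMITER) -> list[str]:
    """Split single-mode content back into per-page strings.

    Hypothetical helper; assumes the delimiter occurs only between pages.
    """
    if not delimiter:
        raise ValueError("delimiter must be a non-empty string")
    return content.split(delimiter)


# Stand-in for docs[0].page_content from a two-page PDF.
sample = "# Page one\n\nSome text." + PAGES_DELIMITER + "# Page two\n\nMore text."
pages = split_pages(sample)
for number, text in enumerate(pages, start=1):
    print(f"page {number}: {len(text)} characters")
```

If per-page documents are what you need in the first place, mode="page" (used in the image example further down) is likely the simpler route.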

Use lazy_load() to stream documents:

for doc in loader.lazy_load():
    print(doc.metadata)
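
Because lazy_load() yields documents one at a time, they can be consumed in fixed-size batches without holding the whole set in memory. The following is a minimal sketch of that pattern; batched is a hypothetical helper, not part of this package, and the range(5) input stands in for the document stream.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    # islice pulls only `size` items at a time, so the source stays lazy.
    while batch := list(islice(iterator, size)):
        yield batch


# Stand-in for loader.lazy_load(): any iterator works the same way.
batches = list(batched(range(5), 2))
print(batches)  # [[0, 1], [2, 3], [4]]
```

With a real loader this would read `for batch in batched(loader.lazy_load(), 16): ...`, where 16 is an arbitrary batch size.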

Use the parser with LangChain blob loaders:

from langchain_community.document_loaders import FileSystemBlobLoader
from langchain_community.document_loaders.generic import GenericLoader
from langchain_pymupdf4llm import PyMuPDF4LLMParser

loader = GenericLoader(
    blob_loader=FileSystemBlobLoader(path="path/to/docs/", glob="*.pdf"),
    blob_parser=PyMuPDF4LLMParser(),
)
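
A GenericLoader built this way can return documents from many PDFs in one stream, so it is often useful to regroup them by originating file. The sketch below assumes each document's metadata carries a "source" key, which is common for LangChain loaders but not confirmed here for this parser. StubDocument and group_by_source are hypothetical names used only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Iterable


@dataclass
class StubDocument:
    """Hypothetical stand-in exposing the two attributes used below."""

    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def group_by_source(docs: Iterable[Any]) -> dict[str, list[Any]]:
    """Group documents by metadata["source"] (assumed key)."""
    grouped: dict[str, list[Any]] = defaultdict(list)
    for doc in docs:
        grouped[doc.metadata.get("source", "<unknown>")].append(doc)
    return dict(grouped)


# Stand-in for the documents produced by loader.lazy_load().
stub_docs = [
    StubDocument("first", {"source": "a.pdf"}),
    StubDocument("second", {"source": "b.pdf"}),
    StubDocument("third", {"source": "a.pdf"}),
]
grouped = group_by_source(stub_docs)
for source, items in grouped.items():
    print(source, len(items))
```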

Image options

You can use the LangChain Community LLMImageBlobParser together with a model to describe extracted images instead of referencing them by filename.

For example:

from langchain_pymupdf4llm import PyMuPDF4LLMLoader
from langchain_community.document_loaders.parsers import LLMImageBlobParser
from langchain_openai import ChatOpenAI

loader = PyMuPDF4LLMLoader(
    "test.pdf",
    mode="page",
    extract_images=True,
    images_parser=LLMImageBlobParser(
        model=ChatOpenAI(model="gpt-5.5", max_tokens=1024),
        prompt="Describe the content of each image in a few sentences."
    ),
)
docs = loader.load()

print(docs[0].page_content)

Development

Open the workspace in the devcontainer, then install dependencies manually:

uv sync --group dev --group test --group lint --group typing

Install lightweight pre-commit hooks for formatting and hygiene checks:

uv run pre-commit install

Common commands are available as Cursor/VS Code tasks:

  • uv sync
  • test
  • coverage
  • lint
  • format
  • typecheck
  • jupyter

JupyterLab is configured as a foreground task on port 8888. It does not start automatically when the container starts.

Run checks locally:

uv run --group test python -m pytest
uv run pytest --cov=src/langchain_pymupdf4llm --cov-report=term-missing --cov-fail-under=90
uv run black --check .
uv run ruff check .
uv run mypy .
uv run pre-commit run --all-files

The default pytest run disables sockets and skips tests marked network. To run network tests explicitly:

uv run --group test python -m pytest --force-enable-socket -m network

Creating Test Documents

To recreate the example PDF documents from LaTeX with deterministic PDF metadata:

cd ./tests/examples
SOURCE_DATE_EPOCH=1704067200 FORCE_SOURCE_DATE=1 pdflatex -interaction=nonstopmode sample_1.tex
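
SOURCE_DATE_EPOCH is a Unix timestamp in seconds. As a quick check of which date the value above pins the PDF metadata to, it can be converted to UTC:

```python
from datetime import datetime, timezone

SOURCE_DATE_EPOCH = 1704067200  # value used in the pdflatex command above

# Interpret the timestamp as seconds since 1970-01-01 00:00:00 UTC.
build_date = datetime.fromtimestamp(SOURCE_DATE_EPOCH, tz=timezone.utc)
print(build_date.isoformat())  # 2024-01-01T00:00:00+00:00
```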

Jupyter Notebooks

Start JupyterLab from the devcontainer:

uv run jupyter lab --ip 0.0.0.0 --port 8888 --no-browser

Licensing

This package depends directly on pymupdf4llm / pymupdf, which are published by Artifex under AGPL/commercial terms. Because this integration wraps that stack directly, this repository is distributed under AGPL-3.0-only.

PyMuPDF4LLM and PyMuPDF are maintained by Artifex Software, Inc.

  • Open source: GNU AGPL v3. Free for open-source projects.
  • Commercial: separate commercial licences are available from Artifex for proprietary applications.

Contributing

Contributions are welcome. Please open an issue before submitting large pull requests.

⭐ Support this project

If you find this useful, please consider giving it a star — it helps others discover it!
