Archivum

Archivum (Latin for archive) is a personal document and reference management system for academic papers, books, reports, and technical documents. It is built around content-addressable document identity: files are tracked by hash, bibliographic entries are tracked by tag, and the relationship between them is stored explicitly.

The project is optimized for a local, production document library rather than a public multi-user service. Treat the configured libraries and document store as real data.

What It Does

Manages self-contained libraries under local app data.
Stores metadata in Feather files for fast pandas access:
- ref.feather: bibliographic reference metadata.
- doc.feather: document file metadata, hashes, versions, and paths.
- ref-doc.feather: links between reference tags and document hashes.
Synchronizes reference metadata with BibTeX.
Organizes documents in a sharded content-addressable document store.
Provides a Click CLI, an Uber Shell interactive interface, and a Flask web UI.
Supports full-text search through ripgrep over extracted document text.
Provides network and semantic discovery over a universe of references selected by a query or a ripgrep search.

Requirements

PowerShell 7 (pwsh) on Windows for commands in this README.
Python >=3.13.
uv for dependency and environment management.
ripgrep (rg) for full-text search.
Optional external tools for report rendering and document processing, depending on workflow: Pandoc, Quarto, Tectonic, PDF/DJVU tooling.

Install or refresh the Python environment:

uv sync --extra test

Check the installed CLI:

uv run archivum --help

Quick Start

Start the Uber shell with the default library open:

uv run archivum uber -a

The prompt shows the active library. Run library-aware commands there:

stats
status
library-config

Open a specific library when starting Uber:

uv run archivum uber -l uber-library

Launch the web interface:

uv run archivum serve -b

By default the server listens on http://127.0.0.1:9124. Use --port, --address, --debug, and --prod for alternate launch modes:

uv run archivum serve uber-library --port 9124 --address 127.0.0.1 -b
uv run archivum serve uber-library --prod --address 0.0.0.0
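
To confirm that a launched server is accepting connections, a small standard-library check can help. This is a sketch and not part of Archivum: the helper name is_listening is hypothetical, and the defaults simply reuse the address and port stated above.

```python
import socket


def is_listening(host: str = "127.0.0.1", port: int = 9124, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Calling is_listening() with no arguments probes the default address; pass another port if the server was started with --port.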

Core Workflows

Find And Read

Search references with the query engine:

q title ~ /risk measure/ top 20
query

Find by tag, title, or file hash:

tag Wang2024 -o
title "spectral risk"
hash 100f150a -o

Search extracted document text with ripgrep:

rg "spectral risk measure"
rg "capital allocation" -i
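
For scripting outside the Uber shell, ripgrep can also be called directly from Python. The sketch below is an assumption about how one might do that, not a description of how Archivum invokes it: the function names are hypothetical, and text_dir stands for wherever the extracted text lives in a given library.

```python
import shutil
import subprocess
from pathlib import Path


def build_rg_command(pattern: str, text_dir: Path, ignore_case: bool = False) -> list[str]:
    """Build an rg invocation that lists files whose text matches pattern."""
    cmd = ["rg", "--files-with-matches"]
    if ignore_case:
        cmd.append("--ignore-case")
    # "--regexp" keeps patterns starting with "-" from being parsed as flags.
    cmd += ["--regexp", pattern, str(text_dir)]
    return cmd


def matching_files(pattern: str, text_dir: Path, ignore_case: bool = False) -> list[str]:
    """Run rg and return matching file paths; requires rg on PATH."""
    if shutil.which("rg") is None:
        raise RuntimeError("ripgrep (rg) was not found on PATH")
    result = subprocess.run(
        build_rg_command(pattern, text_dir, ignore_case),
        capture_output=True,
        text=True,
        check=False,
    )
    # rg exits with 1 when nothing matches, so only codes above 1 are errors.
    if result.returncode > 1:
        raise RuntimeError(result.stderr.strip())
    return result.stdout.splitlines()
```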

Check whether a local file already exists in the library:

find-doc "C:\path\to\paper.pdf"

Import Documents

Stage a folder of new documents. This hashes files, checks for duplicates, extracts metadata, and writes a review BibTeX file.

stage-docs "C:\S\PDFs\Batch6"
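
The hashing and duplicate check performed during staging can be pictured roughly as follows. This is a minimal sketch under stated assumptions: SHA-256 is a stand-in for whatever hash Archivum actually uses, and classify_staged_files and known_hashes are hypothetical names.

```python
import hashlib
from pathlib import Path


def file_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file's bytes in chunks so large PDFs are not loaded whole."""
    digest = hashlib.sha256()  # assumption: the real hash function may differ
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def classify_staged_files(folder: Path, known_hashes: set[str]) -> dict[str, list[Path]]:
    """Split the files in folder into new documents and duplicates."""
    result: dict[str, list[Path]] = {"new": [], "duplicate": []}
    seen = set(known_hashes)
    for path in sorted(p for p in folder.iterdir() if p.is_file()):
        digest = file_hash(path)
        # A file is a duplicate if the library or this batch already has its hash.
        bucket = "duplicate" if digest in seen else "new"
        result[bucket].append(path)
        seen.add(digest)
    return result
```

Because identity comes from content rather than file name, two differently named copies of the same PDF land in the duplicate bucket.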

Import a reviewed BibTeX file. Without -x, this is a dry run.

import-bibtex "C:\S\PDFs\Batch6\bibtex-import.bib"
import-bibtex "C:\S\PDFs\Batch6\bibtex-import.bib" -x
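
The dry-run-by-default behavior follows a common pattern: build the plan first, and write only when execution is requested. The sketch below illustrates that pattern in isolation; ImportPlan, run_import, and the tag Smith2023 are hypothetical and do not reflect the actual importer.

```python
from dataclasses import dataclass, field


@dataclass
class ImportPlan:
    """Collects intended changes so they can be reviewed before being applied."""

    actions: list[str] = field(default_factory=list)

    def add(self, description: str) -> None:
        self.actions.append(description)


def run_import(tags: list[str], library: set[str], execute: bool = False) -> ImportPlan:
    """Plan the import of tags; mutate library only when execute is True."""
    plan = ImportPlan()
    for tag in tags:
        if tag in library:
            plan.add(f"skip {tag}: already present")
            continue
        plan.add(f"add {tag}")
        if execute:
            library.add(tag)
    return plan


library = {"Wang2024"}
dry = run_import(["Wang2024", "Smith2023"], library)
print(dry.actions)  # the library set is unchanged after a dry run
```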

Useful import options:

stage-docs -nf: skip duplicate checking during staging.
stage-docs -d: delete duplicate files from the source folder if found.
import-bibtex -p <dir>: point to the directory containing referenced docs.
import-bibtex -nt: skip automatic text extraction.
import-bibtex -v, -vv, -vvv: increase diagnostics.

Link existing library objects when needed:

link-doc 100f150a -x
link-tag-hash Wang2024 100f150a

Maintain A Library

Audit structure without changing data:

library-audit -v

Validate and optionally fix specific structural issues:

library-validate --task sharding
library-validate --task sharding -x
library-validate --task orphans -x
library-validate --task missing

Manage extracted text:

extract-text --help

Edit or delete references deliberately:

edit-tag Wang2024
delete-tag Wang2024

Web Interface

The web interface is launched by archivum serve and is built with Flask, Bootstrap 5, and HTMX-style incremental updates. It is intended as the primary day-to-day interface for searching, reading, ingesting, and exploring the library.

Current screens include:

Query: fast metadata search with recent/read/random shortcuts, list/table/verbose modes, CSV export, report handoff, and live interpretation feedback for plain-text fuzzy searches versus explicit querexfuzz expressions.
Ripgrep: streaming full-text search over extracted document text, summary and detail views, context controls, caching, and CSV export.
Authors: author-centered bibliography browsing.
Edit: admin metadata editor for BibTeX-backed tag updates.
Ingest: admin workbench for staging a document, editing source BibTeX, previewing the real importer result, and committing to the library.
Reports: Report Studio for persistent .qmd research reports, Pandoc HTML, Quarto PDF generation, and cached artifacts.
Network: co-author social graphs and semantic discovery over selected query or ripgrep universes.
Status: library identity, database counts, file sync state, and watcher state.
Help: in-app usage notes for query syntax and screen behavior.
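
The co-author graph shown on the Network screen can be thought of as weighted edges between authors who share a reference. The following is an illustrative sketch only: coauthor_edges is a hypothetical helper, and how Archivum actually builds or weights the graph is not documented here.

```python
from collections import Counter
from itertools import combinations


def coauthor_edges(author_lists: list[list[str]]) -> Counter[tuple[str, str]]:
    """Count how often each unordered pair of authors shares a reference."""
    edges: Counter[tuple[str, str]] = Counter()
    for authors in author_lists:
        # Sort and deduplicate so (A, B) and (B, A) map to one edge.
        for pair in combinations(sorted(set(authors)), 2):
            edges[pair] += 1
    return edges
```

Each inner list holds the authors of one reference in the selected universe; single-author references contribute no edges.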

The app uses split-horizon access control. Local or trusted traffic can receive admin capabilities, while external access can be restricted to read-only mode.
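
One way to picture split-horizon access control is a check on the client address. This sketch is an assumption about the general idea, not the app's implementation: the trusted networks listed and the access_level function are hypothetical.

```python
import ipaddress

# Assumption: loopback and one private range count as trusted in this sketch.
TRUSTED_NETWORKS = [
    ipaddress.ip_network("127.0.0.0/8"),
    ipaddress.ip_network("::1/128"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def access_level(remote_addr: str) -> str:
    """Return "admin" for trusted client addresses, otherwise "read-only"."""
    try:
        address = ipaddress.ip_address(remote_addr)
    except ValueError:
        return "read-only"  # unparseable addresses get the restricted mode
    # Membership tests across IP versions simply evaluate to False.
    if any(address in network for network in TRUSTED_NETWORKS):
        return "admin"
    return "read-only"
```

Defaulting to read-only on any doubt keeps mistakes on the safe side when the server is bound to 0.0.0.0.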

Semantic discovery uses all-MiniLM-L6-v2, caches the transformer model under %LOCALAPPDATA%\archivum\models\sentence-transformers, and caches embeddings per library/source so repeated runs avoid unnecessary encoding.
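
The effect of caching embeddings per library and source can be sketched with an in-memory cache. This is a simplified illustration: EmbeddingCache is a hypothetical class, the encoder is injected so the sketch does not depend on the transformer model, and the on-disk cache format used by Archivum is not shown.

```python
import hashlib
from typing import Callable

Vector = list[float]


class EmbeddingCache:
    """Caches vectors by (library, source, text hash) to avoid re-encoding."""

    def __init__(self, encode: Callable[[str], Vector]) -> None:
        self._encode = encode  # e.g. a wrapper around a sentence-transformer model
        self._store: dict[tuple[str, str, str], Vector] = {}

    def get(self, library: str, source: str, text: str) -> Vector:
        key = (library, source, hashlib.sha256(text.encode("utf-8")).hexdigest())
        if key not in self._store:
            # Only texts not seen before for this library/source are encoded.
            self._store[key] = self._encode(text)
        return self._store[key]
```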

Data Model

Archivum separates references, documents, and links.

A reference is bibliographic metadata keyed by a tag such as Author2024.
A document is a physical file identified by content hash and version.
A ref-doc row links a tag to a document hash/version.
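
The three-part model can be sketched with plain Python objects. The real data lives in Feather files, so this is only a conceptual illustration: the class names, field names, and documents_for_tag helper are hypothetical and need not match the actual column names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    tag: str      # bibliographic key, e.g. "Author2024"
    title: str


@dataclass(frozen=True)
class Document:
    hash: str     # content hash of the physical file
    version: int
    path: str     # store-relative location


@dataclass(frozen=True)
class RefDoc:
    tag: str
    hash: str
    version: int


def documents_for_tag(tag: str, links: list[RefDoc], docs: list[Document]) -> list[Document]:
    """Follow ref-doc links from a tag to the documents they point at."""
    by_key = {(doc.hash, doc.version): doc for doc in docs}
    wanted = [(link.hash, link.version) for link in links if link.tag == tag]
    return [by_key[key] for key in wanted if key in by_key]
```

Keeping links in their own table means a reference can exist without a file, and a file can be stored before it is attached to any reference.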

The Library class in src/archivum/library.py is the central data access layer. It loads and saves the Feather files, resolves configured paths, synchronizes BibTeX, runs ripgrep over extracted text, and provides import and audit helpers.

Documents are stored in a sharded directory structure. Internal metadata uses relative paths where possible so libraries remain portable across machines and drive mappings.
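
A sharded, relative layout might look like the sketch below. The two-character shard prefix and the helper names shard_path and resolve are assumptions for illustration; the actual shard width and file naming are defined by the library code.

```python
from pathlib import PurePosixPath


def shard_path(content_hash: str, suffix: str, width: int = 2) -> PurePosixPath:
    """Return a store-relative path such as "10/100f150a.pdf"."""
    if len(content_hash) <= width:
        raise ValueError("hash is too short to shard")
    # The leading characters pick the shard directory (assumed layout).
    return PurePosixPath(content_hash[:width]) / f"{content_hash}{suffix}"


def resolve(store_root: str, relative: PurePosixPath) -> str:
    """Join a machine-specific root with the portable relative path."""
    return f"{store_root.rstrip('/')}/{relative}"
```

Only the root passed to resolve changes between machines; the relative path stored in metadata stays the same.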

Configuration

Global configuration lives under local app data:

%LOCALAPPDATA%\archivum\global-config.yaml

Library-specific configuration lives in each library directory:

%LOCALAPPDATA%\archivum\libraries\<library-name>\config.yaml
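
The two locations above can be resolved programmatically from the LOCALAPPDATA environment variable. This is a small sketch with hypothetical helper names; it only reproduces the paths stated in this section and does not read or validate the YAML.

```python
import os
from collections.abc import Mapping
from pathlib import PureWindowsPath


def archivum_root(env: Mapping[str, str] = os.environ) -> PureWindowsPath:
    """Return the archivum directory under local app data."""
    local_app_data = env.get("LOCALAPPDATA")
    if not local_app_data:
        raise RuntimeError("LOCALAPPDATA is not set")
    return PureWindowsPath(local_app_data) / "archivum"


def global_config_path(env: Mapping[str, str] = os.environ) -> PureWindowsPath:
    """Path of the global configuration file."""
    return archivum_root(env) / "global-config.yaml"


def library_config_path(name: str, env: Mapping[str, str] = os.environ) -> PureWindowsPath:
    """Path of the configuration file for the library called name."""
    return archivum_root(env) / "libraries" / name / "config.yaml"
```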

Important configured concepts:

default_library: library opened when no explicit name is supplied.
doc_store_lib: shared document store root.
bibtex_file: synchronized BibTeX path for a library.
query, enhancement, timezone, table, extractor, and tag-mapping defaults.

Use the CLI to inspect the active configuration:

library-config

Development And Tests

Use uv for dependency management:

uv lock
uv sync --extra test

pyproject.toml owns the package version. Bump project.version after every code change using semantic versioning; archivum.__version__ reads the installed package metadata at runtime.
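
Reading the version from installed package metadata is typically done with importlib.metadata. The sketch below shows that general technique; the function name package_version and the fallback value are assumptions, and the project's own __init__ may differ in detail.

```python
from importlib.metadata import PackageNotFoundError, version


def package_version(distribution: str = "archivum", fallback: str = "0.0.0") -> str:
    """Return the installed version of distribution, or fallback if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        # Happens when running from a source tree that was never installed.
        return fallback
```

Because the value comes from the installed metadata, a version bump in pyproject.toml only becomes visible after the environment is refreshed, for example with uv sync.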

Run focused web tests through the project runner:

.\scripts\Test-ArchivumWeb.ps1 -Mode Fast
.\scripts\Test-ArchivumWeb.ps1 -Mode Slow
.\scripts\Test-ArchivumWeb.ps1 -Mode All

The runner defaults to uv run --extra test pytest.

Direct pytest examples:

uv run --extra test pytest -q
uv run --extra test pytest -m "web and not slow and not browser"
uv run --extra test pytest -m "web and slow and not browser" --run-slow-web

Browser tests require Playwright browser support:

uv run --extra test python -m playwright install chromium
uv run --extra test pytest -m "web and browser" --run-browser --run-slow-web

Many web tests require a configured active Archivum library. Slow semantic tests reuse the model and embedding caches when they exist, so they take longer on a machine where those caches are still empty.

Safety Notes

Do not delete source documents unless the exact operation is intentional.
Be careful around hardlinks, sharded document storage, Feather files, and BibTeX synchronization.
Prefer dry runs first. Many commands require -x or --execute before they write changes.
Multiple libraries may share the same physical document store.
Keep web help in src/archivum/web/templates/help.html aligned with user-facing web behavior.

Related References

Query engine: querexfuzz
Full-text engine: ripgrep
Web app entry point: src/archivum/web/app.py
CLI entry point: src/archivum/cli.py
Core library model: src/archivum/library.py
