doc-cleaner


Everyday document-to-Markdown conversion — 16 formats spanning PDF, Office, Apple Keynote/Numbers, and EPUB e-books. CJK-friendly, table preservation, privacy-first, fully local.

Part of the notoriouslab open-source toolkit · Requires Python 3.9+

中文 README

Download Desktop App (no Python required)

| Platform | Download | Architecture |
| --- | --- | --- |
| macOS | Doc Cleaner-1.6.0.dmg | Universal (Intel + Apple Silicon) |
| Windows | Doc Cleaner-1.6.0.msi | x86_64 (runs on ARM Windows too) |

First launch on macOS: Right-click → Open (Ventura and earlier) or System Settings → Privacy & Security → Open Anyway (Sonoma/Sequoia)

Or run once in Terminal: xattr -cr /Applications/Doc\ Cleaner.app


Core Positioning

doc-cleaner focuses on everyday document extraction — turning the files you meet daily into clean, readable Markdown: PDF, Word, Excel, PowerPoint, Apple Keynote/Numbers, and EPUB e-books, with CJK text intact, tables preserved, and everything processed locally.

Few free tools combine CJK-friendly + table preservation with coverage of Apple Keynote/Numbers and EPUB — that combination is doc-cleaner's strength.

Typical use cases:

  • 🖥️ Desktop App — Drag-and-drop files or whole folders, pick any output location, preview the result in-app; zero config, double-click on macOS/Windows (non-technical users)
  • 📊 Financial statements — Big5/CP950 auto-detected, transactions and numbers extracted intact
  • 🎬 Slides / e-books — Keynote slides and EPUB chapters extracted to Markdown in reading order
  • 📄 Batch multi-format — Mix 16 formats as input, unified Markdown output (CLI)
  • 🔒 Privacy-first: use --ai none for plain-text extraction or local Ollama; documents never leave your machine
  • 🤖 AI agent integration — OpenClaw and similar frameworks can shell-call it with SKILL.md support

Three Core Strengths

| Feature | Implementation |
| --- | --- |
| Most complete everyday formats | PDF, Office, Apple Keynote/Numbers, EPUB, DXF… 16 formats, one tool |
| CJK + tables intact | Big5/CP950/UTF-16 auto-detect; DOCX/XLSX/PDF tables → Markdown pipe tables, numbers preserved |
| Privacy & no-AI mode | --ai none for zero API keys and zero cloud; or local Ollama inference |

Quick Start

Option A: Desktop App (non-technical users)

Download the DMG (macOS) or MSI (Windows) from the table above — no Python required.

Option B: CLI (3 steps)

# 1. Clone
git clone https://github.com/notoriouslab/doc-cleaner.git && cd doc-cleaner

# 2. Install
pip install -r requirements.txt

# 3. Run
python cleaner.py --input ./documents/ --ai none

Output: one .md file in ./output/ per input file

Common Usage Paths

Path 1: Plain text extraction (no API key required)

# Simplest option, zero cost
python cleaner.py --input statement.pdf --ai none

Path 2: Gemini for higher quality (recommended cloud option)

cp .env.example .env
# Edit .env, add your GEMINI_API_KEY
python cleaner.py --input statement.pdf --ai gemini

Path 3: Local Ollama (privacy-first)

# Requires Ollama installed and running (see Ollama recommendations below)
python cleaner.py --input statement.pdf --ai ollama

Path 4: Preview before processing

python cleaner.py --input ./documents/ --dry-run --verbose

Optional Advanced Install

High-quality PDF table extraction, decryption, PPTX/DXF support, etc.:

# High-quality PDF extraction (recommended)
pip install opendataloader-pdf            # Requires Java 11+

# PDF vision mode (scanned PDFs)
pip install pdf2image                     # Also requires: brew install poppler

# PDF decryption
pip install pikepdf

# Extra formats (PPTX / DXF)
pip install python-pptx ezdxf

Set API keys (if using cloud backend):

cp config.example.json config.json
cp .env.example .env
# Edit .env with GEMINI_API_KEY or GROQ_API_KEY

Core Concepts

Smart PDF Triage

Not all PDFs are equal. doc-cleaner auto-classifies before processing:

| Type | Characteristics | Strategy |
| --- | --- | --- |
| Native text | char density ≥8, garbage <5%, short lines ≤70% | Direct extraction (fast, free) |
| Layout-broken | >70% short lines (tables crushed) | opendataloader-pdf table extraction / AI vision + text |
| Scanned images | char density <8 | PDF-to-image + AI vision |
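
The thresholds in the triage table can be read as a small decision function. The sketch below is an illustration only, not doc-cleaner's actual code: the function name, the argument names, the order of the checks, and the "unclassified" fallback are all assumptions, and the README does not define how char density, garbage ratio, or short-line ratio are measured.

```python
def classify_pdf(char_density: float, garbage_ratio: float, short_line_ratio: float) -> str:
    """Classify a PDF using the cut-offs from the triage table.

    How each metric is computed is not specified in the README; callers are
    assumed to supply ratios in the range 0.0 to 1.0 and a density per page.
    """
    if char_density < 8:
        return "scanned"        # too little extractable text: PDF-to-image + AI vision
    if short_line_ratio > 0.70:
        return "layout_broken"  # tables crushed into short lines
    if garbage_ratio < 0.05:
        return "native"         # clean text layer: direct extraction
    return "unclassified"       # hypothetical fallback, not part of the table


print(classify_pdf(42.0, 0.01, 0.30))  # native
print(classify_pdf(3.5, 0.00, 0.10))   # scanned
print(classify_pdf(20.0, 0.02, 0.85))  # layout_broken
```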

Cost-effective workflow:

# Step 1: Extract all in no-AI mode (fast, free, private)
python cleaner.py --input ./documents/ --ai none --output-dir ./output/raw

# Step 2: Re-process only scanned files with AI
python cleaner.py --input scanned.pdf --ai gemini

Ad Cleaning

Taiwan bank statement PDFs often contain investment notices, legal disclaimers, or promotions. Two mechanisms remove them:

| Mechanism | Behavior | Scenario |
| --- | --- | --- |
| Tail truncation | Remove everything after first match | End-of-document disclaimers |
| Inline removal | Remove individual matched paragraphs | Promotional blocks in the middle |

In config.json:

{
  "ad_truncation_patterns": ["謹慎理財.{0,20}信用至上"],
  "ad_strip_patterns": ["※運動賺回饋"]
}

Safety: if truncation would remove more than 70% of the content, it is skipped with a warning. All regex patterns are validated at startup.
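
To make the two mechanisms and the 70% guard concrete, here is a minimal sketch using the two example patterns from config.json. It is not the project's implementation: the function names are hypothetical, cutting at the start of the match (rather than after it) is an assumption, and paragraphs are assumed to be separated by blank lines.

```python
import re


def compile_patterns(patterns):
    """Compile every pattern up front so an invalid regex fails early (raises re.error)."""
    return [re.compile(p) for p in patterns]


def truncate_tail(text, patterns, max_removed=0.70):
    """Cut the text at the earliest match, unless that would remove too much."""
    if not text:
        return text
    starts = [m.start() for p in patterns if (m := p.search(text))]
    if not starts:
        return text
    cut = min(starts)
    if (len(text) - cut) / len(text) > max_removed:
        return text  # safety guard: skip truncation, keep the full text
    return text[:cut].rstrip()


def strip_paragraphs(text, patterns):
    """Drop each blank-line-separated paragraph that matches any pattern."""
    kept = [para for para in text.split("\n\n")
            if not any(p.search(para) for p in patterns)]
    return "\n\n".join(kept)


truncation = compile_patterns(["謹慎理財.{0,20}信用至上"])
inline = compile_patterns(["※運動賺回饋"])

doc = "交易明細\n\n※運動賺回饋 活動說明\n\n本期應繳 1,200 元\n\n謹慎理財 信用至上"
cleaned = strip_paragraphs(truncate_tail(doc, truncation), inline)
print(cleaned)
```

In this example the disclaimer at the end is truncated, the promotional paragraph in the middle is dropped, and the two statement paragraphs survive.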

Table Preservation

Tables are first-class citizens:

  • DOCX: python-docx extracts directly → Markdown pipe tables
  • XLSX/CSV: pandas.to_markdown() — all sheets preserved
  • PDF: opendataloader-pdf produces proper pipe tables (no AI needed)
  • AI prompt: explicitly instructs to keep existing tables unchanged
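
For readers unfamiliar with the target format, the following stdlib-only sketch shows the shape of a Markdown pipe table. It does not reproduce pandas.to_markdown() or any doc-cleaner internals; the helper name is hypothetical. Cells are passed as strings here so that numbers such as "1,200" come through character for character.

```python
def to_pipe_table(header, rows):
    """Render a header and rows as a Markdown pipe table, escaping literal pipes."""
    def fmt(cells):
        return "| " + " | ".join(str(c).replace("|", "\\|") for c in cells) + " |"

    lines = [fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines.extend(fmt(row) for row in rows)
    return "\n".join(lines)


table = to_pipe_table(
    ["Date", "Amount"],
    [["2024-01-05", "1,200"], ["2024-01-09", "-350"]],
)
print(table)
```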

Privacy and Security

| Option | Effect |
| --- | --- |
| --ai none | Zero API keys, zero cloud; local extraction only |
| --ai ollama | Local Ollama inference, documents stay on your machine |
| --ai gemini / --ai groq | Cloud inference, higher quality |

Other safeguards:

  • Atomic writes — temp file + os.replace(), no partial output
  • Secret isolation — API keys in .env only, startup validation
  • OOM protection — PDF vision defaults to 15 pages max
  • JSON fallback — if AI returns invalid JSON, degrades to raw text
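
The atomic-write safeguard follows a common pattern: write to a temporary file in the destination directory, then swap it into place with os.replace(). The sketch below shows that pattern in general form; the function name is hypothetical and the project's own code may differ in details such as cleanup or fsync.

```python
import os
import tempfile


def atomic_write(path, text):
    """Write text to a temp file next to path, then move it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory keeps the temp file on the same filesystem as the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        os.replace(tmp_path, path)  # readers see the old file or the new one, never half
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # leave no stray temp file behind on failure
        raise
```

Because the rename only happens after the write succeeds, an interrupted run leaves any previous output untouched.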

Advanced Reference

CLI Options

python cleaner.py [options]

  --input, -i       File or directory to process (required, non-recursive)
  --output-dir, -o  Output directory (default: ./output)
  --config          Config file path (default: <script-dir>/config.json)
  --ai              gemini | groq | ollama | none (default: config or gemini)
  --password        PDF decryption password (overrides .env and config)
  --summary         Output JSON summary to stdout (for scripts/agents)
  --dry-run         Preview without writing
  --verbose         Debug logging
  --version         Version info

Exit codes: 0 = success · 1 = partial failure · 2 = config error
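
A calling script can build the command from the options above and branch on the exit code. This is a small sketch with hypothetical helper names; it only uses flags and exit codes listed in this section, and it assumes the script is run from the repository root where cleaner.py lives.

```python
import sys

EXIT_MEANINGS = {0: "success", 1: "partial failure", 2: "config error"}


def build_command(input_path, ai="none", output_dir=None, summary=False):
    """Assemble the argv list for one doc-cleaner run."""
    cmd = [sys.executable, "cleaner.py", "--input", input_path, "--ai", ai]
    if output_dir:
        cmd += ["--output-dir", output_dir]
    if summary:
        cmd.append("--summary")
    return cmd


def describe_exit(code):
    """Translate an exit code into the meaning documented above."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


# Typical use (not executed here):
#   import subprocess
#   result = subprocess.run(build_command("./documents/", summary=True))
#   print(describe_exit(result.returncode))
print(build_command("statement.pdf", ai="ollama")[1:])
```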

Configuration (config.json)

{
  "ai": {
    "backend": "gemini",                      // default backend
    "prompt_template": "prompts/default.txt", // prompt path
    "gemini": { "model": "gemini-2.5-pro" },
    "groq": {
      "model": "meta-llama/llama-4-scout-17b-16e-instruct",
      "timeout": 120
    },
    "ollama": {
      "model": "qwen3.5:9b",
      "host": "http://localhost:11434"
    }
  },
  "pdf": {
    "dpi": 200,
    "max_pages": 15
  },
  "output": { "frontmatter": true },
  "ad_truncation_patterns": ["謹慎理財.{0,20}信用至上"],
  "ad_strip_patterns": ["※運動賺回饋"]
}

Secret management: API keys belong in .env, never in config.json. Validated at startup.

# .env example
GEMINI_API_KEY=...
GROQ_API_KEY=...
PDF_PASSWORD=...
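
A startup check of this kind might look like the sketch below. It is an assumption-laden illustration: the README does not say how doc-cleaner reads .env or what it validates, so the parser, the function names, and the backend-to-key mapping (inferred from the key names above) are all hypothetical.

```python
REQUIRED_KEYS = {"gemini": "GEMINI_API_KEY", "groq": "GROQ_API_KEY"}


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def missing_key(backend, env):
    """Return the name of the API key the backend lacks, or None if all is well."""
    required = REQUIRED_KEYS.get(backend)  # ollama and none need no key
    if required and not env.get(required):
        return required
    return None


env = parse_env("# .env example\nGEMINI_API_KEY=abc123\nGROQ_API_KEY=\n")
print(missing_key("gemini", env), missing_key("groq", env), missing_key("none", env))
```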

Custom AI Prompt Templates

doc-cleaner includes 2 templates:

| File | Purpose |
| --- | --- |
| prompts/default.txt | General document cleaning |
| prompts/finance.txt | Bank statements, financial reports |

Create your own: add a .txt file to prompts/. The template must instruct the AI to output JSON in this shape:

{
  "title": "Short title",
  "summary": "1-2 sentence summary",
  "refined_markdown": "Full cleaned Markdown",
  "tags": ["tag1", "tag2"]
}
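
The JSON fallback listed under Other safeguards pairs naturally with this schema. Below is a minimal sketch of parsing a response and degrading to raw text when the JSON is invalid or incomplete. The function name and the exact shape of the fallback dictionary are assumptions, not the project's code.

```python
import json

REQUIRED_FIELDS = ("title", "summary", "refined_markdown", "tags")


def parse_ai_response(raw):
    """Return the parsed response, or wrap the raw text if it is not valid JSON."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = None
    if isinstance(data, dict) and all(field in data for field in REQUIRED_FIELDS):
        return data
    # Fallback (hypothetical shape): keep the model's text as the Markdown body.
    return {"title": "", "summary": "", "refined_markdown": raw, "tags": []}


good = parse_ai_response(
    '{"title": "T", "summary": "S", "refined_markdown": "# T", "tags": ["a"]}'
)
bad = parse_ai_response("Sorry, here is the text without JSON")
print(good["title"], "|", bad["refined_markdown"])
```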

Ollama Model Recommendations

Table reconstruction is demanding, and small models struggle with it. If your machine has the resources, the qwen3.5 series supports vision natively:

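| Model | Size | Vision | Tables | CJK | Notes |
| --- | --- | --- | --- | --- | --- |
| qwen3.5:27b | 17 GB | Yes | Good | Excellent | Best results |
| qwen3.5:9b | 6.6 GB | Yes | Fair | Good | Default, balanced |
| qwen3.5:4b | 3.4 GB | Yes | Poor | Fair | Lightweight |
| qwen3:30b | 19 GB | No | Good | Excellent | MoE, fast, no vision |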

Recommendation: qwen3.5:9b handles scanned PDFs; qwen3:30b gives fast inference on native-text-only documents. Users with 8 GB of RAM should use --ai gemini or --ai none.

Supported Formats Reference

| Format | Parser | Tables | Notes |
| --- | --- | --- | --- |
| PDF (native) | PyMuPDF find_tables() / opendataloader-pdf | pipe tables | find_tables needs no extra install; ODL needs Java |
| PDF (scanned) | pdf2image → AI vision | AI rebuild | Needs poppler (optional) |
| PDF (encrypted) | pikepdf | pipe tables | Optional |
| DOCX | python-docx | pipe tables | Cross-platform |
| XLSX / XLS | pandas + xlrd | pipe tables | All sheets |
| CSV | pandas | pipe tables | Auto-encoding detection |
| PPTX | python-pptx | pipe tables | Slides + speaker notes |
| PPT | macOS textutil / LibreOffice | | macOS built-in; Windows needs LibreOffice |
| DOC | macOS textutil / LibreOffice | | macOS built-in; Windows needs LibreOffice |
| DXF | ezdxf | | Engineering: annotations, dimensions |
| TXT / MD | stdlib | | Big5/CP950/UTF-16 |
| JSONL | built-in | | Claude Code session transcript → Markdown |
| NUMBERS | numbers-parser | pipe tables | Apple spreadsheet, one section per table |
| KEY | keynote-parser | | Apple Keynote, one section per slide (IWA) |
| PAGES | QuickLook PDF | | Apple Pages; modern files need Export → PDF first |
| EPUB | built-in (lxml) | | E-book, one section per chapter, with title/author |
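
The TXT / MD and CSV rows mention encoding auto-detection for Big5/CP950/UTF-16. One common way to do this with the standard library is a fallback chain: check for a byte-order mark, try UTF-8, then try CP950. The sketch below shows that general technique; it is not taken from doc-cleaner, the function name is hypothetical, and it does not handle UTF-16 files that lack a BOM.

```python
import codecs


def decode_text(data: bytes):
    """Return (text, encoding) via a BOM check, then UTF-8, then CP950."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data.decode("utf-16"), "utf-16"
    if data.startswith(codecs.BOM_UTF8):
        return data.decode("utf-8-sig"), "utf-8-sig"
    for encoding in ("utf-8", "cp950"):
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: keep going with replacement characters rather than fail.
    return data.decode("utf-8", errors="replace"), "utf-8 (lossy)"


for sample in ("中文".encode("utf-8"), "中文".encode("cp950"), "中文".encode("utf-16")):
    print(decode_text(sample))
```

UTF-8 is tried before CP950 because valid UTF-8 is rarely valid by accident, whereas CP950 accepts many arbitrary byte sequences.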

Integration & Ecosystem

AI Agent Frameworks

doc-cleaner is a standard CLI — any AI agent framework can shell-call it. Ships with SKILL.md for OpenClaw.

# Agent example: process + JSON summary
python cleaner.py --input document.pdf --ai none --summary

--summary output:

{"version":"1.0.0","total":1,"success":1,"failed":0,"files":[{"file":"document.pdf","output":"./output/document.md","status":"ok"}]}
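
An agent or script can consume that summary line with the standard json module. The sketch below parses the sample output shown above and lists files that did not succeed. The helper name is hypothetical, and treating any status other than "ok" as a failure is an assumption, since the README only shows the "ok" case.

```python
import json


def failed_files(summary_line):
    """Return the input files whose status is anything other than "ok"."""
    summary = json.loads(summary_line)
    return [entry["file"] for entry in summary["files"] if entry["status"] != "ok"]


sample = (
    '{"version":"1.0.0","total":1,"success":1,"failed":0,'
    '"files":[{"file":"document.pdf","output":"./output/document.md","status":"ok"}]}'
)
print(failed_files(sample))  # []
```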

notoriouslab Pipeline

gmail-statement-fetcher  →  Auto-fetch PDFs from Gmail
          ↓
    doc-cleaner          →  PDF/DOCX/XLSX → structured Markdown
          ↓
   personal-cfo          →  Monthly audit + retirement planning (in development)

Each tool stands alone; together they form a complete personal finance pipeline.


Contributing

The easiest contributions:

  1. Add ad regex patterns for your bank — add rules to config.example.json
  2. Add prompt templates — create a .txt file in prompts/
  3. Report encoding issues — include anonymized samples and logs

See CONTRIBUTING.md.



License

This project's code is licensed under MIT.

Bundled open-source libraries

doc-cleaner stands on these libraries; listed honestly with their licenses:

| Library | Role | License |
| --- | --- | --- |
| PyMuPDF | PDF parsing & table detection | AGPL-3.0 / commercial dual license |
| python-docx | DOCX | MIT |
| pandas · openpyxl · xlrd | XLSX / XLS / CSV | BSD / MIT |
| python-pptx | PPTX | MIT |
| ezdxf | DXF | MIT |
| numbers-parser | Apple Numbers | MIT |
| keynote-parser | Apple Keynote | MIT |
| lxml | EPUB / XML parsing | BSD |
| Pillow | image handling | HPND |
| pywebview | desktop GUI | BSD |
| tabulate | Markdown table output | MIT |

PyMuPDF is dual-licensed under AGPL-3.0 / commercial; this project's source is public on GitHub, satisfying AGPL's source-availability requirement for distribution.