This project parses PDF documents, extracts their headings, and classifies the remaining text into structured data. The pipeline chains three stages (PDF parsing, heading extraction with NLP models, and text classification with ML algorithms) and writes the results to CSV files for further analysis.
```mermaid
flowchart TD
    A[PDF Files] --> B[Parser]
    B -->|CSV| C[Heading Extractor]
    C -->|Headings CSV| D[Classifier]
    C -->|Remaining Text| D
    D -->|Structured CSV| E[Final Output]
```
Parses raw PDF documents and extracts text into CSV format.
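As a rough illustration of the CSV side of this stage, the sketch below writes already-extracted page text to CSV using only the standard library. It assumes the PDF library of your choice has returned one string per page; the function name `pages_to_csv` and the `page`, `line`, `text` columns are hypothetical and may not match this project's actual layout.

```python
import csv
import io


def pages_to_csv(pages):
    """Return CSV text with one row per non-empty line: (page, line, text)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["page", "line", "text"])  # hypothetical column names
    for page_number, page_text in enumerate(pages, start=1):
        line_number = 0
        for raw_line in page_text.splitlines():
            text = raw_line.strip()
            if not text:
                continue  # skip blank lines left over from extraction
            line_number += 1
            writer.writerow([page_number, line_number, text])
    return buffer.getvalue()


# Made-up page texts, standing in for what a PDF library would return.
sample_pages = [
    "1. Introduction\n\nPDFs are hard to parse.",
    "2. Methods\nWe use a pipeline.",
]
parsed_csv = pages_to_csv(sample_pages)
```

Keeping the page and line numbers in the CSV lets later stages preserve document order when they split headings from body text.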
Libraries used:
Identifies headings and structures the document hierarchy.
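To make the idea of a heading hierarchy concrete, here is a minimal sketch that uses a plain regex heuristic in place of the NLP models. It only recognizes numbered headings such as "2.1 Data Collection" and derives the level from the numbering depth. The pattern, the 80-character limit, and the name `extract_headings` are assumptions for illustration, not this project's method.

```python
import re

# Numbered headings such as "1. Introduction" or "2.1 Data Collection".
HEADING_PATTERN = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+([A-Z].*)$")


def extract_headings(lines):
    """Split lines into (headings, remaining) using a numbering heuristic."""
    headings, remaining = [], []
    for line in lines:
        stripped = line.strip()
        match = HEADING_PATTERN.match(stripped)
        # Short numbered lines that do not end like a sentence count as headings.
        if match and len(stripped) <= 80 and not stripped.endswith("."):
            number, title = match.groups()
            level = number.count(".") + 1  # "2.1" -> level 2
            headings.append({"level": level, "number": number, "title": title})
        else:
            remaining.append(line)
    return headings, remaining


sample_lines = [
    "1. Introduction",
    "PDFs are hard to parse.",
    "2. Methods",
    "2.1 Data Collection",
    "We gathered 40 papers.",
]
headings, remaining = extract_headings(sample_lines)
```

A heuristic like this misses unnumbered headings, which is one reason a learned model is the better fit for real documents.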
Models explored:
Classifies the remaining text (non-headings) into categories.
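The sketch below shows one way such a classification step could look: a small multinomial Naive Bayes classifier with Laplace smoothing, written in pure Python. Naive Bayes is chosen here only as an example; it is not necessarily one of the algorithms this project uses, and the `method` and `result` categories and training sentences are invented.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Collect per-label word counts from a list of (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocabulary = set()
    for text, label in examples:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocabulary.update(tokens)
    return word_counts, label_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total_examples)  # log prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            # Laplace (add-one) smoothing keeps unseen words from zeroing the score.
            smoothed = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training data with hypothetical categories.
training_examples = [
    ("we collected data from surveys", "method"),
    ("samples were prepared using the protocol", "method"),
    ("accuracy improved by ten percent", "result"),
    ("the model achieved high accuracy", "result"),
]
model = train_naive_bayes(training_examples)
predicted = classify("data were collected using surveys", model)  # "method"
```

In practice the training examples would come from labeled text segments rather than a hard-coded list.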
Algorithms used:
The pipeline generates:
- Headings CSV → Extracted headings with hierarchy levels
- Classified CSV → Classified text segments mapped under respective headings
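The mapping of classified segments under their headings can be sketched as a single pass over the document in order, remembering the most recent heading. The segment dictionaries, the `heading`, `category`, `text` columns, and the function name `build_classified_csv` are assumptions for illustration; the project's real schema may differ.

```python
import csv
import io


def build_classified_csv(segments):
    """Write each text segment under the most recent heading seen before it.

    segments: document-order list of dicts with 'kind' ('heading' or 'text'),
    'text', and, for text segments, 'category'.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["heading", "category", "text"])  # hypothetical columns
    current_heading = ""  # text before the first heading gets an empty heading
    for segment in segments:
        if segment["kind"] == "heading":
            current_heading = segment["text"]
        else:
            writer.writerow([current_heading, segment["category"], segment["text"]])
    return buffer.getvalue()


# Made-up segments in document order.
sample_segments = [
    {"kind": "heading", "text": "Introduction"},
    {"kind": "text", "text": "PDFs are hard to parse.", "category": "background"},
    {"kind": "heading", "text": "Methods"},
    {"kind": "text", "text": "We use a pipeline.", "category": "method"},
]
classified_csv = build_classified_csv(sample_segments)
```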
Input: research_paper.pdf
Output:
- headings.csv (list of extracted headings)
- classified.csv (structured classified text under headings)
- Improve heading extraction accuracy with fine-tuned NLP models
- Support multi-column and scanned PDFs (OCR integration with Tesseract)
- Export results to JSON for easier integration with other tools
Pull requests are welcome. For major changes, please open an issue first to discuss what you’d like to change.