Alfaz-Ahmad/PDF-Heading-Extraction

📄 PDF Heading Extractor & Classifier

This project parses PDF documents, extracts headings, and classifies the remaining text into structured data. The pipeline combines PDF parsing, heading extraction using NLP models, and text classification using ML algorithms, producing a clean CSV output for further analysis.


🚀 Project Workflow

flowchart TD
    A[PDF Files] --> B[Parser]
    B -->|CSV| C[Heading Extractor]
    C -->|Headings CSV| D[Classifier]
    C -->|Remaining Text| D
    D -->|Structured CSV| E[Final Output]

🧩 Components

1️⃣ Parser

Parses raw PDF documents and extracts text into CSV format.
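As a rough illustration of the CSV step, here is a minimal sketch that assumes the page text has already been pulled out of the PDF by whichever library the parser relies on. The function name `pages_to_csv`, the input shape (one string per page), and the column names `page`, `line_no`, and `text` are all hypothetical and not taken from this repository.

```python
import csv
import io


def pages_to_csv(pages, out):
    """Write one CSV row per non-empty line of extracted page text.

    `pages` is a list of strings, one per PDF page (assumed input shape).
    `out` is any writable text stream. Returns the number of data rows.
    """
    writer = csv.writer(out)
    writer.writerow(["page", "line_no", "text"])  # hypothetical column names
    rows = 0
    for page_no, page_text in enumerate(pages, start=1):
        line_no = 0
        for raw in page_text.splitlines():
            text = " ".join(raw.split())  # collapse runs of whitespace
            if not text:
                continue  # skip blank lines
            line_no += 1
            writer.writerow([page_no, line_no, text])
            rows += 1
    return rows


# Usage with two made-up pages of text
buf = io.StringIO()
count = pages_to_csv(["1 Introduction\n\nPDFs are   hard.", "2 Method\nWe parse."], buf)
```

Keeping the page and line numbers in the CSV lets later stages refer back to where a piece of text came from.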
Libraries used:


2️⃣ Heading Extractor

Identifies headings and structures the document hierarchy.
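To make "heading plus hierarchy level" concrete, the sketch below uses a simple rule-based stand-in rather than an NLP model: it treats lines such as `2.3 Results` as headings and derives the level from the depth of the section number. The regex, the 80-character cutoff, and the name `heading_level` are illustrative assumptions, not the project's actual method.

```python
import re

# Matches "1 Introduction", "2.3 Results", "2.3.1. Setup":
# a dotted section number, optional trailing dot, then a capitalised title.
NUMBERED = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+([A-Z].*)$")


def heading_level(line):
    """Return (level, title) for a numbered heading, or None for body text."""
    m = NUMBERED.match(line.strip())
    if m is None:
        return None
    title = m.group(2)
    # Very long lines, or lines ending in a full stop, are more likely sentences.
    if len(title) > 80 or title.endswith("."):
        return None
    level = m.group(1).count(".") + 1  # "2.3.1" has two dots, so level 3
    return level, title
```

A heuristic like this misses unnumbered headings entirely, which is one reason a learned model is attractive for this stage.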
Models explored:


3️⃣ Classifier

Classifies the remaining text (non-headings) into categories.
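The sketch below shows one way a text segment could be assigned a category, using a small multinomial naive Bayes classifier written from scratch. It is a stand-in for illustration only: the algorithm choice, the class name `NaiveBayes`, the training sentences, and the labels `method` and `result` are all assumptions, not the algorithms or categories this project uses.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (deliberately simplistic)."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(doc_count / total_docs)  # log prior
            for word in tokenize(text):
                # Unseen words get a count of 0 and are smoothed to 1 / denom.
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training data with hypothetical category labels
train_texts = [
    "we train the model with gradient descent",
    "the model is trained on labelled pages",
    "accuracy improved by four points",
    "the table reports accuracy and recall",
]
train_labels = ["method", "method", "result", "result"]
clf = NaiveBayes().fit(train_texts, train_labels)
```

Calling `clf.predict(segment)` on each non-heading segment would then yield the category column of the classified CSV.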
Algorithms used:


📂 Output

The pipeline generates:

  • Headings CSV → Extracted headings with hierarchy levels
  • Classified CSV → Classified text segments mapped under respective headings
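One simple way to map text segments under their headings is to walk the rows in document order and attach each body row to the most recent heading seen. The sketch below assumes a hypothetical intermediate format of `(kind, text)` tuples; the name `group_under_headings` and the output keys are illustrative only.

```python
def group_under_headings(rows):
    """Attach each body row to the most recent heading above it.

    `rows` is a list of (kind, text) tuples in document order, where kind is
    "heading" or "body" (an assumed intermediate format).
    """
    current = None  # body text before the first heading gets heading None
    grouped = []
    for kind, text in rows:
        if kind == "heading":
            current = text
        else:
            grouped.append({"heading": current, "text": text})
    return grouped


# Usage with made-up rows
example = group_under_headings([
    ("body", "Preprint, under review."),
    ("heading", "Introduction"),
    ("body", "PDFs are hard to parse."),
    ("heading", "Method"),
    ("body", "We extract text line by line."),
])
```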

📊 Example

Input: research_paper.pdf
Output:

  • headings.csv (list of extracted headings)
  • classified.csv (structured classified text under headings)

🔮 Future Enhancements

  • Improve heading extraction accuracy with fine-tuned NLP models
  • Support multi-column and scanned PDFs (OCR integration with Tesseract)
  • Export results to JSON for easier integration with other tools
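For the JSON export idea, a conversion from the existing CSV output could be as small as the sketch below. It assumes the CSV has a header row; the function name `csv_to_json` and the sample columns `heading` and `level` are hypothetical.

```python
import csv
import io
import json


def csv_to_json(csv_text):
    """Convert CSV text with a header row into a JSON array of objects."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # All values stay strings, exactly as csv.DictReader reads them.
    return json.dumps(rows, ensure_ascii=False, indent=2)


# Usage with a made-up headings file
as_json = csv_to_json("heading,level\nIntroduction,1\nMethod,1\n")
```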

🤝 Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you’d like to change.

