Unstructured Data Pipeline

An agentic pipeline built on LangChain and Google Gemini that extracts structured invoice data from unstructured text files and stores it in a SQLite database.

Features

🤖 AI-Powered Extraction: Leverages Google Gemini LLM to intelligently parse invoice data
🔄 Agentic Workflow: Uses LangChain agents with tool calling for autonomous database operations
📊 Schema Validation: Enforces strict Pydantic schemas for invoice and item details
💾 SQLite Integration: Automatically creates and populates relational databases with foreign key constraints
🧵 Concurrent Processing: Validates and reads input files on multiple threads
✅ Comprehensive Testing: Full test suite with pytest

Project Structure

Unstructured Data Pipeline/
├── src/
│   └── unstructured-data-pipeline/
│       ├── __init__.py
│       ├── config.py                      # Configuration and environment variables
│       ├── main.py                        # CLI entry point
│       ├── .env.local                     # Environment configuration
│       ├── .gitignore
│       │
│       ├── application/
│       │   └── pipelines/
│       │       └── llm_pipeline.py        # Agent creation and template formatting
│       │
│       ├── domain/
│       │   ├── prompts/
│       │   │   ├── human_instructions.py  # User input template
│       │   │   └── llm_instructions.py    # System prompt template
│       │   │
│       │   ├── schema/
│       │   │   ├── invoice.py             # Invoice and ItemDetails Pydantic models
│       │   │   └── llm_output.py          # LLM response schema
│       │   │
│       │   ├── tools/
│       │   │   ├── file_creation.py       # Database file creation tools
│       │   │   └── sql_execution.py       # SQL execution with retry logic
│       │   │
│       │   └── validators/
│       │       └── file_naming.py         # File path and type validators
│       │
│       ├── infrastructure/
│       │   ├── CLI/
│       │   │   └── main.py                # File reading with threading
│       │   │
│       │   └── llm_providers/
│       │       └── gemini_provider.py     # Google Gemini LLM initialization
│       │
│       └── saved_files/                   # Default database storage location
│
├── tests/
│   ├── __init__.py
│   ├── test_app.py                        # Comprehensive test suite
│   └── test_files/
│       ├── invoice_clean.txt
│       ├── invoice_email.txt
│       └── invoice_messy.txt
│
├── Makefile                               # Build automation commands
├── pyproject.toml                         # Project metadata and dependencies
├── run_tests.py                           # Test runner script
└── README.md                              # This file

Installation

Prerequisites

Python 3.10 or higher
Google Gemini API key

Setup

Clone the repository

git clone <repository-url>
cd "Unstructured Data Pipeline"

Install dependencies
```
make install
# or
pip install -e .
```

Configure environment variables

Create or edit src/unstructured-data-pipeline/.env.local:

FILES_STORAGE=/path/to/your/database/Data.db
FILE_TYPES_ALLOWED=txt
PROCESS_COMMAND=process <input_path> --db <output_db_path>
GOOGLE_GEMINI_API_KEY=your-api-key-here
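
As a rough illustration of how these variables could be consumed, here is a minimal sketch using only the standard library. The `Settings` class, the `load_settings` function, the fallback values, and the comma-separated handling of `FILE_TYPES_ALLOWED` are assumptions made for this example; the project's actual `config.py` may be organized differently.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the values defined in .env.local."""
    files_storage: Path
    file_types_allowed: tuple[str, ...]
    process_command: str
    gemini_api_key: str


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping of variables (defaults to os.environ)."""
    env = os.environ if env is None else env

    api_key = env.get("GOOGLE_GEMINI_API_KEY", "")
    if not api_key:
        # Fail early: nothing downstream works without the key.
        raise RuntimeError("GOOGLE_GEMINI_API_KEY is not set")

    # Assumption: several extensions may be given as a comma-separated list.
    raw_types = env.get("FILE_TYPES_ALLOWED", "txt")
    types = tuple(t.strip().lower() for t in raw_types.split(",") if t.strip())

    return Settings(
        # Assumption: fall back to a file under saved_files/ when unset.
        files_storage=Path(env.get("FILES_STORAGE", "saved_files/Data.db")),
        file_types_allowed=types,
        process_command=env.get(
            "PROCESS_COMMAND", "process <input_path> --db <output_db_path>"
        ),
        gemini_api_key=api_key,
    )
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to exercise in tests without touching the real environment.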

Usage

Running the Pipeline

Navigate to the source directory and run the main script:

cd src/unstructured-data-pipeline
python main.py

CLI Commands

When prompted, use the following commands:

View available commands:
```
Commands
```

Process a file:

process <path/to/invoice.txt> --db <path/to/output.db>

Example:

process ../../tests/test_files/invoice_clean.txt --db ./saved_files/invoices.db
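
For readers curious how a line like the one above might be split into its parts, the following sketch parses it with `shlex` and `argparse` from the standard library. The names `parse_command` and `_RaisingParser` are hypothetical and are not taken from the project's CLI code.

```python
import argparse
import shlex


class _RaisingParser(argparse.ArgumentParser):
    """ArgumentParser that raises instead of exiting the interactive prompt."""

    def error(self, message):
        raise ValueError(message)


def parse_command(line: str) -> tuple[str, str]:
    """Parse 'process <input_path> --db <output_db_path>' into (input, db)."""
    tokens = shlex.split(line)  # honours quotes, so paths may contain spaces
    if not tokens or tokens[0] != "process":
        raise ValueError(f"unknown command: {line!r}")

    parser = _RaisingParser(prog="process", add_help=False)
    parser.add_argument("input_path")
    parser.add_argument("--db", dest="db_path", required=True)
    args = parser.parse_args(tokens[1:])
    return args.input_path, args.db_path
```

Overriding `error` matters in an interactive loop: the default `argparse` behaviour is to print usage and terminate the process, which would end the session on a single typo.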

Exit the program:
```
Exit
```

How It Works

File Validation: The system checks that each input path is well formed, has an allowed type (.txt), and points to an existing file
Content Extraction: Validated files are read by worker threads
AI Processing: Google Gemini analyzes the unstructured text and extracts invoice details
Database Operations: The agent autonomously:
- Checks if the database file exists
- Creates the database if needed
- Creates tables matching the invoice schema with foreign key relationships
- Inserts extracted data with proper validation
Result: Structured invoice data stored in a relational SQLite database
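
The first two steps above can be sketched with the standard library alone. This is an illustrative outline, not the project's implementation: `validate_path`, `read_files`, `ALLOWED_SUFFIXES`, and the pool size of 4 are names and values chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Assumption: mirrors FILE_TYPES_ALLOWED=txt from .env.local.
ALLOWED_SUFFIXES = {".txt"}


def validate_path(path: Path) -> Path:
    """Reject files with a disallowed extension or that do not exist."""
    if path.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"unsupported file type: {path.name}")
    if not path.is_file():
        raise FileNotFoundError(path)
    return path


def read_files(paths) -> dict:
    """Validate and read every file on a thread pool; returns {path: text}."""

    def load(p: Path):
        return p, validate_path(p).read_text(encoding="utf-8")

    with ThreadPoolExecutor(max_workers=4) as pool:
        # pool.map preserves input order and re-raises the first worker error.
        return dict(pool.map(load, [Path(p) for p in paths]))
```

Threads suit this step because reading files is I/O-bound; the text returned here is what would then be handed to the LLM for extraction.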

Testing

Run Tests

# Using the test runner
python run_tests.py

# Using Make
make test

# Using pytest directly
pytest tests/

Test Coverage

The test suite includes:

File validation (path, type, existence)
Concurrent file reading
Database file creation
SQL execution with error handling
Schema enforcement

Development

Code Quality

# Lint the code
make lint

# Format the code
make format

# Clean cache files
make clean

Dependencies

Core dependencies:

langchain - Agent framework
langchain-google-genai - Google Gemini integration
langgraph - Agent orchestration
pydantic - Schema validation
python-dotenv - Environment configuration

Development dependencies:

pytest - Testing framework
ruff - Linting and formatting

Schema

Invoice Schema

Invoice:
  - Sender_Name: str
  - Date: date
  - Item_Details: List[ItemDetails]
  - Total_Amount: float
  - Currency: str

ItemDetails:
  - Description: str
  - Quantity: str
  - Unit_price: int
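
To show how this schema could map onto relational tables with a foreign key, here is a hedged sketch using Python's built-in `sqlite3` module. In the actual pipeline the agent generates its own SQL, so the table names, column names, and the `save_invoice` helper below are assumptions for illustration only. Column types follow the field types listed above.

```python
import sqlite3

# Hypothetical table layout: one row per invoice, one row per line item.
SCHEMA = """
CREATE TABLE IF NOT EXISTS invoice (
    id           INTEGER PRIMARY KEY,
    sender_name  TEXT NOT NULL,
    date         TEXT NOT NULL,      -- ISO 8601 date string
    total_amount REAL NOT NULL,
    currency     TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS item_details (
    id          INTEGER PRIMARY KEY,
    invoice_id  INTEGER NOT NULL REFERENCES invoice(id) ON DELETE CASCADE,
    description TEXT NOT NULL,
    quantity    TEXT NOT NULL,
    unit_price  INTEGER NOT NULL
);
"""


def save_invoice(conn: sqlite3.Connection, invoice: dict) -> int:
    """Create the tables if needed, insert one invoice, return its row id."""
    # SQLite does not enforce foreign keys unless this is set per connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    with conn:  # commits on success, rolls back if an insert fails
        cur = conn.execute(
            "INSERT INTO invoice (sender_name, date, total_amount, currency) "
            "VALUES (?, ?, ?, ?)",
            (invoice["Sender_Name"], invoice["Date"],
             invoice["Total_Amount"], invoice["Currency"]),
        )
        invoice_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO item_details "
            "(invoice_id, description, quantity, unit_price) VALUES (?, ?, ?, ?)",
            [(invoice_id, item["Description"], item["Quantity"], item["Unit_price"])
             for item in invoice["Item_Details"]],
        )
    return invoice_id
```

Parameterized queries (`?` placeholders) keep extracted text from being interpreted as SQL, which is worth preserving in any variant of this code since the values originate from untrusted documents.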
