Skip to content

revoker3661/Multimodal-Clinical-RAG-Assistant-Medical-Text-Image-Retrieval-System-

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

6 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Multimodal Clinical RAG Assistant (Medical Text + Image Retrieval System)

Python Model Precision Hardware

A doctor-assistive AI system that interprets medical knowledge and patient images simultaneously. It utilizes a Dual-Encoder architecture to cross-reference textbook theory with visual pathology, generating clinically grounded diagnoses.


πŸ–ΌοΈ System Visualization

Project Thumbnail

(The pipeline: Patient Image + Query -> Dual Vector Search -> Multimodal Reasoning -> Diagnosis)


πŸ₯ Purpose & Clinical Impact

This project addresses the "Modality Gap" in medical AI. Standard RAG systems are text-blind; they cannot "see" the X-Ray or Skin Lesion a doctor is asking about.

Our Solution: We built a multimodal pipeline that understands medical text and patient images together. By ingesting high-quality structured data (from Project 1), this system allows a clinician to upload a patient image and ask, "What is this condition and how should I treat it?". The AI then retrieves visually similar case studies and relevant medical literature to provide an evidence-based answer.

🎯 Key Results

  • Image-Aware Diagnosis: Unlike text-only models, this system matches patient photos with textbook diagrams for higher clinical confidence.
  • High-Precision Alignment: Achieved 100% text retrieval accuracy and near-perfect image retrieval by using a shared semantic space for MiniLM and OpenCLIP vectors.
  • Low Hallucination: Outputs are grounded in retrieved structured JSONL data (tables, diagrams, metadata), significantly reducing medical errors.

βš™οΈ System Architecture

The pipeline consists of three advanced engineering stages:

1. Dual-Encoder Embedding Engine

We employ two specialized models to handle different data modalities:

  • Text Stream: Uses MiniLM-L6-v2 to embed medical text, tables, and JSONL metadata.
  • Visual Stream: Uses OpenCLIP ViT-B/32 to embed medical diagrams and patient images into the same vector space.

2. Hybrid Retrieval (ChromaDB)

All vectors are stored in ChromaDB with rich metadata (page, coordinates, image path). When a query comes in:

  • It finds the Top-N text chunks (Symptoms, Treatment).
  • It finds visually similar diagrams for cross-verification.

3. Multimodal Reasoning (IDEFICS2-8B)

The retrieved context (Text + Images) is fed into IDEFICS2-8B, a powerful Vision-Language Model. We use 4-bit Quantization to run this massive model efficiently on an NVIDIA L4 GPU, allowing it to "see" the retrieved evidence and generate a diagnosis.


###πŸ“₯ Model Setup (Critical)

Since requirements.txt only installs libraries, you need to set up the Model Weights (several GBs).

Option A: Auto-Download (Recommended) The scripts are designed to automatically download the models from HuggingFace on the first run.

  • Run embed_data.py: Downloads sentence-transformers/all-MiniLM-L6-v2 and laion/CLIP-ViT-B-32.
  • Run ui_idefics_app.py: Downloads HuggingFaceM4/idefics2-8b (approx 15GB).
  • Note: Ensure you have a stable internet connection for the first execution.

Option B: Manual Download (Offline Mode) If you are on a restricted network, download these models manually from HuggingFace and update the paths in config or script variables:

  1. Text Embedding: sentence-transformers/all-MiniLM-L6-v2
  2. Image Embedding: laion/CLIP-ViT-B-32-laion2B-s34B-b79K
  3. Inference Model: HuggingFaceM4/idefics2-8b

πŸš€ Setup & Installation Guide

Follow these steps strictly to deploy the system.

Step 1: Clone the Repository

git clone [https://github.com/revoker3661/Multimodal-Clinical-RAG-Assistant-Medical-Text-Image-Retrieval-System-.git]( https://github.com/revoker3661/Multimodal-Clinical-RAG-Assistant-Medical-Text-Image-Retrieval-System-.git)
cd Multimodal-Clinical-RAG

Step 2: Create Environment

python -m venv venv

Activate environment

For Windows:

.\venv\Scripts\activate

For Linux/Mac:

source venv/bin/activate

Step 3: Install Dependencies

pip install -r requirements.txt

Step 4: Verify GPU & Versions

Ensure your CUDA 12.8 environment is ready for 4-bit quantization.

python scripts/check_versions.py

🧠 Data Ingestion Workflow

You must populate the vector database before asking questions.

1. Place Input Data

Ensure your structured output and images are located in the input_data folder structure: β€’ input_data/DavidsonMedicine24th/structured_output.jsonl β€’ input_data/DavidsonMedicine24th/[...images...]

2. Generate Embeddings

Run the embedding engine to process text and images using the Dual-Encoders.

python scripts/embed_data.py3. Validate Database

Check if the vectors are correctly stored in ChromaDB.

python scripts/validate_db.py

▢️ Usage: Clinical Diagnosis UI

Launch the Streamlit Interface to interact with the AI doctor assistant.

# Launch the UI
streamlit run scripts/ui_idefics_app.py

(Alternatively, you can run streamlit run scripts/streamlit_app.py if using the updated dashboard) How to use:

  1. Upload a patient image (optional) or ask a text question.
  2. The system retrieves related diagrams and textbook passages.
  3. IDEFICS2 analyzes the combined context and provides a diagnosis, causes, and treatment plan.

πŸ“‚ Project Structure

Plaintext
Multimodal-Clinical-RAG/
.
β”œβ”€β”€ exact_version.txt
β”œβ”€β”€ input_data/
 β”‚   β”œβ”€β”€ DavidsonMedicine24th/
 β”‚   β”‚   β”œβ”€β”€ [... approx 1000+ figure-xxx.jpg files ...]
β”‚   β”‚   └── structured_output.jsonl
β”‚   β”œβ”€β”€ Firestein & Kelley’s Textbook of Rheumatology, 2-Volume Set.../
β”‚   β”‚   β”œβ”€β”€ [... approx 1000+ figure-xxx.jpg files ...]
β”‚   β”‚   └── structured_output.jsonl
β”‚   └── Goldman-Cecil Medicine/
β”‚       β”œβ”€β”€ [... figure-xxx.jpg files ...]
β”‚       └── structured_output.jsonl
β”œβ”€β”€ model_cache/
 β”‚   └── idefics2
β”œβ”€β”€ models/
 β”‚   └── idefics2/
 β”‚       └── models--HuggingFaceM4--idefics2-8b
β”œβ”€β”€ output_db/
β”‚   β”œβ”€β”€ 425d4a71-0f53-416b-a24a-c6796cdf880a/
β”‚   β”‚   β”œβ”€β”€ data_level0.bin
β”‚   β”‚   β”œβ”€β”€ header.bin
β”‚   β”‚   β”œβ”€β”€ index_metadata.pickle
β”‚   β”‚   β”œβ”€β”€ length.bin
β”‚   β”‚   └── link_lists.bin
β”‚   β”œβ”€β”€ 5feea19d-1699-4cdd-8914-8c7afb6eaf58/
β”‚   β”‚   β”œβ”€β”€ data_level0.bin
β”‚   β”‚   β”œβ”€β”€ header.bin
β”‚   β”‚   β”œβ”€β”€ index_metadata.pickle
β”‚   β”‚   β”œβ”€β”€ length.bin
β”‚   β”‚   └── link_lists.bin
β”‚   └── chroma.sqlite3
β”œβ”€β”€ project_structure.txt
β”œβ”€β”€ requirements.txt
└── scripts/
    β”œβ”€β”€ __pycache__/
    β”œβ”€β”€ sessions/
    β”‚   β”œβ”€β”€ [... various .json session files ...]
    β”œβ”€β”€ app.py
    β”œβ”€β”€ check_versions.py
    β”œβ”€β”€ download.py
    β”œβ”€β”€ embed_data.py
    β”œβ”€β”€ hello.py
    β”œβ”€β”€ idefics_pipeline.py
    β”œβ”€β”€ idefics_pipeline_v3.py
    β”œβ”€β”€ idefics_pipeline_v3_fix.py
    β”œβ”€β”€ idefics_pipeline_v4.py
    β”œβ”€β”€ latestvalidate.py
    β”œβ”€β”€ retrieve_app.py
    β”œβ”€β”€ run_out.log
    β”œβ”€β”€ streamlit_app.py
    β”œβ”€β”€ test_chroma_query.py
    β”œβ”€β”€ test_idefics2.py
    β”œβ”€β”€ ui_idefics_app.py
    β”œβ”€β”€ validate_db.py
    └── validate_embeddings.py
└── README.md                    # πŸ“– Manual

πŸ“Š Performance Metrics

Component Technology Performance Text Embedding MiniLM-L6-v2 384-dim, High semantic overlap Image Embedding OpenCLIP ViT-B/32 512-dim, Zero-shot alignment Inference Engine IDEFICS2-8B 4-bit Quantized (BitsAndBytes) Hardware NVIDIA L4 GPU Efficient VRAM usage (~12GB)


🀝 Contributing

This project is part of a larger research initiative to build scalable healthcare-grade multimodal RAG systems. Contributions are welcome!

About

A doctor-assistive AI system that interprets medical knowledge and patient images simultaneously. It utilizes a Dual-Encoder architecture to cross-reference textbook theory with visual pathology, generating clinically grounded diagnoses.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages