A doctor-assistive AI system that interprets medical knowledge and patient images simultaneously. It utilizes a Dual-Encoder architecture to cross-reference textbook theory with visual pathology, generating clinically grounded diagnoses.
(The pipeline: Patient Image + Query -> Dual Vector Search -> Multimodal Reasoning -> Diagnosis)
This project addresses the "Modality Gap" in medical AI. Standard RAG systems are text-blind; they cannot "see" the X-Ray or Skin Lesion a doctor is asking about.
Our Solution: We built a multimodal pipeline that understands medical text and patient images together. By ingesting high-quality structured data (from Project 1), this system allows a clinician to upload a patient image and ask, "What is this condition and how should I treat it?". The AI then retrieves visually similar case studies and relevant medical literature to provide an evidence-based answer.
- Image-Aware Diagnosis: Unlike text-only models, this system matches patient photos with textbook diagrams for higher clinical confidence.
- High-Precision Alignment: Achieved 100% text retrieval accuracy and near-perfect image retrieval by using a shared semantic space for MiniLM and OpenCLIP vectors.
- Low Hallucination: Outputs are grounded in retrieved structured JSONL data (tables, diagrams, metadata), significantly reducing medical errors.
The pipeline consists of three advanced engineering stages:
We employ two specialized models to handle different data modalities:
- Text Stream: Uses MiniLM-L6-v2 to embed medical text, tables, and JSONL metadata.
- Visual Stream: Uses OpenCLIP ViT-B/32 to embed medical diagrams and patient images into the same vector space.
All vectors are stored in ChromaDB with rich metadata (page, coordinates, image path). When a query comes in:
- It finds the Top-N text chunks (Symptoms, Treatment).
- It finds visually similar diagrams for cross-verification.
The retrieved context (Text + Images) is fed into IDEFICS2-8B, a powerful Vision-Language Model. We use 4-bit Quantization to run this massive model efficiently on an NVIDIA L4 GPU, allowing it to "see" the retrieved evidence and generate a diagnosis.
###π₯ Model Setup (Critical)
Since requirements.txt only installs libraries, you need to set up the Model Weights (several GBs).
Option A: Auto-Download (Recommended) The scripts are designed to automatically download the models from HuggingFace on the first run.
- Run
embed_data.py: Downloadssentence-transformers/all-MiniLM-L6-v2andlaion/CLIP-ViT-B-32. - Run
ui_idefics_app.py: DownloadsHuggingFaceM4/idefics2-8b(approx 15GB). - Note: Ensure you have a stable internet connection for the first execution.
Option B: Manual Download (Offline Mode)
If you are on a restricted network, download these models manually from HuggingFace and update the paths in config or script variables:
- Text Embedding: sentence-transformers/all-MiniLM-L6-v2
- Image Embedding: laion/CLIP-ViT-B-32-laion2B-s34B-b79K
- Inference Model: HuggingFaceM4/idefics2-8b
Follow these steps strictly to deploy the system.
git clone [https://github.com/revoker3661/Multimodal-Clinical-RAG-Assistant-Medical-Text-Image-Retrieval-System-.git]( https://github.com/revoker3661/Multimodal-Clinical-RAG-Assistant-Medical-Text-Image-Retrieval-System-.git)
cd Multimodal-Clinical-RAGpython -m venv venvFor Windows:
.\venv\Scripts\activatesource venv/bin/activatepip install -r requirements.txtEnsure your CUDA 12.8 environment is ready for 4-bit quantization.
python scripts/check_versions.pyYou must populate the vector database before asking questions.
Ensure your structured output and images are located in the input_data folder structure: β’ input_data/DavidsonMedicine24th/structured_output.jsonl β’ input_data/DavidsonMedicine24th/[...images...]
Run the embedding engine to process text and images using the Dual-Encoders.
python scripts/embed_data.py3. Validate DatabaseCheck if the vectors are correctly stored in ChromaDB.
python scripts/validate_db.pyLaunch the Streamlit Interface to interact with the AI doctor assistant.
# Launch the UI
streamlit run scripts/ui_idefics_app.py(Alternatively, you can run streamlit run scripts/streamlit_app.py if using the updated dashboard) How to use:
- Upload a patient image (optional) or ask a text question.
- The system retrieves related diagrams and textbook passages.
- IDEFICS2 analyzes the combined context and provides a diagnosis, causes, and treatment plan.
Plaintext
Multimodal-Clinical-RAG/
.
βββ exact_version.txt
βββ input_data/
β βββ DavidsonMedicine24th/
β β βββ [... approx 1000+ figure-xxx.jpg files ...]
β β βββ structured_output.jsonl
β βββ Firestein & Kelleyβs Textbook of Rheumatology, 2-Volume Set.../
β β βββ [... approx 1000+ figure-xxx.jpg files ...]
β β βββ structured_output.jsonl
β βββ Goldman-Cecil Medicine/
β βββ [... figure-xxx.jpg files ...]
β βββ structured_output.jsonl
βββ model_cache/
β βββ idefics2
βββ models/
β βββ idefics2/
β βββ models--HuggingFaceM4--idefics2-8b
βββ output_db/
β βββ 425d4a71-0f53-416b-a24a-c6796cdf880a/
β β βββ data_level0.bin
β β βββ header.bin
β β βββ index_metadata.pickle
β β βββ length.bin
β β βββ link_lists.bin
β βββ 5feea19d-1699-4cdd-8914-8c7afb6eaf58/
β β βββ data_level0.bin
β β βββ header.bin
β β βββ index_metadata.pickle
β β βββ length.bin
β β βββ link_lists.bin
β βββ chroma.sqlite3
βββ project_structure.txt
βββ requirements.txt
βββ scripts/
βββ __pycache__/
βββ sessions/
β βββ [... various .json session files ...]
βββ app.py
βββ check_versions.py
βββ download.py
βββ embed_data.py
βββ hello.py
βββ idefics_pipeline.py
βββ idefics_pipeline_v3.py
βββ idefics_pipeline_v3_fix.py
βββ idefics_pipeline_v4.py
βββ latestvalidate.py
βββ retrieve_app.py
βββ run_out.log
βββ streamlit_app.py
βββ test_chroma_query.py
βββ test_idefics2.py
βββ ui_idefics_app.py
βββ validate_db.py
βββ validate_embeddings.py
βββ README.md # π Manual
Component Technology Performance Text Embedding MiniLM-L6-v2 384-dim, High semantic overlap Image Embedding OpenCLIP ViT-B/32 512-dim, Zero-shot alignment Inference Engine IDEFICS2-8B 4-bit Quantized (BitsAndBytes) Hardware NVIDIA L4 GPU Efficient VRAM usage (~12GB)
This project is part of a larger research initiative to build scalable healthcare-grade multimodal RAG systems. Contributions are welcome!