Dr-Anoushka-Tomar/Project-Phoenix-PDAC-XAI

Project_Phoenix_PC_XAI_2025

PDAC Classification Pipeline — Multi-Modal CT-Based Detection of Pancreatic Ductal Adenocarcinoma

This repository contains an end-to-end pipeline for binary classification of Pancreatic Ductal Adenocarcinoma (PDAC) from CT scans. It combines radiomics feature extraction, 3D deep learning (MONAI ResNet), and ensemble fusion classifiers. The pipeline is implemented across seven Google Colab notebooks, each handling a distinct stage of the workflow.


Table of Contents

  1. Overview
  2. Dataset
  3. Pipeline Summary
  4. Notebooks
  5. Dependencies
  6. Project Structure
  7. Output Artefacts

Overview

The goal of this project is to classify CT scan volumes as either PDAC-positive (case) or healthy control, using a multi-stage hybrid approach:

  • Radiomics: Hand-crafted texture, shape, and intensity features extracted via PyRadiomics from segmented regions of interest.
  • 3D Deep Learning: A 3D ResNet-10 backbone (MONAI) trained on dual-channel inputs (CT image + segmentation mask).
  • Ensemble Fusion: Probability outputs from both streams are combined via average fusion and stacked meta-learners (XGBoost, Random Forest, Logistic Regression).
  • Explainability: Grad-CAM heatmaps are generated to visualise which regions of the CT scan drive model predictions.

Dataset

The pipeline uses two publicly available collections from The Cancer Imaging Archive (TCIA):

| Collection | Label | Modality |
| --- | --- | --- |
| CPTAC-PDA | Case (1): PDAC positive | CT + RTSTRUCT segmentations |
| Pancreas-CT | Control (0): healthy pancreas | CT + NIfTI segmentation labels |

Data is downloaded from TCIA manifest files using the nbia module of the tcia_utils Python library and stored in Google Drive.


Pipeline Summary

TCIA DICOM Download
        ↓
DICOM → NPY Conversion (HU windowing, isotropic resampling, 3D volume + 2D slices)
        ↓
Segmentation Mask Generation (RTSTRUCT → 3D binary mask; NIfTI → numpy mask)
        ↓
 ┌──────────────────────┬────────────────────────────────┐
 │   Radiomics Branch   │      Deep Learning Branch      │
 │  PyRadiomics features│  3D ResNet-10 (MONAI)          │
 │  per ROI (RTSTRUCT / │  Input: [CT image, mask] 2-ch  │
 │  NIfTI masks)        │  OOF mid-layer feature extract │
 └──────────┬───────────┴───────────┬────────────────────┘
            │                       │
            └──────────┬────────────┘
              NeuroHarmonize (ComBat scanner harmonisation)
                        ↓
       Ensemble Fusion (XGBoost / RF / LR)
       + Stacked Meta-Learner (Logistic Regression)
                        ↓
              ROC / AUC / SHAP / Calibration

Notebooks

1. DICOM Download and Preprocessing

File: DICOM_download_and_preprocessing_from_scratch.ipynb

This notebook handles data acquisition and initial metadata management.

What it does:

  • Mounts Google Drive and installs required libraries (tcia_utils, mirp, altair, simpleDicomViewer).
  • Downloads DICOM image series from TCIA using manifest .tcia files for three collections: CPTAC-PDA (cancer), CPTAC-PDA Negative Assessments, and Pancreas-CT (healthy controls).
  • Downloads the corresponding RTSTRUCT tumour annotation files for CPTAC-PDA.
  • Extracts patient-level metadata (patient list, study details, series reports) via nbia API calls and saves them as CSV files.
  • Visualises annotation metadata — scatter plots of StructureSetLabel by patient and ROI volumes by patient/timepoint using Altair.
  • Merges CPTAC-PDA and Pancreas-CT metadata CSVs, filters to CT-only rows, and adds a case_control_status column (1 = case, 0 = control).
  • Adds DICOM_folder_path columns for downstream volume loading.
  • Merges segmentation metadata from two sources (RTSTRUCT series metadata + annotation report CSV) and saves a combined segmentation manifest.
  • Uncompresses NIfTI .nii.gz label files for the Pancreas-CT collection.
  • Runs a basic radiomics pipeline using PyRadiomics and SimpleITK, including RTSTRUCT-to-mask rasterisation, CT series loading, and per-ROI feature extraction with a checkpoint/resume mechanism.
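
The merge-and-label step described above can be sketched with pandas. This is a minimal illustration rather than the notebook's code: the column names `PatientID` and `Modality` and the toy rows are assumptions, and only `case_control_status` with its 1/0 coding comes from the description.

```python
import pandas as pd

# Toy stand-ins for the per-collection metadata CSVs (column names are assumed).
cptac = pd.DataFrame({
    "PatientID": ["C3L-00001", "C3L-00002", "C3L-00002"],
    "Modality": ["CT", "CT", "MR"],
})
pancreas_ct = pd.DataFrame({
    "PatientID": ["PANCREAS_0001", "PANCREAS_0002"],
    "Modality": ["CT", "CT"],
})

def merge_case_control(cases: pd.DataFrame, controls: pd.DataFrame) -> pd.DataFrame:
    """Stack case and control metadata, keep CT rows, and add the binary label."""
    cases = cases.assign(case_control_status=1)        # 1 = case
    controls = controls.assign(case_control_status=0)  # 0 = control
    merged = pd.concat([cases, controls], ignore_index=True)
    return merged[merged["Modality"] == "CT"].reset_index(drop=True)

merged = merge_case_control(cptac, pancreas_ct)
print(merged)
```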

Key outputs:

  • CPTAC_PDA_merged.csv — merged case/control CT metadata
  • merged_seg_with_paths.csv — segmentation file paths linked to subjects
  • radiomics_features.csv — extracted radiomics features per ROI

2. DICOM to NPY Conversion

File: DICOM_to_NPY_conversion.ipynb

Converts raw DICOM CT series into normalised, isotropically-resampled 3D NumPy volumes ready for deep learning.

What it does:

  • Reads the merged metadata CSV and iterates over each subject's DICOM_folder_path.
  • Loads and sorts DICOM slices using pydicom, ordered by ImagePositionPatient z-coordinate, SliceLocation, or InstanceNumber.
  • Applies Rescale Slope and Intercept to convert pixel values to Hounsfield Units (HU).
  • Determines voxel spacing from PixelSpacing and SliceThickness DICOM tags (falling back to median z-diff if needed).
  • Resamples each volume to 1.0 × 1.0 × 1.0 mm isotropic spacing using scipy.ndimage.zoom with trilinear interpolation (order=1).
  • Applies HU windowing: clips to [−160, 240] HU and normalises to [0, 1].
  • Saves the full 3D volume as a single .npy file and also saves individual 2D axial slices as separate .npy files per subject.
  • Implements resumability: skips already-processed subjects and saves a progress CSV after each subject.
  • Processes 223 subjects in the run shown, taking approximately 3 hours 38 minutes on CPU.

Key configuration:

TARGET_SPACING = [1.0, 1.0, 1.0] mm
HU_WINDOW = [−160, 240]
Resampling order = 1 (trilinear)
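
A minimal sketch of the HU conversion, isotropic resampling, and windowing steps with the configuration above, using NumPy and `scipy.ndimage.zoom`. The function names, the (z, y, x) axis order, and the synthetic volume are assumptions made for illustration; the notebook's own code may differ.

```python
import numpy as np
from scipy.ndimage import zoom

TARGET_SPACING = (1.0, 1.0, 1.0)   # mm, assumed axis order (z, y, x)
HU_WINDOW = (-160.0, 240.0)

def to_hu(pixels, slope, intercept):
    """Apply DICOM RescaleSlope / RescaleIntercept to raw pixel values."""
    return pixels.astype(np.float32) * slope + intercept

def resample_isotropic(volume, spacing, target=TARGET_SPACING):
    """Resample to the target spacing with linear (order=1) interpolation."""
    factors = [s / t for s, t in zip(spacing, target)]
    return zoom(volume, factors, order=1)

def window_normalise(volume_hu, window=HU_WINDOW):
    """Clip to the HU window and rescale linearly to [0, 1]."""
    lo, hi = window
    clipped = np.clip(volume_hu, lo, hi)
    return ((clipped - lo) / (hi - lo)).astype(np.float32)

# Synthetic 10-slice volume with 2.0 mm slices and 0.5 mm in-plane spacing.
raw = np.full((10, 32, 32), 1024, dtype=np.int16)
hu = to_hu(raw, slope=1.0, intercept=-1024.0)          # every voxel is 0 HU
iso = resample_isotropic(hu, spacing=(2.0, 0.5, 0.5))  # -> 1 mm isotropic grid
vol = window_normalise(iso)
```

With this window, 0 HU maps to (0 + 160) / 400 = 0.4, so the example volume is constant at that value.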

Key outputs:

  • output_3d_volumes/<subject_id>.npy — one 3D float32 volume per subject
  • output_2d_slices/<subject_id>/slice_NNN.npy — per-slice 2D arrays
  • Updated CSV with path_image_3d_npy and path_slices_dir_npy columns

3. Segmentation Mask Generation

File: Segmentation_masks_generation.ipynb

Converts RTSTRUCT DICOM files and NIfTI label files into 3D binary NumPy mask arrays aligned to the CT volumes.

What it does:

  • Reads merged_seg_with_paths.csv and filters rows where Annotation Type == "Segmentation".
  • For NIfTI inputs: loads via nibabel, thresholds float arrays at > 0, and returns a uint8 binary mask.
  • For RTSTRUCT inputs:
    • Resolves the RTSTRUCT .dcm file (either direct path or by scanning a folder for Modality == RTSTRUCT).
    • Loads the referenced CT DICOM series and sorts slices by ImagePositionPatient z.
    • Rasterises contour polygons from ROIContourSequence into a 3D binary mask by mapping patient (x, y, z) coordinates to pixel indices using the CT's ImagePositionPatient, ImageOrientationPatient, and PixelSpacing DICOM tags, then using skimage.draw.polygon.
    • Matches each contour's median z-coordinate to the nearest CT slice (within a 1.0 mm tolerance).
  • Implements a search utility to locate a DICOM series folder by SeriesInstanceUID, scanning base directories with both fast name-match and full DICOM-header fallback.
  • Resumable: updates and saves the CSV after each successfully generated mask. Rows with errors are skipped and can be retried.
  • Masks are saved as .npy files and the mask_path column is populated in the CSV.
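
The coordinate mapping and slice matching used for RTSTRUCT rasterisation can be sketched as follows. This is an illustrative version under stated assumptions: the helper names are hypothetical, the geometry follows the standard DICOM convention (the first three ImageOrientationPatient values point along increasing column index, the last three along increasing row index, and PixelSpacing is [row spacing, column spacing]), and the polygon fill itself is left to `skimage.draw.polygon` as in the description.

```python
import numpy as np

def patient_to_pixel(points_xyz, ipp, iop, pixel_spacing):
    """Map patient-space (x, y, z) points in mm to fractional (row, col) indices."""
    points = np.asarray(points_xyz, dtype=float)
    origin = np.asarray(ipp, dtype=float)        # ImagePositionPatient of the slice
    row_dir = np.asarray(iop[:3], dtype=float)   # direction of increasing column index
    col_dir = np.asarray(iop[3:], dtype=float)   # direction of increasing row index
    row_spacing, col_spacing = pixel_spacing     # mm between rows, mm between columns
    rel = points - origin
    cols = rel @ row_dir / col_spacing
    rows = rel @ col_dir / row_spacing
    return rows, cols

def nearest_slice(contour_z, slice_zs, tol_mm=1.0):
    """Index of the CT slice closest to the contour's median z, or None if too far."""
    z = float(np.median(contour_z))
    dists = np.abs(np.asarray(slice_zs, dtype=float) - z)
    idx = int(np.argmin(dists))
    return idx if dists[idx] <= tol_mm else None

# Toy axial geometry: 0.5 mm pixels, slices at z = 48, 50, 52 mm.
contour = np.array([[-95.0, -90.0, 50.2], [-90.0, -90.0, 50.2], [-90.0, -85.0, 50.2]])
rows, cols = patient_to_pixel(contour, ipp=(-100.0, -100.0, 50.0),
                              iop=[1, 0, 0, 0, 1, 0], pixel_spacing=(0.5, 0.5))
slice_idx = nearest_slice(contour[:, 2], slice_zs=[48.0, 50.0, 52.0])
# rows and cols would then be passed to skimage.draw.polygon to fill mask[slice_idx].
```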

Key outputs:

  • segmentation_masks/<subject_id>_<series_id>.npy — 3D binary uint8 mask arrays
  • Updated merged_seg_with_paths.csv with mask_path column

4. Baseline 3D CNN Model

File: Baseline_model_final.ipynb

A TensorFlow/Keras-based 3D CNN trained as a baseline classifier, with optional dual-channel (CT + mask) input and Grad-CAM visualisation.

What it does:

Training:

  • Reads the merged CSV and dynamically determines whether segmentation masks are available; sets CHANNELS = 2 (image + mask) if masks are found on disk, otherwise CHANNELS = 1.
  • Resizes volumes to 128 × 128 × 128 using scipy zoom (order=1 for image, order=0 for mask), then normalises image to [0, 1].
  • Applies stratified 70/15/15 train/val/test split.
  • Builds a small 3D CNN: three Conv3D–BatchNorm–MaxPool3D blocks (16→32→64 filters), GlobalAveragePooling3D, Dense(128) + Dropout(0.4), sigmoid output.
  • Trains with binary_crossentropy, Adam, balanced class weights, and callbacks for model checkpointing (best val AUC), ReduceLROnPlateau, EarlyStopping (patience=8), and a custom HistorySaver that appends epoch metrics to a CSV after every epoch.
  • Supports resume from checkpoint: reads the saved model and history CSV to continue training from the last completed epoch.
  • Saves final model as .keras, test predictions, and training history plot (Loss + AUC vs Epoch).
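
One common way to obtain a stratified 70/15/15 split is two successive calls to scikit-learn's `train_test_split`. The sketch below assumes that approach; the function name, the seed, and the toy labels are illustrative, and the notebook may implement the split differently.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def stratified_70_15_15(labels, seed=42):
    """Return train/val/test index arrays with class proportions preserved."""
    labels = np.asarray(labels)
    idx = np.arange(len(labels))
    # First cut: 70% train, 30% held back.
    train_idx, rest_idx = train_test_split(
        idx, test_size=0.30, stratify=labels, random_state=seed)
    # Second cut: split the held-back 30% evenly into validation and test.
    val_idx, test_idx = train_test_split(
        rest_idx, test_size=0.50, stratify=labels[rest_idx], random_state=seed)
    return train_idx, val_idx, test_idx

labels = np.array([1] * 60 + [0] * 40)   # toy labels: 60 cases, 40 controls
train_idx, val_idx, test_idx = stratified_70_15_15(labels)
```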

Grad-CAM:

  • Two versions are implemented. The first loads a single-channel model and generates Grad-CAM using GradientTape on the last Conv3D layer; the second handles the dual-channel model with both CT and mask channels.
  • Upsamples the 3D heatmap back to the original volume resolution, overlays it on 6 axial slices, and saves the result as a PNG.

Key outputs:

  • best_model.keras / final_trained_model.keras
  • roc_curve.png, training_history_plot.png
  • test_predictions.csv
  • gradcam_visuals/gradcam_<subject>.png

5. 3D ResNet with Segmentation Masks

File: 3D_ResNet.ipynb

A PyTorch/MONAI-based 3D ResNet-10 classifier that takes dual-channel (CT + mask) input, trained CPU-safely with checkpointing and early stopping.

What it does:

Dataset:

  • Loads image .npy and mask .npy for each subject. If a mask is unavailable, a zero-tensor is used as the mask channel.
  • Clips CT to [−1000, 1000] HU, scales to [0, 1], and resizes to 64 × 128 × 128 (D × H × W) using scipy.ndimage.zoom.
  • Constructs a 2-channel tensor [image, mask] of shape (2, D, H, W).
  • Training augmentations: joint random 90° rotations and depth-axis flips on image and mask, random multiplicative/additive intensity scaling, and small Gaussian noise injection.
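
A NumPy-only sketch of the joint augmentation and 2-channel stacking described above. The function name, the perturbation ranges (±10% scale, ±0.05 shift, noise σ = 0.01), and the 50% flip probability are assumptions chosen for illustration; the key point is that the image and mask receive the same spatial transform while only the image is intensity-perturbed.

```python
import numpy as np

def augment_pair(image, mask, rng):
    """Apply one shared spatial transform to image and mask, then perturb image intensities."""
    k = int(rng.integers(0, 4))                  # number of 90-degree turns
    image = np.rot90(image, k, axes=(1, 2))      # rotate in the H-W plane (shape-preserving only if H == W)
    mask = np.rot90(mask, k, axes=(1, 2))
    if rng.random() < 0.5:                       # flip along the depth axis
        image, mask = image[::-1], mask[::-1]
    image = image * rng.uniform(0.9, 1.1) + rng.uniform(-0.05, 0.05)   # multiplicative / additive scaling
    image = image + rng.normal(0.0, 0.01, size=image.shape)            # small Gaussian noise
    # Stack into a (2, D, H, W) array: channel 0 = CT, channel 1 = mask.
    return np.stack([image.astype(np.float32), mask.astype(np.float32)], axis=0)

rng = np.random.default_rng(0)
image = rng.random((4, 8, 8)).astype(np.float32)   # toy CT volume already scaled to [0, 1]
mask = np.zeros((4, 8, 8), dtype=np.uint8)
mask[1:3, 2:5, 2:5] = 1
sample = augment_pair(image, mask, rng)
```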

Model:

  • MONAI resnet10 with n_input_channels=2 and num_classes=1.
  • All backbone convolutional layers are frozen; only the final fully-connected head is trained, making this viable on CPU.
  • Loss: BCEWithLogitsLoss with class-frequency-based positive weight.
  • Optimiser: AdamW (lr=1e-4, weight_decay=1e-5).
  • 70/15/15 stratified split; trains for up to 50 epochs with patience=10 early stopping on validation AUC.
  • Checkpoint saved every epoch; best model saved on AUC improvement.
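
The class weighting and early-stopping logic can be sketched in plain Python. Two assumptions are made here: the positive weight is taken as the ratio of negative to positive training samples (a common choice for `BCEWithLogitsLoss(pos_weight=...)`, not confirmed by the notebook), and the `EarlyStopping` helper is a hypothetical stand-in for the training loop's bookkeeping.

```python
def positive_class_weight(labels):
    """Ratio of negatives to positives (assumed form of the class-frequency weight)."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    return n_neg / n_pos

class EarlyStopping:
    """Track the best validation AUC and stop after `patience` epochs without improvement."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_auc):
        """Return (improved, should_stop) for this epoch's validation AUC."""
        if val_auc > self.best:
            self.best = val_auc
            self.bad_epochs = 0
            return True, False       # caller would save the best-model checkpoint here
        self.bad_epochs += 1
        return False, self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=3)   # short patience so the demo stops quickly
history = []
for epoch, auc in enumerate([0.70, 0.74, 0.73, 0.74, 0.72, 0.71]):
    improved, stop = stopper.step(auc)
    history.append((epoch, improved, stop))
    if stop:
        break
```

In this toy run the AUC last improves at epoch 1, so with a patience of 3 the loop stops at epoch 4.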

Evaluation and plots:

  • Computes test AUC, saves loss curve, validation AUC curve, ROC curve, and confusion matrix as PNG files.

Grad-CAM (within the same notebook):

  • Loads the best saved checkpoint back into MONAI ResNet-10.
  • Hooks the last Conv3d layer to capture activations and gradients.
  • For each randomly selected sample, generates a 3D Grad-CAM heatmap (global-average-pooled gradient weighting) and saves a 3-panel figure: CT mid-axial slice, segmentation mask, and CT with Grad-CAM overlay.

Key outputs:

  • checkpoint_cpu_masks.pth, best_model_cpu_masks.pth
  • loss_curve.png, val_auc_curve.png, roc_curve.png, confusion_matrix.png
  • gradcam_outputs/<subject>_gradcam.png

6. Grad-CAM Visualisation (ResNet — Improved)

File: ResNet_Grad_CAM.ipynb

A standalone, improved Grad-CAM visualisation notebook for the trained MONAI ResNet-18 checkpoint, producing smooth, upsampled heatmaps with mask contour overlays.

What it does:

  • Forces CPU-only mode and disables GPU/TensorFlow to avoid CUDA conflicts in Colab.
  • Loads the MONAI ResNet-18 checkpoint (saved as state dict or dict-with-state-dict) using strict=False.
  • Implements a GradCAM3D class that:
    • Hooks model.layer4[-1].conv2 (last conv of the deepest ResNet block) for activations and gradients.
    • Computes channel weights by global-average pooling of gradients.
    • Upsamples the resulting 3D heatmap to the original input resolution using trilinear interpolation (F.interpolate).
    • Applies volumetric Gaussian smoothing (sigma=1.0) to reduce blockiness.
  • Downsamples input volumes to a maximum of 128 on any axis to manage CPU memory.
  • Resamples segmentation masks to match image shape using nearest-neighbour zoom with shape correction.
  • For each of 3 randomly selected samples, generates a 3-column figure:
    • Original CT mid-axial slice
    • CT with mask overlay (semi-transparent red)
    • CT with smooth Grad-CAM heatmap overlay (jet colourmap) + white segmentation contour drawn using skimage.measure.find_contours
  • Adds a colour bar to the heatmap panel.
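
The heatmap computation performed by a class like GradCAM3D can be illustrated framework-free, given activations and gradients already captured from the hooked layer. This sketch is an assumption-laden stand-in: it uses `scipy.ndimage.zoom` with order=1 in place of `F.interpolate`, adds the ReLU from the standard Grad-CAM formulation, and runs on synthetic arrays rather than real hook outputs.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def gradcam_3d(activations, gradients, out_shape, sigma=1.0):
    """activations, gradients: (C, d, h, w) arrays; returns a heatmap of out_shape scaled to [0, 1]."""
    weights = gradients.mean(axis=(1, 2, 3))                # global-average-pooled gradients, one per channel
    cam = np.tensordot(weights, activations, axes=(0, 0))   # weighted channel sum -> (d, h, w)
    cam = np.maximum(cam, 0.0)                              # ReLU: keep positive evidence only
    factors = [o / s for o, s in zip(out_shape, cam.shape)]
    cam = zoom(cam, factors, order=1)                       # linear upsampling to input resolution
    cam = gaussian_filter(cam, sigma=sigma)                 # volumetric smoothing to reduce blockiness
    cam = cam - cam.min()
    peak = cam.max()
    return cam / peak if peak > 0 else cam

rng = np.random.default_rng(0)
acts = rng.random((8, 4, 8, 8))    # synthetic layer activations
grads = rng.random((8, 4, 8, 8))   # synthetic gradients of the logit w.r.t. the activations
heatmap = gradcam_3d(acts, grads, out_shape=(16, 32, 32))
```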

Key outputs:

  • Multi-panel PNG per subject showing original CT, mask overlay, and upsampled Grad-CAM

7. XGBoost / LR Fusion Pipeline

File: XGB____LR_Fusion_pipeline.ipynb

The final classification and ensemble fusion stage, combining radiomics features with out-of-fold (OOF) deep ResNet features, applying scanner harmonisation, and evaluating multiple classifier variants.

What it does:

Step 1 — ROI filtering:

  • Loads the full radiomics CSV and normalises ROI names to canonical tokens (PANCREAS-1, PANCREAS-V2, PANCREATIC DUCT, NO FINDINGS).
  • Selects one row per subject by ROI priority; falls back to first occurrence if no canonical ROI is present.
  • Saves a filtered one-row-per-subject radiomics CSV.
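
A pandas sketch of selecting one row per subject by ROI priority. The priority order is assumed to follow the order in which the canonical tokens are listed above, the column names `subject_id` and `roi_name` are hypothetical, and name normalisation is reduced to stripping whitespace and upper-casing.

```python
import pandas as pd

# Assumed priority: earlier entries win.
ROI_PRIORITY = ["PANCREAS-1", "PANCREAS-V2", "PANCREATIC DUCT", "NO FINDINGS"]

def pick_one_roi_per_subject(df, subject_col="subject_id", roi_col="roi_name"):
    """Keep the highest-priority ROI row per subject; fall back to the first row otherwise."""
    rank = {name: i for i, name in enumerate(ROI_PRIORITY)}
    canonical = df[roi_col].str.strip().str.upper()
    # Non-canonical ROIs all share the lowest rank, so idxmin returns their first occurrence.
    ranks = canonical.map(rank).fillna(len(ROI_PRIORITY))
    best_rows = ranks.groupby(df[subject_col]).idxmin()
    return df.loc[best_rows].reset_index(drop=True)

radiomics = pd.DataFrame({
    "subject_id": ["S1", "S1", "S2", "S2", "S3"],
    "roi_name": ["Pancreas-V2 ", "pancreas-1", "Liver", "Spleen", "No Findings"],
})
filtered = pick_one_roi_per_subject(radiomics)
```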

Step 2 — OOF Deep Feature Extraction:

  • Loads the filtered radiomics CSV to obtain per-subject series_uid_used and labels.
  • Performs 5-fold stratified cross-validation: for each fold, trains a MONAI resnet18 from scratch (or from an external checkpoint) on the training subjects, then extracts mid-layer features (from layer3 or layer2, followed by AdaptiveAvgPool3d) for the held-out subjects.
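  • Because each subject's features come from a model that never saw that subject during training, the resulting out-of-fold deep features do not leak that subject's label.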
  • Merges OOF deep features with the radiomics CSV to create a combined fusion feature table.
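
The out-of-fold pattern itself is independent of the network. The sketch below shows it with scikit-learn's `StratifiedKFold`, using a PCA fitted on each training fold as a lightweight stand-in for "train a 3D ResNet, then pool a mid-layer". The function names and synthetic data are assumptions; only the fold structure mirrors the description.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold

def oof_features(X, y, fit_extractor, n_splits=5, seed=0):
    """Fill each subject's feature row using an extractor fitted without that subject."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    out = None
    for train_idx, held_idx in cv.split(X, y):
        extractor = fit_extractor(X[train_idx], y[train_idx])   # sees training subjects only
        feats = extractor(X[held_idx])
        if out is None:
            out = np.full((len(X), feats.shape[1]), np.nan)
        out[held_idx] = feats
    return out

def fit_pca_extractor(X_train, y_train, n_components=4):
    """Stand-in extractor: any model fitted on the training fold only would fit this slot."""
    return PCA(n_components=n_components).fit(X_train).transform

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))   # synthetic inputs, one row per subject
y = np.array([0, 1] * 25)
deep_oof = oof_features(X, y, fit_pca_extractor)
```

Every row of `deep_oof` is written exactly once, by the fold in which that subject was held out.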

Step 3 — Scanner Harmonisation:

  • Removes shape, diagnostic, and metadata columns from the feature table.
  • Applies ComBat harmonisation via neuroHarmonize using SITE (dataset origin: CPTAC-PDA vs Pancreas-CT) as the batch covariate, correcting for scanner-driven distributional differences.
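
To give a feel for what batch harmonisation does, here is a deliberately simplified per-site location-scale adjustment in NumPy. It is not ComBat and not the neuroHarmonize API: ComBat additionally uses empirical-Bayes shrinkage of the site parameters and can preserve covariate effects. The function name and synthetic data are assumptions.

```python
import numpy as np

def align_sites_location_scale(features, site):
    """Shift and scale each site's features to the pooled mean and standard deviation."""
    features = np.asarray(features, dtype=float)
    site = np.asarray(site)
    grand_mean = features.mean(axis=0)
    pooled_std = features.std(axis=0)
    out = np.empty_like(features)
    for s in np.unique(site):
        rows = site == s
        mu = features[rows].mean(axis=0)
        sd = features[rows].std(axis=0)
        sd = np.where(sd > 0, sd, 1.0)          # guard against zero-variance columns
        out[rows] = (features[rows] - mu) / sd * pooled_std + grand_mean
    return out

rng = np.random.default_rng(0)
site = np.array(["CPTAC-PDA"] * 30 + ["Pancreas-CT"] * 30)
feats = rng.normal(size=(60, 5))
feats[site == "Pancreas-CT"] += 2.0             # synthetic site offset
harmonised = align_sites_location_scale(feats, site)
```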

Step 4 — Ensemble Fusion (three classifier variants):

Three separate fusion pipelines are implemented and evaluated independently:

| Variant | Base learner |
| --- | --- |
| XGB fusion | XGBoost |
| RF fusion | Random Forest |
| LR fusion | Logistic Regression |

Each pipeline:

  • Splits data 80/20 (stratified) into train and held-out test sets.
  • Uses an inner 5-fold OOF stacking loop on the training set to train separate radiomics and deep feature pipelines (StandardScaler → PCA(95% variance) → base learner).
  • Trains a LogisticRegression meta-learner on the OOF probability outputs from both branches.
  • Evaluates four fusion strategies: radiomics-only, deep-only, average fusion, and stacked meta-learner.
  • Runs a 5-fold nested outer CV to estimate AUC on folds not used for model fitting.
  • Runs 30 repeated random splits and records AUC per repeat.
  • Computes bootstrap AUC confidence intervals (2000 resamples) and paired bootstrap significance tests comparing average fusion to each individual modality.
  • Maps SHAP values (or LR coefficients) from PCA space back to original radiomics features for interpretability; tracks feature stability across repeats.
  • Generates calibration curves, sensitivity/specificity with Clopper-Pearson CIs, confusion matrices, and ROC plots.
  • Runs a data leakage check: detects duplicate rows, zero-variance features, ID column candidates, and feature vector contradictions.
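
The two-branch stacking scheme can be sketched with scikit-learn as below. This is an illustrative reduction under stated assumptions: Logistic Regression is used as the base learner (the LR variant), the data are synthetic with an arbitrary split into "radiomics" and "deep" column blocks, and the nested CV, repeats, and bootstrap steps are omitted.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def make_branch():
    """StandardScaler -> PCA keeping 95% of the variance -> base learner."""
    return make_pipeline(StandardScaler(), PCA(n_components=0.95),
                         LogisticRegression(max_iter=1000))

# Synthetic stand-in for the fusion table: first 20 columns "radiomics", last 20 "deep".
X, y = make_classification(n_samples=200, n_features=40, n_informative=10, random_state=0)
blocks = {"radiomics": X[:, :20], "deep": X[:, 20:]}

train_idx, test_idx = train_test_split(
    np.arange(len(y)), test_size=0.2, stratify=y, random_state=0)
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold probabilities on the training set, one column per branch.
oof = np.column_stack([
    cross_val_predict(make_branch(), Xb[train_idx], y[train_idx],
                      cv=inner_cv, method="predict_proba")[:, 1]
    for Xb in blocks.values()
])
meta = LogisticRegression().fit(oof, y[train_idx])   # meta-learner on OOF probabilities

# Refit each branch on the full training set, then score the held-out test set.
test_probs = np.column_stack([
    make_branch().fit(Xb[train_idx], y[train_idx]).predict_proba(Xb[test_idx])[:, 1]
    for Xb in blocks.values()
])
scores = {
    "radiomics_only": roc_auc_score(y[test_idx], test_probs[:, 0]),
    "deep_only": roc_auc_score(y[test_idx], test_probs[:, 1]),
    "average_fusion": roc_auc_score(y[test_idx], test_probs.mean(axis=1)),
    "stacked_meta": roc_auc_score(y[test_idx], meta.predict_proba(test_probs)[:, 1]),
}
```

Training the meta-learner on out-of-fold probabilities, rather than on in-sample predictions, keeps it from learning how overfit each branch is to its own training data.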

Key outputs (per classifier variant):

  • stacked_pipelines_models.joblib
  • main_run_auc_bootstrap_ci.csv
  • nested_outer_cv_results.csv, nested_outer_cv_summary.csv
  • repeats_stacking_results.csv, repeats_stacking_summary_stats.csv
  • paired_bootstrap_avg_vs_baselines_repeats.csv
  • feature_stability_top10_counts.csv
  • radiomics_feature_importance_mapped_from_shap_main.csv
  • ROC, calibration, and confusion matrix PNG files
  • leakage_report.json
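
A percentile bootstrap is one way to produce the kind of AUC confidence interval stored in main_run_auc_bootstrap_ci.csv. The sketch below assumes that method; the function name, the seed, and the synthetic scores are illustrative, and resamples containing a single class are skipped because AUC is undefined for them.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=2000, alpha=0.05, seed=0):
    """Return (auc, lower, upper) using a percentile bootstrap over subjects."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample subjects with replacement
        if np.unique(y_true[idx]).size < 2:       # AUC needs both classes present
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lower, upper = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(y_true, y_score), float(lower), float(upper)

rng = np.random.default_rng(1)
y_true = np.array([0] * 20 + [1] * 20)
y_score = 0.3 * y_true + rng.random(40)           # synthetic, partially separable scores
auc, ci_low, ci_high = bootstrap_auc_ci(y_true, y_score)
```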

Dependencies

pip install pydicom rt_utils nibabel SimpleITK scikit-image
pip install monai torch torchvision
pip install tensorflow keras
pip install pandas numpy scipy tqdm matplotlib
pip install pyradiomics
pip install xgboost scikit-learn shap joblib
pip install neuroHarmonize neuroCombat
pip install tcia_utils altair openpyxl

All notebooks are designed to run in Google Colab with Google Drive mounted at /content/drive/MyDrive/TCIA_Data/.


Project Structure

TCIA_Data/
├── CPTAC-PDA/                        # Downloaded CPTAC-PDA DICOM series
├── CPTAC_Negative/                   # CPTAC-PDA negative assessment DICOMs
├── CPTAC-PDA_Segementation/          # RTSTRUCT annotation DICOMs
├── Healthy_Pancreas_CT/              # Pancreas-CT DICOM series
├── NIfTI_Uncompressed/               # Decompressed NIfTI label files
├── output_3d_volumes/                # Per-subject 3D .npy CT volumes
├── output_2d_slices/                 # Per-subject 2D slice .npy arrays
├── segmentation_masks/               # Generated binary mask .npy arrays
├── radiomics_output/                 # Raw radiomics feature CSVs
├── pancreas_training_with_masks/     # ResNet training outputs + checkpoints
│   └── gradcam_outputs/             # Grad-CAM visualisation PNGs
├── outputs_3dcnn_2/                  # Baseline CNN outputs + Grad-CAM visuals
├── New_XGB_fusion_results/           # XGBoost fusion artefacts
├── New_RF_fusion_results/            # Random Forest fusion artefacts
├── New_LR_fusion_results/            # Logistic Regression fusion artefacts
├── CPTAC_PDA_merged.csv              # Master case/control metadata CSV
├── merged_seg_with_paths.csv         # Segmentation paths + subject metadata
├── fusion_radiomics_filtered.csv     # One-row-per-subject filtered radiomics
└── oof_middle_layer_fusion_features.csv  # Combined radiomics + OOF deep features

Output Artefacts

| Artefact | Description |
| --- | --- |
| best_model_cpu_masks.pth | Best MONAI ResNet-10 checkpoint (by val AUC) |
| final_trained_model.keras | Final trained Keras baseline 3D CNN |
| radiomics_features.csv | Per-subject, per-ROI radiomics features |
| oof_middle_layer_fusion_features.csv | Leakage-free deep + radiomics features |
| main_run_auc_bootstrap_ci.csv | AUC with 95% CI for the four fusion strategies |
| nested_outer_cv_summary.csv | Mean ± std AUC across outer CV folds |
| feature_stability_top10_counts.csv | Radiomics feature stability across repeats |
| *_gradcam.png | Grad-CAM overlays with mask contours |
| leakage_report.json | Data integrity and leakage check summary |

💡 All notebooks were developed and executed in Google Colab (NVIDIA Tesla T4/V100, 12–24 GB VRAM).


📊 Dataset

This project uses two publicly available datasets from The Cancer Imaging Archive (TCIA):

| Dataset | Cohort | Subjects | Label | Notes |
| --- | --- | --- | --- | --- |
| CPTAC-PDA | Tumour | ~103 | PDAC-positive | Multi-vendor CT; RTSTRUCT annotations |
| Pancreas-CT | Control | 80 | PDAC-negative | NIH Clinical Center; NIfTI annotations |

CT Characteristics:

  • Pancreas-CT: Siemens/Philips MDCT at 120 kVp, 512×512 matrix, portal venous phase (~70s post-contrast), 1.5–2.5 mm slice thickness.
  • CPTAC-PDA: Variable acquisition parameters, multi-vendor, contrast-enhanced and non-contrast studies included.

Label Distribution:

  • Pancreas-CT (controls): 53 M / 27 F, mean age 46.8 ± 16.7 years.
  • CPTAC-PDA (tumour cohort): mean age ~65 years.
  • Binary classification: PDAC (1) vs. healthy pancreas (0).

🛠️ Requirements

pip install pydicom nibabel SimpleITK numpy scipy scikit-image
pip install pyradiomics neuroHarmonize
pip install monai torch torchvision
pip install scikit-learn xgboost shap
pip install matplotlib seaborn pandas

| Library | Purpose |
| --- | --- |
| pydicom, SimpleITK | DICOM loading and image processing |
| nibabel | NIfTI segmentation file handling |
| scipy, scikit-image | Resampling, rasterisation |
| PyRadiomics | Handcrafted radiomics feature extraction |
| NeuroHarmonize | ComBat scanner harmonisation |
| MONAI, PyTorch | 3D ResNet-10, volumetric deep learning |
| scikit-learn | PCA, Logistic Regression, cross-validation |
| XGBoost | Non-linear ensemble classification |
| SHAP | Radiomics explainability |
| Grad-CAM (custom) | Deep learning spatial explainability |
| matplotlib, seaborn | Visualisation |

All experiments were run on Google Colab with GPU acceleration (Tesla T4 / V100). No local GPU is required.


👩‍💻 Author

Anoushka Tomar

📧 anoushkatomar30@gmail.com
🔗 ORCID: 0009-0009-2676-0204


License

© 2026 Anoushka. Licensed under Apache 2.0.


This research uses publicly available datasets from The Cancer Imaging Archive (TCIA). We gratefully acknowledge CPTAC and the NIH for curating and maintaining these resources.

About

A leakage-free ML pipeline for early pancreatic cancer detection from CT scans — combining handcrafted radiomics (PyRadiomics, ComBat) with volumetric deep learning (3D ResNet-10, MONAI) and explainable AI (SHAP, Grad-CAM).
