Project_Pheonix_PC_XAI_2025

PDAC Classification Pipeline — Multi-Modal CT-Based Detection of Pancreatic Ductal Adenocarcinoma

This repository contains a full end-to-end pipeline for binary classification of Pancreatic Ductal Adenocarcinoma (PDAC) from CT scans, using a combination of radiomics feature extraction, 3D deep learning (MONAI ResNet), and ensemble fusion classifiers. The pipeline is implemented across seven Google Colab notebooks, each handling a distinct stage of the workflow.

Overview

The goal of this project is to classify CT scan volumes as either PDAC-positive (case) or healthy control, using a multi-stage hybrid approach:

Radiomics: Hand-crafted texture, shape, and intensity features extracted via PyRadiomics from segmented regions of interest.
3D Deep Learning: A 3D ResNet-10 backbone (MONAI) trained on dual-channel inputs (CT image + segmentation mask).
Ensemble Fusion: Probability outputs from both streams are combined via average fusion and a stacked Logistic Regression meta-learner; the base learner in each stream is XGBoost, Random Forest, or Logistic Regression.
Explainability: Grad-CAM heatmaps are generated to visualise which regions of the CT scan drive model predictions.

Dataset

The pipeline uses two publicly available collections from The Cancer Imaging Archive (TCIA):

Collection	Label	Modality
CPTAC-PDA	Case (1) — PDAC positive	CT + RTSTRUCT segmentations
Pancreas-CT	Control (0) — Healthy pancreas	CT + NIfTI segmentation labels

Data are downloaded from TCIA manifest files with the nbia module of the tcia_utils Python library and stored in Google Drive.

Pipeline Summary

TCIA DICOM Download
        ↓
DICOM → NPY Conversion (HU windowing, isotropic resampling, 3D volume + 2D slices)
        ↓
Segmentation Mask Generation (RTSTRUCT → 3D binary mask; NIfTI → numpy mask)
        ↓
 ┌──────────────────────┬────────────────────────────────┐
 │   Radiomics Branch   │      Deep Learning Branch      │
 │  PyRadiomics features│  3D ResNet-10 (MONAI)          │
 │  per ROI (RTSTRUCT / │  Input: [CT image, mask] 2-ch  │
 │  NIfTI masks)        │  OOF mid-layer feature extract │
 └──────────┬───────────┴───────────┬────────────────────┘
            │                       │
            └──────────┬────────────┘
              NeuroHarmonize (ComBat scanner harmonisation)
                        ↓
       Ensemble Fusion (XGBoost / RF / LR)
       + Stacked Meta-Learner (Logistic Regression)
                        ↓
              ROC / AUC / SHAP / Calibration

Notebooks

1. DICOM Download and Preprocessing

File: DICOM_download_and_preprocessing_from_scratch.ipynb

This notebook handles data acquisition and initial metadata management.

What it does:

Mounts Google Drive and installs required libraries (tcia_utils, mirp, altair, simpleDicomViewer).
Downloads DICOM image series from TCIA using manifest .tcia files for three download sets: CPTAC-PDA (cancer), CPTAC-PDA Negative Assessments, and Pancreas-CT (healthy controls).
Downloads the corresponding RTSTRUCT tumour annotation files for CPTAC-PDA.
Extracts patient-level metadata (patient list, study details, series reports) via nbia API calls and saves them as CSV files.
Visualises annotation metadata — scatter plots of StructureSetLabel by patient and ROI volumes by patient/timepoint using Altair.
Merges CPTAC-PDA and Pancreas-CT metadata CSVs, filters to CT-only rows, and adds a case_control_status column (1 = case, 0 = control).
Adds DICOM_folder_path columns for downstream volume loading.
Merges segmentation metadata from two sources (RTSTRUCT series metadata + annotation report CSV) and saves a combined segmentation manifest.
Uncompresses NIfTI .nii.gz label files for the Pancreas-CT collection.
Runs a basic radiomics pipeline using PyRadiomics and SimpleITK, including RTSTRUCT-to-mask rasterisation, CT series loading, and per-ROI feature extraction with a checkpoint/resume mechanism.

Key outputs:

CPTAC_PDA_merged.csv — merged case/control CT metadata
merged_seg_with_paths.csv — segmentation file paths linked to subjects
radiomics_features.csv — extracted radiomics features per ROI

2. DICOM to NPY Conversion

File: DICOM_to_NPY_conversion.ipynb

Converts raw DICOM CT series into normalised, isotropically-resampled 3D NumPy volumes ready for deep learning.

What it does:

Reads the merged metadata CSV and iterates over each subject's DICOM_folder_path.
Loads and sorts DICOM slices using pydicom, ordered by ImagePositionPatient z-coordinate, SliceLocation, or InstanceNumber.
Applies Rescale Slope and Intercept to convert pixel values to Hounsfield Units (HU).
Determines voxel spacing from PixelSpacing and SliceThickness DICOM tags (falling back to median z-diff if needed).
Resamples each volume to 1.0 × 1.0 × 1.0 mm isotropic spacing using scipy.ndimage.zoom with trilinear interpolation (order=1).
Applies HU windowing: clips to [−160, 240] HU and normalises to [0, 1].
Saves the full 3D volume as a single .npy file and also saves individual 2D axial slices as separate .npy files per subject.
Implements resumability: skips already-processed subjects and saves a progress CSV after each subject.
Processes 223 subjects in the run shown, taking approximately 3 hours 38 minutes on CPU.

Key configuration:

TARGET_SPACING = [1.0, 1.0, 1.0] mm
HU_WINDOW = [−160, 240]
Resampling order = 1 (trilinear)
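
The conversion steps above can be sketched as follows. This is a minimal illustration on a synthetic volume, not the notebook's code: the function names, the (z, y, x) axis order, and the example slope, intercept, and spacing values are assumptions; only the target spacing, the HU window, and order=1 come from the configuration listed above.

```python
import numpy as np
from scipy.ndimage import zoom

TARGET_SPACING = (1.0, 1.0, 1.0)  # mm, assumed axis order (z, y, x)
HU_WINDOW = (-160.0, 240.0)


def to_hu(raw, slope, intercept):
    """Apply Rescale Slope / Intercept to get Hounsfield Units."""
    return raw.astype(np.float32) * slope + intercept


def resample_isotropic(volume, spacing, target=TARGET_SPACING, order=1):
    """Resample a (z, y, x) volume so each voxel spans `target` mm."""
    factors = [s / t for s, t in zip(spacing, target)]
    return zoom(volume, factors, order=order)


def window_and_normalise(volume_hu, window=HU_WINDOW):
    """Clip to the HU window and rescale linearly to [0, 1]."""
    lo, hi = window
    clipped = np.clip(volume_hu, lo, hi)
    return ((clipped - lo) / (hi - lo)).astype(np.float32)


# Synthetic stand-in for a stacked DICOM series (values are made up).
raw = np.random.default_rng(0).integers(0, 2000, size=(10, 32, 32))
hu = to_hu(raw, slope=1.0, intercept=-1024.0)
iso = resample_isotropic(hu, spacing=(2.5, 0.8, 0.8))
vol = window_and_normalise(iso)
print(vol.shape, vol.dtype)
```

Windowing after resampling keeps the interpolation in HU space, so the clip bounds apply to physically meaningful values.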

Key outputs:

output_3d_volumes/<subject_id>.npy — one 3D float32 volume per subject
output_2d_slices/<subject_id>/slice_NNN.npy — per-slice 2D arrays
Updated CSV with path_image_3d_npy and path_slices_dir_npy columns

3. Segmentation Mask Generation

File: Segmentation_masks_generation.ipynb

Converts RTSTRUCT DICOM files and NIfTI label files into 3D binary NumPy mask arrays aligned to the CT volumes.

What it does:

Reads merged_seg_with_paths.csv and filters rows where Annotation Type == "Segmentation".
For NIfTI inputs: loads via nibabel, thresholds float arrays at > 0, and returns a uint8 binary mask.
For RTSTRUCT inputs:
- Resolves the RTSTRUCT .dcm file (either direct path or by scanning a folder for Modality == RTSTRUCT).
- Loads the referenced CT DICOM series and sorts slices by ImagePositionPatient z.
- Rasterises contour polygons from ROIContourSequence into a 3D binary mask by mapping patient (x, y, z) coordinates to pixel indices using the CT's ImagePositionPatient, ImageOrientationPatient, and PixelSpacing DICOM tags, then using skimage.draw.polygon.
- Matches each contour's median z-coordinate to the nearest CT slice (within a 1.0 mm tolerance).
Implements a search utility to locate a DICOM series folder by SeriesInstanceUID, scanning base directories with both fast name-match and full DICOM-header fallback.
Resumable: updates and saves the CSV after each successfully generated mask. Rows with errors are skipped and can be retried.
Masks are saved as .npy files and the mask_path column is populated in the CSV.
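
The coordinate mapping and slice matching described above can be sketched with NumPy alone. The helper names are hypothetical and the example geometry is made up; the formula follows the DICOM convention that the first three ImageOrientationPatient values point along increasing column index and the last three along increasing row index. In the notebook the resulting row/column indices would then be passed to skimage.draw.polygon, which is not called here.

```python
import numpy as np


def patient_to_pixel(points_xyz, ipp, iop, pixel_spacing):
    """Map patient-space (x, y, z) points in mm to (row, col) pixel indices."""
    points = np.asarray(points_xyz, dtype=float).reshape(-1, 3)
    origin = np.asarray(ipp, dtype=float)
    row_dir = np.asarray(iop[:3], dtype=float)  # along increasing column index
    col_dir = np.asarray(iop[3:], dtype=float)  # along increasing row index
    row_spacing, col_spacing = (float(v) for v in pixel_spacing)
    offset = points - origin
    cols = offset @ row_dir / col_spacing
    rows = offset @ col_dir / row_spacing
    return np.rint(rows).astype(int), np.rint(cols).astype(int)


def nearest_slice(contour_z, slice_z, tol_mm=1.0):
    """Index of the CT slice closest to the contour's median z, or None."""
    z = float(np.median(contour_z))
    diffs = np.abs(np.asarray(slice_z, dtype=float) - z)
    idx = int(np.argmin(diffs))
    return idx if diffs[idx] <= tol_mm else None


# Made-up axial geometry: 0.5 mm pixels, slices every 2.5 mm.
contour = [(-90.0, -80.0, 52.4), (-85.0, -80.0, 52.4), (-85.0, -75.0, 52.4)]
rows, cols = patient_to_pixel(
    contour, ipp=(-100.0, -100.0, 50.0), iop=(1, 0, 0, 0, 1, 0), pixel_spacing=(0.5, 0.5)
)
slice_index = nearest_slice([p[2] for p in contour], slice_z=[50.0, 52.5, 55.0])
print(rows, cols, slice_index)
```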

Key outputs:

segmentation_masks/<subject_id>_<series_id>.npy — 3D binary uint8 mask arrays
Updated merged_seg_with_paths.csv with mask_path column

4. Baseline 3D CNN Model

File: Baseline_model_final.ipynb

A TensorFlow/Keras-based 3D CNN trained as a baseline classifier, with optional dual-channel (CT + mask) input and Grad-CAM visualisation.

What it does:

Training:

Reads the merged CSV and dynamically determines whether segmentation masks are available; sets CHANNELS = 2 (image + mask) if masks are found on disk, otherwise CHANNELS = 1.
Resizes volumes to 128 × 128 × 128 using scipy zoom (order=1 for image, order=0 for mask), then normalises image to [0, 1].
Applies stratified 70/15/15 train/val/test split.
Builds a small 3D CNN: three Conv3D–BatchNorm–MaxPool3D blocks (16→32→64 filters), GlobalAveragePooling3D, Dense(128) + Dropout(0.4), sigmoid output.
Trains with binary_crossentropy, Adam, balanced class weights, and callbacks for model checkpointing (best val AUC), ReduceLROnPlateau, EarlyStopping (patience=8), and a custom HistorySaver that appends epoch metrics to a CSV after every epoch.
Supports resume from checkpoint: reads the saved model and history CSV to continue training from the last completed epoch.
Saves final model as .keras, test predictions, and training history plot (Loss + AUC vs Epoch).
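
A stratified 70/15/15 split with balanced class weights might look like the sketch below. It uses scikit-learn, which is an assumption about how the notebook does it; the label counts and random seed are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Made-up cohort: 60 cases, 40 controls.
labels = np.array([1] * 60 + [0] * 40)
subject_ids = np.arange(len(labels))

# First carve off 30 %, then halve it into validation and test.
train_ids, rest_ids, y_train, y_rest = train_test_split(
    subject_ids, labels, test_size=0.30, stratify=labels, random_state=42
)
val_ids, test_ids, y_val, y_test = train_test_split(
    rest_ids, y_rest, test_size=0.50, stratify=y_rest, random_state=42
)

# "balanced" weights are n_samples / (n_classes * class_count).
classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}
print(len(train_ids), len(val_ids), len(test_ids), class_weight)
```

A dictionary of this form is what Keras accepts as the class_weight argument of model.fit.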

Grad-CAM:

Two versions are implemented. The first loads a single-channel model and generates Grad-CAM using GradientTape on the last Conv3D layer; the second handles the dual-channel model with both CT and mask channels.
Upsamples the 3D heatmap back to the original volume resolution, overlays it on 6 axial slices, and saves the result as a PNG.

Key outputs:

best_model.keras / final_trained_model.keras
roc_curve.png, training_history_plot.png
test_predictions.csv
gradcam_visuals/gradcam_<subject>.png

5. 3D ResNet with Segmentation Masks

File: 3D_ResNet.ipynb

A PyTorch/MONAI-based 3D ResNet-10 classifier that takes dual-channel (CT + mask) input and is trained on CPU with checkpointing and early stopping.

What it does:

Dataset:

Loads image .npy and mask .npy for each subject. If a mask is unavailable, a zero-tensor is used as the mask channel.
Clips CT to [−1000, 1000] HU, scales to [0, 1], and resizes to 64 × 128 × 128 (D × H × W) using scipy.ndimage.zoom.
Constructs a 2-channel tensor [image, mask] of shape (2, D, H, W).
Training augmentations: joint random 90° rotations and depth-axis flips on image and mask, random multiplicative/additive intensity scaling, and small Gaussian noise injection.
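
The key point of the augmentation step is that geometric transforms are applied identically to image and mask, while intensity changes touch the image only. A minimal NumPy sketch follows; the function name and the scaling, shift, and noise ranges are assumptions, not the notebook's values.

```python
import numpy as np


def augment_pair(image, mask, rng):
    """Jointly augment a (D, H, W) image and its mask; assumes H == W."""
    # Same random 90-degree rotation in the H-W plane for both arrays.
    k = int(rng.integers(0, 4))
    image = np.rot90(image, k=k, axes=(1, 2))
    mask = np.rot90(mask, k=k, axes=(1, 2))
    # Same depth-axis flip for both arrays.
    if rng.random() < 0.5:
        image = image[::-1]
        mask = mask[::-1]
    # Intensity changes apply to the image only (ranges are assumptions).
    image = image * rng.uniform(0.9, 1.1) + rng.uniform(-0.05, 0.05)
    image = image + rng.normal(0.0, 0.01, size=image.shape)
    image = np.clip(image, 0.0, 1.0).astype(np.float32)
    return image, np.ascontiguousarray(mask)


rng = np.random.default_rng(0)
mask = np.zeros((4, 8, 8), dtype=np.uint8)
mask[1:3, 2:5, 3:6] = 1
image = mask.astype(np.float32)  # toy image that coincides with the mask
aug_image, aug_mask = augment_pair(image, mask, rng)
x = np.stack([aug_image, aug_mask.astype(np.float32)])  # shape (2, D, H, W)
print(x.shape)
```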

Model:

MONAI resnet10 with n_input_channels=2 and num_classes=1.
All backbone convolutional layers are frozen; only the final fully-connected head is trained, making this viable on CPU.
Loss: BCEWithLogitsLoss with class-frequency-based positive weight.
Optimiser: AdamW (lr=1e-4, weight_decay=1e-5).
70/15/15 stratified split; trains for up to 50 epochs with patience=10 early stopping on validation AUC.
Checkpoint saved every epoch; best model saved on AUC improvement.
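
Two pieces of the training setup above are easy to show without PyTorch: the positive-class weight and the early-stopping rule. Both helpers are hypothetical. The weight is taken here as n_negative / n_positive, a common choice for the pos_weight argument of BCEWithLogitsLoss; the source does not state the exact formula.

```python
def positive_class_weight(labels):
    """n_negative / n_positive (assumed formula for the positive weight)."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    return n_neg / n_pos


class EarlyStopping:
    """Stop when validation AUC has not improved for `patience` epochs."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_auc):
        """Return (improved, should_stop) for this epoch's validation AUC."""
        if val_auc > self.best:
            self.best = val_auc
            self.bad_epochs = 0
            return True, False  # caller would save the best model here
        self.bad_epochs += 1
        return False, self.bad_epochs >= self.patience


# Toy run with patience=2 and invented AUC values.
stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, val_auc in enumerate([0.70, 0.75, 0.74, 0.73, 0.80]):
    improved, stop = stopper.step(val_auc)
    if stop:
        stopped_at = epoch
        break
print(stopped_at, stopper.best, positive_class_weight([1, 1, 0, 0, 0, 0]))
```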

Evaluation and plots:

Computes test AUC, saves loss curve, validation AUC curve, ROC curve, and confusion matrix as PNG files.

Grad-CAM (within the same notebook):

Loads the best saved checkpoint back into MONAI ResNet-10.
Hooks the last Conv3d layer to capture activations and gradients.
For each randomly selected sample, generates a 3D Grad-CAM heatmap (global-average-pooled gradient weighting) and saves a 3-panel figure: CT mid-axial slice, segmentation mask, and CT with Grad-CAM overlay.

Key outputs:

checkpoint_cpu_masks.pth, best_model_cpu_masks.pth
loss_curve.png, val_auc_curve.png, roc_curve.png, confusion_matrix.png
gradcam_outputs/<subject>_gradcam.png

6. Grad-CAM Visualisation (ResNet — Improved)

File: ResNet_Grad_CAM.ipynb

A standalone, improved Grad-CAM visualisation notebook for the trained MONAI ResNet-18 checkpoint, producing smooth, upsampled heatmaps with mask contour overlays.

What it does:

Forces CPU-only mode and disables GPU/TensorFlow to avoid CUDA conflicts in Colab.
Loads the MONAI ResNet-18 checkpoint (saved as state dict or dict-with-state-dict) using strict=False.
Implements a GradCAM3D class that:
- Hooks model.layer4[-1].conv2 (last conv of the deepest ResNet block) for activations and gradients.
- Computes channel weights by global-average pooling of gradients.
- Upsamples the resulting 3D heatmap to the original input resolution using trilinear interpolation (F.interpolate).
- Applies volumetric Gaussian smoothing (sigma=1.0) to reduce blockiness.
Downsamples input volumes to a maximum of 128 on any axis to manage CPU memory.
Resamples segmentation masks to match image shape using nearest-neighbour zoom with shape correction.
For each of 3 randomly selected samples, generates a 3-column figure:
- Original CT mid-axial slice
- CT with mask overlay (semi-transparent red)
- CT with smooth Grad-CAM heatmap overlay (jet colourmap) + white segmentation contour drawn using skimage.measure.find_contours
Adds a colour bar to the heatmap panel.
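
The core Grad-CAM computation behind the GradCAM3D class can be sketched in NumPy, given activations and gradients already captured by the hooks. The function name and the toy tensors are assumptions; upsampling and Gaussian smoothing are left out.

```python
import numpy as np


def grad_cam_3d(activations, gradients, eps=1e-8):
    """Grad-CAM heatmap from (C, D, H, W) activations and gradients."""
    # Channel weights: global average of the gradients over D, H, W.
    weights = gradients.mean(axis=(1, 2, 3))
    # Weighted sum of activation maps, then ReLU to keep positive evidence.
    cam = np.tensordot(weights, activations, axes=([0], [0]))
    cam = np.maximum(cam, 0.0)
    # Rescale to [0, 1] for display.
    cam = cam - cam.min()
    return cam / (cam.max() + eps)


# Toy case: channel 0 supports the class, channel 1 opposes it.
acts = np.zeros((2, 2, 2, 2))
acts[0, 1, 1, 1] = 1.0
acts[1, 0, 0, 0] = 1.0
grads = np.stack([np.ones((2, 2, 2)), -np.ones((2, 2, 2))])
cam = grad_cam_3d(acts, grads)
hot_voxel = np.unravel_index(np.argmax(cam), cam.shape)
print(hot_voxel)
```

The ReLU is why only the voxel activated by the positively weighted channel survives in the toy example.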

Key outputs:

Multi-panel PNG per subject showing original CT, mask overlay, and upsampled Grad-CAM

7. XGBoost / LR Fusion Pipeline

File: XGB____LR_Fusion_pipeline.ipynb

The final classification and ensemble fusion stage, combining radiomics features with out-of-fold (OOF) deep ResNet features, applying scanner harmonisation, and evaluating multiple classifier variants.

What it does:

Step 1 — ROI filtering:

Loads the full radiomics CSV and normalises ROI names to canonical tokens (PANCREAS-1, PANCREAS-V2, PANCREATIC DUCT, NO FINDINGS).
Selects one row per subject by ROI priority; falls back to first occurrence if no canonical ROI is present.
Saves a filtered one-row-per-subject radiomics CSV.
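
The selection rule in Step 1 can be illustrated in plain Python. The priority order (taken from the order the tokens are listed above), the normalisation rule, and the column names subject_id and roi_name are all assumptions for this sketch.

```python
ROI_PRIORITY = ["PANCREAS-1", "PANCREAS-V2", "PANCREATIC DUCT", "NO FINDINGS"]


def normalise_roi(name):
    """Upper-case, turn underscores into spaces, collapse whitespace."""
    return " ".join(str(name).upper().replace("_", " ").split())


def select_one_row_per_subject(rows):
    """Keep the highest-priority ROI row per subject, else the first row."""
    rank = {roi: i for i, roi in enumerate(ROI_PRIORITY)}
    best = {}
    for position, row in enumerate(rows):
        roi = normalise_roi(row["roi_name"])
        # Non-canonical ROIs share the worst rank; position breaks ties.
        key = (rank.get(roi, len(rank)), position)
        subject = row["subject_id"]
        if subject not in best or key < best[subject][0]:
            best[subject] = (key, row)
    return [row for _, row in best.values()]


rows = [
    {"subject_id": "A", "roi_name": "Pancreatic_Duct"},
    {"subject_id": "A", "roi_name": " pancreas-1"},
    {"subject_id": "B", "roi_name": "lesion"},
    {"subject_id": "B", "roi_name": "other"},
]
selected = select_one_row_per_subject(rows)
print(selected)
```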

Step 2 — OOF Deep Feature Extraction:

Loads the filtered radiomics CSV to obtain per-subject series_uid_used and labels.
Performs 5-fold stratified cross-validation: for each fold, trains a MONAI resnet18 from scratch (or from an external checkpoint) on the training subjects, then extracts mid-layer features (from layer3 or layer2, followed by AdaptiveAvgPool3d) for the held-out subjects.
This produces out-of-fold deep features that are free of label leakage: each subject's features come from a model that never saw that subject during training.
Merges OOF deep features with the radiomics CSV to create a combined fusion feature table.
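
The out-of-fold pattern in Step 2 is sketched below with NumPy. The fit_extractor function is a hypothetical stand-in for training the 3D ResNet on one fold (here it only learns a mean), and the fold assignment is a simple per-class round-robin rather than the notebook's splitter.

```python
import numpy as np


def stratified_folds(labels, n_splits=5, seed=0):
    """Assign each sample a fold index, balancing classes across folds."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    fold_of = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        fold_of[idx] = np.arange(len(idx)) % n_splits
    return fold_of


def fit_extractor(x_train, y_train):
    """Hypothetical stand-in for training the network on one fold."""
    mean = x_train.mean(axis=0)
    return lambda x: x - mean


def oof_features(x, y, n_splits=5, seed=0):
    """Features for each sample come from a model that never saw it."""
    fold_of = stratified_folds(y, n_splits, seed)
    out = np.full_like(x, np.nan, dtype=float)
    for fold in range(n_splits):
        held_out = fold_of == fold
        extractor = fit_extractor(x[~held_out], y[~held_out])
        out[held_out] = extractor(x[held_out])
    return out, fold_of


x = np.random.default_rng(1).normal(size=(50, 8))
y = np.array([1] * 30 + [0] * 20)
feats, fold_of = oof_features(x, y)
print(feats.shape, np.bincount(fold_of))
```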

Step 3 — Scanner Harmonisation:

Removes shape, diagnostic, and metadata columns from the feature table.
Applies ComBat harmonisation via neuroHarmonize using SITE (dataset origin: CPTAC-PDA vs Pancreas-CT) as the batch covariate, correcting for scanner-driven distributional differences.

Step 4 — Ensemble Fusion (three classifier variants):

Three separate fusion pipelines are implemented and evaluated independently:

Variant	Base learner
XGB fusion	XGBoost
RF fusion	Random Forest
LR fusion	Logistic Regression

Each pipeline:

Splits data 80/20 (stratified) into train and held-out test sets.
Uses an inner 5-fold OOF stacking loop on the training set to train separate radiomics and deep feature pipelines (StandardScaler → PCA(95% variance) → base learner).
Trains a LogisticRegression meta-learner on the OOF probability outputs from both branches.
Evaluates four fusion strategies: radiomics-only, deep-only, average fusion, and stacked meta-learner.
Runs 5-fold nested outer CV so that AUC is estimated on folds that were not used to fit the base learners or the meta-learner.
Runs 30 repeated random splits and records AUC per repeat.
Computes bootstrap AUC confidence intervals (2000 resamples) and paired bootstrap significance tests comparing average fusion to each individual modality.
Maps SHAP values (or LR coefficients) from PCA space back to original radiomics features for interpretability; tracks feature stability across repeats.
Generates calibration curves, sensitivity/specificity with Clopper-Pearson CIs, confusion matrices, and ROC plots.
Runs a data leakage check: detects duplicate rows, zero-variance features, ID column candidates, and feature vector contradictions.
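
A compact sketch of the stacking scheme for the LR variant is shown below, using scikit-learn on synthetic features. The feature dimensions, effect sizes, and seeds are invented, and cross_val_predict is one way to obtain the inner OOF probabilities, not necessarily the notebook's.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the radiomics and deep feature tables.
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
x_rad = rng.normal(size=(n, 20)) + y[:, None] * 0.8
x_deep = rng.normal(size=(n, 30)) + y[:, None] * 0.5
branches = (x_rad, x_deep)

tr, te = train_test_split(np.arange(n), test_size=0.2, stratify=y, random_state=0)


def branch():
    """StandardScaler -> PCA (95 % variance) -> base learner."""
    return make_pipeline(
        StandardScaler(),
        PCA(n_components=0.95, svd_solver="full"),
        LogisticRegression(max_iter=1000),
    )


# Inner 5-fold OOF probabilities on the training set, one column per branch.
inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
oof = np.column_stack([
    cross_val_predict(branch(), x[tr], y[tr], cv=inner, method="predict_proba")[:, 1]
    for x in branches
])
meta = LogisticRegression().fit(oof, y[tr])  # stacked meta-learner

# Refit each branch on the full training set, then score the held-out test set.
models = [branch().fit(x[tr], y[tr]) for x in branches]
p_test = np.column_stack([m.predict_proba(x[te])[:, 1] for m, x in zip(models, branches)])
scores = {
    "radiomics": roc_auc_score(y[te], p_test[:, 0]),
    "deep": roc_auc_score(y[te], p_test[:, 1]),
    "average": roc_auc_score(y[te], p_test.mean(axis=1)),
    "stacked": roc_auc_score(y[te], meta.predict_proba(p_test)[:, 1]),
}
print(scores)
```

Training the meta-learner on OOF probabilities rather than in-sample ones keeps it from rewarding whichever branch overfits the training set most.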

Key outputs (per classifier variant):

stacked_pipelines_models.joblib
main_run_auc_bootstrap_ci.csv
nested_outer_cv_results.csv, nested_outer_cv_summary.csv
repeats_stacking_results.csv, repeats_stacking_summary_stats.csv
paired_bootstrap_avg_vs_baselines_repeats.csv
feature_stability_top10_counts.csv
radiomics_feature_importance_mapped_from_shap_main.csv
ROC, calibration, and confusion matrix PNG files
leakage_report.json
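
The bootstrap AUC confidence interval reported in main_run_auc_bootstrap_ci.csv could be computed along the lines below. This sketch assumes a percentile bootstrap and skips resamples that contain only one class; the source does not say which variant the notebook uses, and the labels and scores are invented.

```python
import numpy as np


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())


def bootstrap_auc_ci(y_true, scores, n_boot=2000, alpha=0.05, seed=0):
    """Point AUC plus a percentile bootstrap (1 - alpha) interval."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    stats = []
    while len(stats) < n_boot:
        idx = rng.integers(0, len(y_true), size=len(y_true))
        if y_true[idx].min() == y_true[idx].max():
            continue  # resample has a single class, so AUC is undefined
        stats.append(auc(y_true[idx], scores[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return auc(y_true, scores), float(lo), float(hi)


y_demo = [0, 0, 1, 1, 0, 1, 1, 0]
s_demo = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.5]
point, lo, hi = bootstrap_auc_ci(y_demo, s_demo)
print(point, lo, hi)
```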

Dependencies

pip install pydicom rt_utils nibabel SimpleITK scikit-image
pip install monai torch torchvision
pip install tensorflow keras
pip install pandas numpy scipy tqdm matplotlib
pip install pyradiomics
pip install xgboost scikit-learn shap joblib
pip install neuroHarmonize neuroCombat
pip install tcia_utils altair openpyxl

All notebooks are designed to run in Google Colab with Google Drive mounted at /content/drive/MyDrive/TCIA_Data/.

Project Structure

TCIA_Data/
├── CPTAC-PDA/                        # Downloaded CPTAC-PDA DICOM series
├── CPTAC_Negative/                   # CPTAC-PDA negative assessment DICOMs
├── CPTAC-PDA_Segementation/          # RTSTRUCT annotation DICOMs
├── Healthy_Pancreas_CT/              # Pancreas-CT DICOM series
├── NIfTI_Uncompressed/               # Decompressed NIfTI label files
├── output_3d_volumes/                # Per-subject 3D .npy CT volumes
├── output_2d_slices/                 # Per-subject 2D slice .npy arrays
├── segmentation_masks/               # Generated binary mask .npy arrays
├── radiomics_output/                 # Raw radiomics feature CSVs
├── pancreas_training_with_masks/     # ResNet training outputs + checkpoints
│   └── gradcam_outputs/             # Grad-CAM visualisation PNGs
├── outputs_3dcnn_2/                  # Baseline CNN outputs + Grad-CAM visuals
├── New_XGB_fusion_results/           # XGBoost fusion artefacts
├── New_RF_fusion_results/            # Random Forest fusion artefacts
├── New_LR_fusion_results/            # Logistic Regression fusion artefacts
├── CPTAC_PDA_merged.csv              # Master case/control metadata CSV
├── merged_seg_with_paths.csv         # Segmentation paths + subject metadata
├── fusion_radiomics_filtered.csv     # One-row-per-subject filtered radiomics
└── oof_middle_layer_fusion_features.csv  # Combined radiomics + OOF deep features

Output Artefacts

Artefact	Description
`best_model_cpu_masks.pth`	Best MONAI ResNet-10 checkpoint (by val AUC)
`final_trained_model.keras`	Final trained Keras baseline 3D CNN (the best-val-AUC checkpoint is `best_model.keras`)
`radiomics_features.csv`	Per-subject per-ROI radiomics features
`oof_middle_layer_fusion_features.csv`	Leakage-free deep + radiomics features
`main_run_auc_bootstrap_ci.csv`	AUC with 95% CI for all four fusion models
`nested_outer_cv_summary.csv`	Mean ± std AUC across outer CV folds
`feature_stability_top10_counts.csv`	Radiomics feature stability across repeats
`*_gradcam.png`	Grad-CAM overlays with mask contours
`leakage_report.json`	Data integrity and leakage check summary

💡 All notebooks were developed and executed in Google Colab (NVIDIA Tesla T4/V100, 12–24 GB VRAM).

📊 Dataset

This project uses two publicly available datasets from The Cancer Imaging Archive (TCIA):

Dataset	Cohort	Subjects	Label	Notes
CPTAC-PDA	Tumour	~103 subjects	PDAC-positive	Multi-vendor CT; RTSTRUCT annotations
Pancreas-CT	Control	80 subjects	PDAC-negative	NIH Clinical Center; NIfTI annotations

CT Characteristics:

Pancreas-CT: Siemens/Philips MDCT at 120 kVp, 512×512 matrix, portal venous phase (~70s post-contrast), 1.5–2.5 mm slice thickness.
CPTAC-PDA: Variable acquisition parameters, multi-vendor, contrast-enhanced and non-contrast studies included.

Demographics and labels:

Pancreas-CT (controls): 53M / 27F, mean age 46.8 ± 16.7 years.
CPTAC-PDA (tumour cohort): mean age ~65 years.
Binary classification: PDAC (1) vs. healthy pancreas (0).

🛠️ Requirements

pip install pydicom nibabel SimpleITK numpy scipy scikit-image
pip install pyradiomics neuroHarmonize
pip install monai torch torchvision
pip install scikit-learn xgboost shap
pip install matplotlib seaborn pandas

Library	Purpose
`pydicom`, `SimpleITK`	DICOM loading and image processing
`nibabel`	NIfTI segmentation file handling
`scipy`, `scikit-image`	Resampling, rasterisation
`PyRadiomics`	Handcrafted radiomics feature extraction
`NeuroHarmonize`	ComBat scanner harmonisation
`MONAI`, `PyTorch`	3D ResNet-10, volumetric deep learning
`scikit-learn`	PCA, Logistic Regression, cross-validation
`XGBoost`	Non-linear ensemble classification
`SHAP`	Radiomics explainability
`Grad-CAM` (custom)	Deep learning spatial explainability
`matplotlib`, `seaborn`	Visualisation

All experiments were run on Google Colab with GPU acceleration (Tesla T4 / V100). No local GPU is required.

👩‍💻 Author

Anoushka Tomar

📧 anoushkatomar30@gmail.com
🔗 ORCID: 0009-0009-2676-0204

License

This research uses publicly available datasets from The Cancer Imaging Archive (TCIA). We gratefully acknowledge CPTAC and the NIH for curating and maintaining these resources.
