This repository contains a full end-to-end pipeline for binary classification of Pancreatic Ductal Adenocarcinoma (PDAC) from CT scans, using a combination of radiomics feature extraction, 3D deep learning (MONAI ResNet), and ensemble fusion classifiers. The pipeline is implemented across seven Google Colab notebooks, each handling a distinct stage of the workflow.
The goal of this project is to classify CT scan volumes as either PDAC-positive (case) or healthy control, using a multi-stage hybrid approach:
- Radiomics: Hand-crafted texture, shape, and intensity features extracted via PyRadiomics from segmented regions of interest.
- 3D Deep Learning: A 3D ResNet-10 backbone (MONAI) trained on dual-channel inputs (CT image + segmentation mask).
- Ensemble Fusion: Probability outputs from both streams are combined via average fusion and stacked meta-learners (XGBoost, Random Forest, Logistic Regression).
- Explainability: Grad-CAM heatmaps are generated to visualise which regions of the CT scan drive model predictions.
The pipeline uses two publicly available collections from The Cancer Imaging Archive (TCIA):
| Collection | Label | Modality |
|---|---|---|
| CPTAC-PDA | Case (1) — PDAC positive | CT + RTSTRUCT segmentations |
| Pancreas-CT | Control (0) — Healthy pancreas | CT + NIfTI segmentation labels |
Data is downloaded using TCIA manifest files via the tcia_utils / nbia Python library, and stored in Google Drive.
TCIA DICOM Download
↓
DICOM → NPY Conversion (HU windowing, isotropic resampling, 3D volume + 2D slices)
↓
Segmentation Mask Generation (RTSTRUCT → 3D binary mask; NIfTI → numpy mask)
↓
┌──────────────────────┬────────────────────────────────┐
│  Radiomics Branch    │  Deep Learning Branch          │
│  PyRadiomics features│  3D ResNet-10 (MONAI)          │
│  per ROI (RTSTRUCT / │  Input: [CT image, mask] 2-ch  │
│  NIfTI masks)        │  OOF mid-layer feature extract │
└──────────┬───────────┴───────────┬────────────────────┘
           │                       │
           └──────────┬────────────┘
NeuroHarmonize (ComBat scanner harmonisation)
↓
Ensemble Fusion (XGBoost / RF / LR)
+ Stacked Meta-Learner (Logistic Regression)
↓
ROC / AUC / SHAP / Calibration
File: DICOM_download_and_preprocessing_from_scratch.ipynb
This notebook handles data acquisition and initial metadata management.
What it does:
- Mounts Google Drive and installs required libraries (`tcia_utils`, `mirp`, `altair`, `simpleDicomViewer`).
- Downloads DICOM image series from TCIA using manifest `.tcia` files for three collections: `CPTAC-PDA` (cancer), `CPTAC-PDA Negative Assessments`, and `Pancreas-CT` (healthy controls).
- Downloads the corresponding RTSTRUCT tumour annotation files for CPTAC-PDA.
- Extracts patient-level metadata (patient list, study details, series reports) via `nbia` API calls and saves them as CSV files.
- Visualises annotation metadata: scatter plots of `StructureSetLabel` by patient and ROI volumes by patient/timepoint using Altair.
- Merges CPTAC-PDA and Pancreas-CT metadata CSVs, filters to CT-only rows, and adds a `case_control_status` column (1 = case, 0 = control).
- Adds `DICOM_folder_path` columns for downstream volume loading.
- Merges segmentation metadata from two sources (`RTSTRUCT` series metadata + annotation report CSV) and saves a combined segmentation manifest.
- Uncompresses NIfTI `.nii.gz` label files for the Pancreas-CT collection.
- Runs a basic radiomics pipeline using `PyRadiomics` and `SimpleITK`, including RTSTRUCT-to-mask rasterisation, CT series loading, and per-ROI feature extraction with a checkpoint/resume mechanism.
Key outputs:
- `CPTAC_PDA_merged.csv`: merged case/control CT metadata
- `merged_seg_with_paths.csv`: segmentation file paths linked to subjects
- `radiomics_features.csv`: extracted radiomics features per ROI
File: DICOM_to_NPY_conversion.ipynb
Converts raw DICOM CT series into normalised, isotropically-resampled 3D NumPy volumes ready for deep learning.
What it does:
- Reads the merged metadata CSV and iterates over each subject's `DICOM_folder_path`.
- Loads and sorts DICOM slices using `pydicom`, ordered by `ImagePositionPatient` z-coordinate, `SliceLocation`, or `InstanceNumber`.
- Applies Rescale Slope and Intercept to convert pixel values to Hounsfield Units (HU).
- Determines voxel spacing from `PixelSpacing` and `SliceThickness` DICOM tags (falling back to median z-diff if needed).
- Resamples each volume to 1.0 × 1.0 × 1.0 mm isotropic spacing using `scipy.ndimage.zoom` with trilinear interpolation (`order=1`).
- Applies HU windowing: clips to `[−160, 240]` HU and normalises to `[0, 1]`.
- Saves the full 3D volume as a single `.npy` file and also saves individual 2D axial slices as separate `.npy` files per subject.
- Implements resumability: skips already-processed subjects and saves a progress CSV after each subject.
- Processes 223 subjects in the run shown, taking approximately 3 hours 38 minutes on CPU.
Key configuration:
TARGET_SPACING = [1.0, 1.0, 1.0] mm
HU_WINDOW = [−160, 240]
Resampling order = 1 (trilinear)
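The conversion steps and configuration above can be sketched roughly as follows. This is a minimal illustration on a synthetic array, not the notebook's code: the function names, the `(z, y, x)` axis order, and the example spacing are assumptions.

```python
import numpy as np
from scipy.ndimage import zoom

TARGET_SPACING = (1.0, 1.0, 1.0)   # mm, assumed (z, y, x) order
HU_WINDOW = (-160.0, 240.0)        # clip range in Hounsfield Units


def to_hu(raw, slope, intercept):
    """Apply the DICOM rescale: HU = raw * RescaleSlope + RescaleIntercept."""
    return raw.astype(np.float32) * np.float32(slope) + np.float32(intercept)


def resample_isotropic(volume, spacing, target=TARGET_SPACING):
    """Resample a volume to the target spacing with linear interpolation (order=1)."""
    factors = [s / t for s, t in zip(spacing, target)]
    return zoom(volume, factors, order=1)


def window_normalise(volume_hu, window=HU_WINDOW):
    """Clip to the HU window and rescale linearly to [0, 1]."""
    lo, hi = window
    clipped = np.clip(volume_hu, lo, hi)
    return ((clipped - lo) / (hi - lo)).astype(np.float32)


# Synthetic stand-in for a sorted DICOM stack: 4 slices of 8 x 8 pixels at 0 HU.
raw = np.full((4, 8, 8), 1024, dtype=np.int16)
hu = to_hu(raw, slope=1.0, intercept=-1024.0)
iso = resample_isotropic(hu, spacing=(2.5, 0.5, 0.5))  # hypothetical spacing in mm
norm = window_normalise(iso)
print(iso.shape, float(norm.mean()))
```

Windowing after resampling keeps interpolation in the original HU range, which matches the order of the steps listed above.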
Key outputs:
- `output_3d_volumes/<subject_id>.npy`: one 3D float32 volume per subject
- `output_2d_slices/<subject_id>/slice_NNN.npy`: per-slice 2D arrays
- Updated CSV with `path_image_3d_npy` and `path_slices_dir_npy` columns
File: Segmentation_masks_generation.ipynb
Converts RTSTRUCT DICOM files and NIfTI label files into 3D binary NumPy mask arrays aligned to the CT volumes.
What it does:
- Reads `merged_seg_with_paths.csv` and filters rows where `Annotation Type == "Segmentation"`.
- For NIfTI inputs: loads via `nibabel`, thresholds float arrays at > 0, and returns a uint8 binary mask.
- For RTSTRUCT inputs:
  - Resolves the RTSTRUCT `.dcm` file (either direct path or by scanning a folder for `Modality == RTSTRUCT`).
  - Loads the referenced CT DICOM series and sorts slices by `ImagePositionPatient` z.
  - Rasterises contour polygons from `ROIContourSequence` into a 3D binary mask by mapping patient (x, y, z) coordinates to pixel indices using the CT's `ImagePositionPatient`, `ImageOrientationPatient`, and `PixelSpacing` DICOM tags, then using `skimage.draw.polygon`.
  - Matches each contour's median z-coordinate to the nearest CT slice (within a 1.0 mm tolerance).
- Implements a search utility to locate a DICOM series folder by `SeriesInstanceUID`, scanning base directories with both fast name-match and full DICOM-header fallback.
- Resumable: updates and saves the CSV after each successfully generated mask. Rows with errors are skipped and can be retried.
- Masks are saved as `.npy` files and the `mask_path` column is populated in the CSV.
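The coordinate mapping used for RTSTRUCT rasterisation can be illustrated with plain NumPy. The sketch below assumes an axial series (so slice matching can use the z-coordinate alone); `patient_to_pixel` and `nearest_slice` are hypothetical helper names, and the tag values are made up. The resulting row/column arrays are what would then be handed to `skimage.draw.polygon`.

```python
import numpy as np


def patient_to_pixel(points_xyz, ipp, iop, pixel_spacing):
    """Map patient-space (x, y, z) points in mm to fractional (row, col) indices.

    ipp: ImagePositionPatient of the slice (centre of the first voxel)
    iop: ImageOrientationPatient (six direction cosines)
    pixel_spacing: PixelSpacing as (row spacing, column spacing) in mm
    """
    pts = np.asarray(points_xyz, dtype=float)
    rel = pts - np.asarray(ipp, dtype=float)
    row_dir = np.asarray(iop[:3], dtype=float)  # direction of increasing column index
    col_dir = np.asarray(iop[3:], dtype=float)  # direction of increasing row index
    cols = rel @ row_dir / pixel_spacing[1]
    rows = rel @ col_dir / pixel_spacing[0]
    return rows, cols


def nearest_slice(contour_z, slice_zs, tol_mm=1.0):
    """Index of the slice closest to the contour's median z, or None if beyond the tolerance."""
    z = float(np.median(contour_z))
    diffs = np.abs(np.asarray(slice_zs, dtype=float) - z)
    idx = int(np.argmin(diffs))
    return idx if diffs[idx] <= tol_mm else None


# Made-up geometry: axial slices at z = 50, 52, 54 mm with 0.5 mm pixels.
contour = np.array([(-90.0, -95.0, 52.1), (-89.0, -94.0, 52.1)])
rows, cols = patient_to_pixel(contour, ipp=(-100.0, -100.0, 52.0),
                              iop=(1, 0, 0, 0, 1, 0), pixel_spacing=(0.5, 0.5))
slice_idx = nearest_slice(contour[:, 2], [50.0, 52.0, 54.0])
missing = nearest_slice([60.0], [50.0, 52.0, 54.0])
```

Returning `None` for contours with no slice inside the tolerance lets the caller skip them rather than paint them onto the wrong slice.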
Key outputs:
- `segmentation_masks/<subject_id>_<series_id>.npy`: 3D binary uint8 mask arrays
- Updated `merged_seg_with_paths.csv` with `mask_path` column
File: Baseline_model_final.ipynb
A TensorFlow/Keras-based 3D CNN trained as a baseline classifier, with optional dual-channel (CT + mask) input and Grad-CAM visualisation.
What it does:
Training:
- Reads the merged CSV and dynamically determines whether segmentation masks are available; sets `CHANNELS = 2` (image + mask) if masks are found on disk, otherwise `CHANNELS = 1`.
- Resizes volumes to `128 × 128 × 128` using scipy zoom (order=1 for image, order=0 for mask), then normalises image to `[0, 1]`.
- Applies stratified 70/15/15 train/val/test split.
- Builds a small 3D CNN: three Conv3D–BatchNorm–MaxPool3D blocks (16→32→64 filters), GlobalAveragePooling3D, Dense(128) + Dropout(0.4), sigmoid output.
- Trains with `binary_crossentropy`, `Adam`, balanced class weights, and callbacks for model checkpointing (best val AUC), ReduceLROnPlateau, EarlyStopping (patience=8), and a custom `HistorySaver` that appends epoch metrics to a CSV after every epoch.
- Supports resume from checkpoint: reads the saved model and history CSV to continue training from the last completed epoch.
- Saves final model as `.keras`, test predictions, and training history plot (Loss + AUC vs Epoch).
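One way to obtain a stratified 70/15/15 split is two successive calls to scikit-learn's `train_test_split`. The sketch below is an assumption about how such a split could be done, not a copy of the notebook; `stratified_70_15_15` and the seed are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def stratified_70_15_15(labels, seed=42):
    """Return (train, val, test) index arrays with class ratios preserved in each part."""
    labels = np.asarray(labels)
    idx = np.arange(len(labels))
    n_val = int(round(0.15 * len(labels)))
    n_test = int(round(0.15 * len(labels)))
    # First carve off the 30% hold-out, then halve it into validation and test.
    train_idx, hold_idx = train_test_split(
        idx, test_size=n_val + n_test, stratify=labels, random_state=seed)
    val_idx, test_idx = train_test_split(
        hold_idx, test_size=n_test, stratify=labels[hold_idx], random_state=seed)
    return train_idx, val_idx, test_idx


# Synthetic labels: 60 cases and 40 controls.
labels = np.array([1] * 60 + [0] * 40)
train_idx, val_idx, test_idx = stratified_70_15_15(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

Passing integer sizes instead of fractions avoids rounding surprises on small cohorts.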
Grad-CAM:
- Two versions are implemented. The first loads a single-channel model and generates Grad-CAM using `GradientTape` on the last Conv3D layer; the second handles the dual-channel model with both CT and mask channels.
- Upsamples the 3D heatmap back to the original volume resolution, overlays it on 6 axial slices, and saves the result as a PNG.
Key outputs:
- `best_model.keras` / `final_trained_model.keras`
- `roc_curve.png`, `training_history_plot.png`
- `test_predictions.csv`
- `gradcam_visuals/gradcam_<subject>.png`
File: 3D_ResNet.ipynb
A PyTorch/MONAI-based 3D ResNet-10 classifier that takes dual-channel (CT + mask) input, trained CPU-safely with checkpointing and early stopping.
What it does:
Dataset:
- Loads image `.npy` and mask `.npy` for each subject. If a mask is unavailable, a zero-tensor is used as the mask channel.
- Clips CT to `[−1000, 1000]` HU, scales to `[0, 1]`, and resizes to `64 × 128 × 128` (D × H × W) using `scipy.ndimage.zoom`.
- Constructs a 2-channel tensor `[image, mask]` of shape `(2, D, H, W)`.
- Training augmentations: joint random 90° rotations and depth-axis flips on image and mask, random multiplicative/additive intensity scaling, and small Gaussian noise injection.
Model:
- MONAI `resnet10` with `n_input_channels=2` and `num_classes=1`.
- All backbone convolutional layers are frozen; only the final fully-connected head is trained, making this viable on CPU.
- Loss: `BCEWithLogitsLoss` with class-frequency-based positive weight.
- Optimiser: `AdamW(lr=1e-4, weight_decay=1e-5)`.
- 70/15/15 stratified split; trains for up to 50 epochs with patience=10 early stopping on validation AUC.
- Checkpoint saved every epoch; best model saved on AUC improvement.
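To make the class-frequency-based positive weight concrete, here is a NumPy re-implementation of the weighted loss that follows PyTorch's documented `BCEWithLogitsLoss` formula. The choice `pos_weight = n_negative / n_positive` is a common convention and is assumed here; the notebook may compute it differently.

```python
import numpy as np


def positive_weight(labels):
    """Ratio of negatives to positives (assumed definition of the positive weight)."""
    labels = np.asarray(labels)
    n_pos = int(labels.sum())
    n_neg = int(len(labels) - n_pos)
    return n_neg / max(n_pos, 1)


def weighted_bce_with_logits(logits, targets, pos_weight):
    """Mean of -[w * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))]."""
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets, dtype=float)
    log_p = -np.logaddexp(0.0, -logits)      # log(sigmoid(x)), numerically stable
    log_not_p = -np.logaddexp(0.0, logits)   # log(1 - sigmoid(x))
    loss = -(pos_weight * targets * log_p + (1.0 - targets) * log_not_p)
    return float(loss.mean())


labels = np.array([1, 0, 0, 0])
w = positive_weight(labels)
loss = weighted_bce_with_logits(np.zeros(4), labels, w)
print(w, loss)
```

With one positive among four samples, the single positive contributes as much to the loss as the three negatives combined.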
Evaluation and plots:
- Computes test AUC, saves loss curve, validation AUC curve, ROC curve, and confusion matrix as PNG files.
Grad-CAM (within the same notebook):
- Loads the best saved checkpoint back into MONAI ResNet-10.
- Hooks the last `Conv3d` layer to capture activations and gradients.
- For each randomly selected sample, generates a 3D Grad-CAM heatmap (global-average-pooled gradient weighting), saves a 3-panel figure: CT mid-axial slice, segmentation mask, and CT with Grad-CAM overlay.
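The global-average-pooled gradient weighting at the heart of Grad-CAM can be shown without a deep learning framework. The sketch below assumes the hooked layer's activations and gradients have already been captured as `(C, D, H, W)` arrays; `grad_cam_3d` is a hypothetical name and the toy values are made up.

```python
import numpy as np


def grad_cam_3d(activations, gradients, eps=1e-8):
    """Combine captured activations and gradients, both shaped (C, D, H, W), into a heatmap."""
    weights = gradients.mean(axis=(1, 2, 3))                  # one weight per channel
    cam = np.tensordot(weights, activations, axes=(0, 0))     # weighted channel sum -> (D, H, W)
    cam = np.maximum(cam, 0.0)                                # keep positive evidence only
    return cam / (cam.max() + eps)                            # scale to roughly [0, 1]


# Toy example: channel 0 supports the prediction, channel 1 opposes it.
activations = np.zeros((2, 2, 3, 3))
activations[0, 1, 1, 1] = 2.0
activations[1, 0, 0, 0] = 5.0
gradients = np.ones((2, 2, 3, 3))
gradients[1] = -1.0
cam = grad_cam_3d(activations, gradients)
peak = np.unravel_index(np.argmax(cam), cam.shape)
print(peak)
```

The ReLU step is what removes the strongly activated but negatively weighted voxel from the final map.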
Key outputs:
- `checkpoint_cpu_masks.pth`, `best_model_cpu_masks.pth`
- `loss_curve.png`, `val_auc_curve.png`, `roc_curve.png`, `confusion_matrix.png`
- `gradcam_outputs/<subject>_gradcam.png`
File: ResNet_Grad_CAM.ipynb
A standalone, improved Grad-CAM visualisation notebook for the trained MONAI ResNet-18 checkpoint, producing smooth, upsampled heatmaps with mask contour overlays.
What it does:
- Forces CPU-only mode and disables GPU/TensorFlow to avoid CUDA conflicts in Colab.
- Loads the MONAI ResNet-18 checkpoint (saved as state dict or dict-with-state-dict) using `strict=False`.
- Implements a `GradCAM3D` class that:
  - Hooks `model.layer4[-1].conv2` (last conv of the deepest ResNet block) for activations and gradients.
  - Computes channel weights by global-average pooling of gradients.
  - Upsamples the resulting 3D heatmap to the original input resolution using trilinear interpolation (`F.interpolate`).
  - Applies volumetric Gaussian smoothing (sigma=1.0) to reduce blockiness.
- Downsamples input volumes to a maximum of 128 on any axis to manage CPU memory.
- Resamples segmentation masks to match image shape using nearest-neighbour zoom with shape correction.
- For each of 3 randomly selected samples, generates a 3-column figure:
  - Original CT mid-axial slice
  - CT with mask overlay (semi-transparent red)
  - CT with smooth Grad-CAM heatmap overlay (jet colourmap) + white segmentation contour drawn using `skimage.measure.find_contours`
- Adds a colour bar to the heatmap panel.
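The upsample-then-smooth post-processing can be approximated with SciPy. Note that this sketch swaps in `scipy.ndimage.zoom` for the `F.interpolate` call the notebook uses, so values at the borders may differ slightly; `upsample_and_smooth` is a hypothetical helper and the shapes are toy sizes.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom


def upsample_and_smooth(cam, target_shape, sigma=1.0):
    """Linearly upsample a coarse heatmap, blur it, and rescale it to [0, 1]."""
    factors = [t / s for t, s in zip(target_shape, cam.shape)]
    up = zoom(cam, factors, order=1)          # linear interpolation along each axis
    up = gaussian_filter(up, sigma=sigma)     # volumetric smoothing to reduce blockiness
    up = up - up.min()
    peak = up.max()
    return up / peak if peak > 0 else up


coarse = np.zeros((4, 8, 8))
coarse[2, 4, 4] = 1.0                         # single hot voxel in the coarse map
heatmap = upsample_and_smooth(coarse, target_shape=(16, 32, 32))
print(heatmap.shape)
```

The same function shape works for any layer resolution, since the zoom factors are derived from the two shapes.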
Key outputs:
- Multi-panel PNG per subject showing original CT, mask overlay, and upsampled Grad-CAM
File: XGB____LR_Fusion_pipeline.ipynb
The final classification and ensemble fusion stage, combining radiomics features with out-of-fold (OOF) deep ResNet features, applying scanner harmonisation, and evaluating multiple classifier variants.
What it does:
Step 1 — ROI filtering:
- Loads the full radiomics CSV and normalises ROI names to canonical tokens (`PANCREAS-1`, `PANCREAS-V2`, `PANCREATIC DUCT`, `NO FINDINGS`).
- Selects one row per subject by ROI priority; falls back to first occurrence if no canonical ROI is present.
- Saves a filtered one-row-per-subject radiomics CSV.
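The priority-based row selection can be sketched in plain Python. The priority order below simply follows the order in which the canonical tokens are listed above, which is an assumption, as are the normalisation rules and the `subject_id` / `roi` field names.

```python
import re

# Assumed priority: earlier entries win.
ROI_PRIORITY = ["PANCREAS-1", "PANCREAS-V2", "PANCREATIC DUCT", "NO FINDINGS"]


def normalise_roi(name):
    """Upper-case, trim, and collapse whitespace/underscores (illustrative rules)."""
    return re.sub(r"[\s_]+", " ", str(name)).strip().upper()


def select_one_row_per_subject(rows):
    """Keep the highest-priority ROI per subject; fall back to the first row seen."""
    rank = {roi: i for i, roi in enumerate(ROI_PRIORITY)}
    best = {}
    for pos, row in enumerate(rows):
        # Non-canonical ROIs share the lowest rank, so file position breaks the tie.
        key = (rank.get(normalise_roi(row["roi"]), len(ROI_PRIORITY)), pos)
        sid = row["subject_id"]
        if sid not in best or key < best[sid][0]:
            best[sid] = (key, row)
    return [row for _, row in best.values()]


rows = [
    {"subject_id": "A", "roi": "Liver"},
    {"subject_id": "A", "roi": "pancreas-1"},
    {"subject_id": "B", "roi": "kidney"},
    {"subject_id": "B", "roi": "spleen"},
    {"subject_id": "C", "roi": "no_findings"},
    {"subject_id": "C", "roi": "Pancreatic  Duct"},
]
selected = select_one_row_per_subject(rows)
print([r["roi"] for r in selected])
```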
Step 2 — OOF Deep Feature Extraction:
- Loads the filtered radiomics CSV to obtain per-subject `series_uid_used` and labels.
- Performs 5-fold stratified cross-validation: for each fold, trains a MONAI `resnet18` from scratch (or from an external checkpoint) on the training subjects, then extracts mid-layer features (from `layer3` or `layer2`, followed by `AdaptiveAvgPool3d`) for the held-out subjects.
- This produces out-of-fold deep features that are label-leakage free.
- Merges OOF deep features with the radiomics CSV to create a combined fusion feature table.
Step 3 — Scanner Harmonisation:
- Removes shape, diagnostic, and metadata columns from the feature table.
- Applies ComBat harmonisation via `neuroHarmonize` using `SITE` (dataset origin: CPTAC-PDA vs Pancreas-CT) as the batch covariate, correcting for scanner-driven distributional differences.
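As intuition for what a batch correction does, the sketch below aligns each site's feature means and standard deviations to the pooled values. This is deliberately not ComBat: ComBat additionally applies empirical Bayes shrinkage to the per-site estimates and can preserve covariate effects, neither of which is done here. `align_sites` and the synthetic data are hypothetical.

```python
import numpy as np


def align_sites(features, sites):
    """Shift and scale each site's features to the pooled mean and standard deviation."""
    features = np.asarray(features, dtype=float)
    sites = np.asarray(sites)
    pooled_mean = features.mean(axis=0)
    pooled_std = features.std(axis=0) + 1e-12
    out = np.empty_like(features)
    for site in np.unique(sites):
        rows = sites == site
        mu = features[rows].mean(axis=0)
        sd = features[rows].std(axis=0) + 1e-12
        out[rows] = (features[rows] - mu) / sd * pooled_std + pooled_mean
    return out


rng = np.random.default_rng(0)
site_a = rng.normal(0.0, 1.0, size=(30, 5))   # synthetic features from one site
site_b = rng.normal(5.0, 2.0, size=(30, 5))   # same features with an offset and wider spread
features = np.vstack([site_a, site_b])
sites = np.array(["A"] * 30 + ["B"] * 30)
aligned = align_sites(features, sites)
```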
Step 4 — Ensemble Fusion (three classifier variants):
Three separate fusion pipelines are implemented and evaluated independently:
| Variant | Base learner |
|---|---|
| XGB fusion | XGBoost |
| RF fusion | Random Forest |
| LR fusion | Logistic Regression |
Each pipeline:
- Splits data 80/20 (stratified) into train and held-out test sets.
- Uses an inner 5-fold OOF stacking loop on the training set to train separate radiomics and deep feature pipelines (`StandardScaler → PCA(95% variance) → base learner`).
- Trains a `LogisticRegression` meta-learner on the OOF probability outputs from both branches.
- Evaluates four fusion strategies: radiomics-only, deep-only, average fusion, and stacked meta-learner.
- Runs 5-fold nested outer CV for robust, unbiased AUC estimates.
- Runs 30 repeated random splits and records AUC per repeat.
- Computes bootstrap AUC confidence intervals (2000 resamples) and paired bootstrap significance tests comparing average fusion to each individual modality.
- Maps SHAP values (or LR coefficients) from PCA space back to original radiomics features for interpretability; tracks feature stability across repeats.
- Generates calibration curves, sensitivity/specificity with Clopper-Pearson CIs, confusion matrices, and ROC plots.
- Runs a data leakage check: detects duplicate rows, zero-variance features, ID column candidates, and feature vector contradictions.
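The core stacking loop described above might look roughly like this with scikit-learn, shown for the Logistic Regression variant on synthetic features. Feature dimensions, seeds, and the synthetic data are assumptions; the nested CV, repeats, and SHAP steps are omitted.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def branch_pipeline():
    """StandardScaler -> PCA keeping 95% of variance -> base learner (LR here)."""
    return make_pipeline(StandardScaler(), PCA(n_components=0.95),
                         LogisticRegression(max_iter=1000))


rng = np.random.default_rng(0)
y = np.array([0, 1] * 60)
X_rad = rng.normal(size=(120, 20)) + 0.8 * y[:, None]    # synthetic "radiomics" block
X_deep = rng.normal(size=(120, 30)) + 0.5 * y[:, None]   # synthetic "deep feature" block

# 80/20 stratified split (24 of 120 subjects held out).
idx_tr, idx_te = train_test_split(np.arange(120), test_size=24, stratify=y, random_state=0)
inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold probabilities from each branch, computed on the training set only.
oof = np.column_stack([
    cross_val_predict(branch_pipeline(), X[idx_tr], y[idx_tr],
                      cv=inner, method="predict_proba")[:, 1]
    for X in (X_rad, X_deep)
])
meta = LogisticRegression().fit(oof, y[idx_tr])

# Refit each branch on the full training set, then score the held-out test set.
test_probs = np.column_stack([
    branch_pipeline().fit(X[idx_tr], y[idx_tr]).predict_proba(X[idx_te])[:, 1]
    for X in (X_rad, X_deep)
])
scores = {
    "radiomics_only": roc_auc_score(y[idx_te], test_probs[:, 0]),
    "deep_only": roc_auc_score(y[idx_te], test_probs[:, 1]),
    "average_fusion": roc_auc_score(y[idx_te], test_probs.mean(axis=1)),
    "stacked_meta": roc_auc_score(y[idx_te], meta.predict_proba(test_probs)[:, 1]),
}
print(scores)
```

Training the meta-learner on out-of-fold probabilities, rather than on in-sample predictions, is what keeps the stacked model from simply rewarding whichever branch overfits most.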
Key outputs (per classifier variant):
- `stacked_pipelines_models.joblib`
- `main_run_auc_bootstrap_ci.csv`
- `nested_outer_cv_results.csv`, `nested_outer_cv_summary.csv`
- `repeats_stacking_results.csv`, `repeats_stacking_summary_stats.csv`
- `paired_bootstrap_avg_vs_baselines_repeats.csv`
- `feature_stability_top10_counts.csv`
- `radiomics_feature_importance_mapped_from_shap_main.csv`
- ROC, calibration, and confusion matrix PNG files
- `leakage_report.json`
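A bootstrap AUC confidence interval of the kind stored in `main_run_auc_bootstrap_ci.csv` could be computed along these lines. The percentile method and the handling of single-class resamples are assumptions; `bootstrap_auc_ci` is a hypothetical helper and the predictions are synthetic.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def bootstrap_auc_ci(y_true, y_prob, n_boot=2000, alpha=0.05, seed=0):
    """Return (AUC, lower, upper) using a percentile bootstrap over subjects."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    rng = np.random.default_rng(seed)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), size=len(y_true))   # resample with replacement
        if y_true[idx].min() == y_true[idx].max():
            continue                                           # AUC is undefined for one class
        aucs.append(roc_auc_score(y_true[idx], y_prob[idx]))
    lo, hi = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(roc_auc_score(y_true, y_prob)), float(lo), float(hi)


rng = np.random.default_rng(1)
y = np.array([0, 1] * 50)
prob = np.clip(0.35 + 0.3 * y + rng.normal(0.0, 0.2, size=100), 0.0, 1.0)
auc, lo, hi = bootstrap_auc_ci(y, prob, n_boot=500)   # fewer resamples to keep the demo quick
print(round(auc, 3), round(lo, 3), round(hi, 3))
```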
pip install pydicom rt_utils nibabel SimpleITK scikit-image
pip install monai torch torchvision
pip install tensorflow keras
pip install pandas numpy scipy tqdm matplotlib
pip install pyradiomics
pip install xgboost scikit-learn shap joblib
pip install neuroHarmonize neuroCombat
pip install tcia_utils altair openpyxl

All notebooks are designed to run in Google Colab with Google Drive mounted at `/content/drive/MyDrive/TCIA_Data/`.
TCIA_Data/
├── CPTAC-PDA/ # Downloaded CPTAC-PDA DICOM series
├── CPTAC_Negative/ # CPTAC-PDA negative assessment DICOMs
├── CPTAC-PDA_Segementation/ # RTSTRUCT annotation DICOMs
├── Healthy_Pancreas_CT/ # Pancreas-CT DICOM series
├── NIfTI_Uncompressed/ # Decompressed NIfTI label files
├── output_3d_volumes/ # Per-subject 3D .npy CT volumes
├── output_2d_slices/ # Per-subject 2D slice .npy arrays
├── segmentation_masks/ # Generated binary mask .npy arrays
├── radiomics_output/ # Raw radiomics feature CSVs
├── pancreas_training_with_masks/ # ResNet training outputs + checkpoints
│ └── gradcam_outputs/ # Grad-CAM visualisation PNGs
├── outputs_3dcnn_2/ # Baseline CNN outputs + Grad-CAM visuals
├── New_XGB_fusion_results/ # XGBoost fusion artefacts
├── New_RF_fusion_results/ # Random Forest fusion artefacts
├── New_LR_fusion_results/ # Logistic Regression fusion artefacts
├── CPTAC_PDA_merged.csv # Master case/control metadata CSV
├── merged_seg_with_paths.csv # Segmentation paths + subject metadata
├── fusion_radiomics_filtered.csv # One-row-per-subject filtered radiomics
└── oof_middle_layer_fusion_features.csv # Combined radiomics + OOF deep features
| Artefact | Description |
|---|---|
| `best_model_cpu_masks.pth` | Best MONAI ResNet-10 checkpoint (by val AUC) |
| `final_trained_model.keras` | Best Keras baseline 3D CNN |
| `radiomics_features.csv` | Per-subject per-ROI radiomics features |
| `oof_middle_layer_fusion_features.csv` | Leakage-free deep + radiomics features |
| `main_run_auc_bootstrap_ci.csv` | AUC with 95% CI for all four fusion models |
| `nested_outer_cv_summary.csv` | Mean ± std AUC across outer CV folds |
| `feature_stability_top10_counts.csv` | Radiomics feature stability across repeats |
| `*_gradcam.png` | Grad-CAM overlays with mask contours |
| `leakage_report.json` | Data integrity and leakage check summary |
💡 All notebooks were developed and executed in Google Colab (NVIDIA Tesla T4/V100, 12–24 GB VRAM).
This project uses two publicly available datasets from The Cancer Imaging Archive (TCIA):
| Dataset | Cohort | Subjects | Label | Notes |
|---|---|---|---|---|
| CPTAC-PDA | Tumour | ~103 subjects | PDAC-positive | Multi-vendor CT; RTSTRUCT annotations |
| Pancreas-CT | Control | 80 subjects | PDAC-negative | NIH Clinical Center; NIfTI annotations |
CT Characteristics:
- Pancreas-CT: Siemens/Philips MDCT at 120 kVp, 512×512 matrix, portal venous phase (~70s post-contrast), 1.5–2.5 mm slice thickness.
- CPTAC-PDA: Variable acquisition parameters, multi-vendor, contrast-enhanced and non-contrast studies included.
Label Distribution:
- Pancreas-CT (controls): 53M / 27F, mean age 46.8 ± 16.7 years. CPTAC-PDA (tumour cohort): mean age ~65 years.
- Binary classification: PDAC (1) vs. healthy pancreas (0).
pip install pydicom nibabel SimpleITK numpy scipy scikit-image
pip install pyradiomics neuroHarmonize
pip install monai torch torchvision
pip install scikit-learn xgboost shap
pip install matplotlib seaborn pandas

| Library | Purpose |
|---|---|
| `pydicom`, `SimpleITK` | DICOM loading and image processing |
| `nibabel` | NIfTI segmentation file handling |
| `scipy`, `scikit-image` | Resampling, rasterisation |
| `PyRadiomics` | Handcrafted radiomics feature extraction |
| `NeuroHarmonize` | ComBat scanner harmonisation |
| `MONAI`, `PyTorch` | 3D ResNet-10, volumetric deep learning |
| `scikit-learn` | PCA, Logistic Regression, cross-validation |
| `XGBoost` | Non-linear ensemble classification |
| `SHAP` | Radiomics explainability |
| Grad-CAM (custom) | Deep learning spatial explainability |
| `matplotlib`, `seaborn` | Visualisation |
All experiments were run on Google Colab with GPU acceleration (Tesla T4 / V100). No local GPU is required.
Anoushka Tomar
📧 anoushkatomar30@gmail.com
🔗 ORCID: 0009-0009-2676-0204
© 2026 Anoushka. Licensed under Apache 2.0.
This research uses publicly available datasets from The Cancer Imaging Archive (TCIA). We gratefully acknowledge CPTAC and the NIH for curating and maintaining these resources.