Exploratory analysis and visualization of a breast cancer histopathology image dataset, including binary and multiclass classification structures across multiple magnification levels.
This repository provides an exploratory analysis and structured overview of a breast cancer histopathology image dataset. The goal is to clearly document the dataset organization, classification tasks, and image characteristics before applying machine learning or deep learning models.
The dataset contains microscopic images of breast tissue acquired at multiple magnification levels and supports both binary and multiclass classification problems.
The dataset is organized into two main classification tasks:
Objective: Distinguish between benign and malignant breast tissue samples.

    classificacao_binaria/
    ├── 40X/
    │   ├── benign/
    │   └── malignant/
    ├── 100X/
    ├── 200X/
    └── 400X/

Classes:
- benign
- malignant
Objective: Classify breast tissue into specific histopathological tumor subtypes.

    classificacao_multiclasse/
    ├── 40X/
    │   ├── adenosis/
    │   ├── ductal_carcinoma/
    │   ├── fibroadenoma/
    │   ├── lobular_carcinoma/
    │   ├── mucinous_carcinoma/
    │   ├── papillary_carcinoma/
    │   ├── phyllodes_tumor/
    │   └── tubular_adenoma/
    ├── 100X/
    ├── 200X/
    └── 400X/

Classes (8 total):
- Adenosis
- Ductal Carcinoma
- Fibroadenoma
- Lobular Carcinoma
- Mucinous Carcinoma
- Papillary Carcinoma
- Phyllodes Tumor
- Tubular Adenoma
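
The folder layout described above (task / magnification / class) can be checked programmatically before any modeling. The following is a minimal sketch, not the repository's own script: the function name `count_images` and the list of accepted file extensions are assumptions made for illustration, and the task folders are assumed to sit in the current working directory.

```python
import os
from collections import Counter

# Assumed image file types; adjust to whatever the dataset actually contains.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".tif", ".tiff")


def count_images(task_root):
    """Count images per (magnification, class) under one task folder.

    Expects the layout task_root/<magnification>/<class>/<image files>.
    """
    counts = Counter()
    for magnification in sorted(os.listdir(task_root)):
        mag_dir = os.path.join(task_root, magnification)
        if not os.path.isdir(mag_dir):
            continue  # skip stray files next to the magnification folders
        for class_name in sorted(os.listdir(mag_dir)):
            class_dir = os.path.join(mag_dir, class_name)
            if not os.path.isdir(class_dir):
                continue
            counts[(magnification, class_name)] = sum(
                1
                for name in os.listdir(class_dir)
                if name.lower().endswith(IMAGE_EXTENSIONS)
            )
    return counts


if __name__ == "__main__":
    for task in ("classificacao_binaria", "classificacao_multiclasse"):
        if os.path.isdir(task):
            for (mag, cls), n in sorted(count_images(task).items()):
                print(f"{task}  {mag:>4}  {cls:<22} {n}")
```

A class folder that reports zero images, or a magnification folder missing one of the expected classes, is a quick signal that the download or extraction was incomplete.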
Images are provided at four different magnifications:
- 40X – Low magnification (global tissue structure)
- 100X – Intermediate magnification
- 200X – Higher cellular detail
- 400X – Fine-grained cellular morphology
This allows analysis of how magnification impacts visual features and model performance.
To facilitate qualitative inspection, this repository includes a visualization script that:
- Randomly selects representative images from each class
- Covers both binary and multiclass tasks
- Displays samples across different magnification levels
This step is crucial for:
- Understanding intra-class variability
- Identifying visual differences between classes
- Verifying dataset integrity before training models
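
As an illustration of the kind of inspection described above, the sketch below picks one random image per class at a given magnification and shows them side by side. It is not the repository's actual script: the names `pick_samples` and `show_samples`, the extension list, and the assumption that the task folders sit in the working directory are all hypothetical. Plotting relies on Pillow and Matplotlib being installed.

```python
import os
import random

# Assumed image file types; adjust to the dataset's actual format.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".tif", ".tiff")


def pick_samples(task_root, magnification, seed=None):
    """Return {class_name: path} with one randomly chosen image per class."""
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    mag_dir = os.path.join(task_root, magnification)
    samples = {}
    for class_name in sorted(os.listdir(mag_dir)):
        class_dir = os.path.join(mag_dir, class_name)
        if not os.path.isdir(class_dir):
            continue
        images = sorted(
            name
            for name in os.listdir(class_dir)
            if name.lower().endswith(IMAGE_EXTENSIONS)
        )
        if images:  # classes with no images are simply left out
            samples[class_name] = os.path.join(class_dir, rng.choice(images))
    return samples


def show_samples(samples, title=""):
    """Display the chosen images in a single row, one panel per class."""
    # Imported here so that pick_samples works without the plotting libraries.
    import matplotlib.pyplot as plt
    from PIL import Image

    if not samples:
        return
    fig, axes = plt.subplots(
        1, len(samples), figsize=(3 * len(samples), 3), squeeze=False
    )
    for ax, (class_name, path) in zip(axes[0], samples.items()):
        with Image.open(path) as img:
            ax.imshow(img)
        ax.set_title(class_name)
        ax.axis("off")
    fig.suptitle(title)
    plt.show()


if __name__ == "__main__":
    for task in ("classificacao_binaria", "classificacao_multiclasse"):
        for mag in ("40X", "100X", "200X", "400X"):
            if os.path.isdir(os.path.join(task, mag)):
                show_samples(pick_samples(task, mag), title=f"{task} / {mag}")
```

Running the selection several times with different seeds gives a rough feel for intra-class variability; an image that fails to open points to a corrupted file.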
This repository focuses only on dataset understanding, including:
- Folder hierarchy documentation
- Class definitions
- Magnification levels
- Sample image visualization
Future repositories may build upon this dataset for classification, segmentation, or
representation learning tasks.
- Python
- os (Python standard library module, used for file system handling)
- Pillow (PIL)
- Matplotlib
- Images are organized in a format compatible with most deep learning frameworks (e.g., PyTorch, TensorFlow).
- The dataset is suitable for benchmarking binary vs. multiclass classification approaches.
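
The one-folder-per-class layout is what makes the data easy to load: directory-based loaders such as torchvision's `ImageFolder` derive integer labels from the sorted class folder names when pointed at a single magnification directory. The sketch below reproduces that idea in plain Python, without depending on any framework. The function name `build_index` and the extension list are assumptions for illustration only.

```python
import os

# Assumed image file types; adjust to the dataset's actual format.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".tif", ".tiff")


def build_index(mag_dir):
    """Build (path, label) pairs from a <magnification>/<class>/ directory.

    Labels are assigned by sorting the class folder names alphabetically,
    so the mapping is stable across runs.
    """
    classes = sorted(
        name
        for name in os.listdir(mag_dir)
        if os.path.isdir(os.path.join(mag_dir, name))
    )
    class_to_idx = {name: idx for idx, name in enumerate(classes)}
    items = []
    for name in classes:
        class_dir = os.path.join(mag_dir, name)
        for fname in sorted(os.listdir(class_dir)):
            if fname.lower().endswith(IMAGE_EXTENSIONS):
                items.append((os.path.join(class_dir, fname), class_to_idx[name]))
    return items, class_to_idx
```

For example, `build_index("classificacao_binaria/40X")` would map `benign` to 0 and `malignant` to 1, while the same call on `classificacao_multiclasse/40X` would yield eight labels, which makes it straightforward to run the same pipeline on both tasks.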
Possible extensions include:
- Dataset statistics (image counts per class and magnification)
- Train/validation/test splitting
- Baseline deep learning models
- Cross-magnification experiments

Data: https://data.mendeley.com/datasets/jxwvdwhpc2/1
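
For the train/validation/test splitting extension listed above, one possible starting point is a per-class (stratified) split, so that each subset keeps roughly the same class proportions. This is a sketch under assumptions, not part of the repository: the function name `stratified_split`, the default fractions, and the input format of `(path, label)` pairs are all hypothetical choices.

```python
import random
from collections import defaultdict


def stratified_split(items, fractions=(0.7, 0.15, 0.15), seed=0):
    """Split (path, label) pairs into train/val/test lists, class by class.

    Fractions are applied within each class; any remainder left over from
    rounding down goes to the test subset.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")

    by_label = defaultdict(list)
    for path, label in items:
        by_label[label].append(path)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, val, test = [], [], []
    for label in sorted(by_label):
        paths = sorted(by_label[label])
        rng.shuffle(paths)
        n_train = int(len(paths) * fractions[0])
        n_val = int(len(paths) * fractions[1])
        train += [(p, label) for p in paths[:n_train]]
        val += [(p, label) for p in paths[n_train:n_train + n_val]]
        test += [(p, label) for p in paths[n_train + n_val:]]
    return train, val, test
```

Splitting each magnification separately with the same seed keeps the procedure comparable across 40X, 100X, 200X, and 400X, which is a prerequisite for the cross-magnification experiments mentioned in the list.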