
Add support for real-world datasets (Kaggle IEEE-CIS, Credit Card Fraud) #1

Description

@stabrea

Summary

The current model trains on synthetic or limited sample data. To validate real-world performance and establish honest benchmarks, we should integrate well-known public fraud detection datasets.

Motivation

  • Real-world fraud data is heavily imbalanced, and the ratio differs by dataset (~0.17% fraud in Credit Card Fraud, ~3.5% in IEEE-CIS)
  • Benchmarking against known datasets allows comparison with published research
  • Exposes edge cases that synthetic data misses (merchant category patterns, time-of-day effects)

Proposed Approach

  1. Add data loader modules for:

    • IEEE-CIS (Kaggle)
    • Credit Card Fraud (Kaggle)
  2. Preprocessing pipeline:

    • Handle missing values and categorical encoding for IEEE-CIS
    • Implement stratified train/test splitting preserving fraud ratio
    • Add feature engineering for temporal and aggregation features
  3. Benchmarking:

    • Report precision, recall, F1, and AUC-PR (not just AUC-ROC, which can look deceptively strong under heavy class imbalance)
    • Compare against published baselines
    • Document results in a benchmarks/ directory
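
To make the splitting and metric items above concrete, here is a minimal standard-library sketch. The function names (`stratified_split`, `precision_recall_f1`, `average_precision`) are hypothetical and not part of this repo. AUC-PR is approximated here by average precision, and tied scores are not given special handling. In practice, scikit-learn's `train_test_split(..., stratify=y)` and `average_precision_score` would likely replace these helpers.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), splitting each class at the same rate."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for y in sorted(by_class):
        idx = by_class[y]
        rng.shuffle(idx)
        # Same fraction per class, so the fraud ratio is preserved in both parts.
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive (fraud = 1) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def average_precision(y_true, scores):
    """Mean of precision-at-rank over every true positive, ranked by score."""
    n_pos = sum(y_true)
    if n_pos == 0:
        return 0.0
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / n_pos
```

With 10 fraud rows among 1,000 and `test_fraction=0.2`, the test split receives 2 fraud rows and 198 legitimate ones, so both parts keep the 1% ratio.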

Acceptance Criteria

  • Data loaders for both datasets with automatic download/caching
  • Preprocessing handles missing values and encoding
  • Benchmark results documented with honest metrics
  • README updated with dataset instructions and results table
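
The download/caching criterion could look roughly like the sketch below. `ensure_cached` and the `fetch` callback are hypothetical names, not an existing API. The actual download step is left as a caller-supplied function because Kaggle downloads generally need API credentials, which this sketch does not handle.

```python
from pathlib import Path


def ensure_cached(filename, fetch, cache_dir):
    """Return the local path to `filename`, calling `fetch(path)` only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / filename
    if not target.exists():
        # Download to a temporary name first so an interrupted fetch
        # is never mistaken for a complete cached file.
        tmp = target.with_name(target.name + ".part")
        fetch(tmp)  # hypothetical downloader: writes the dataset to `tmp`
        tmp.replace(target)
    return target
```

A loader would call `ensure_cached("creditcard.csv", my_download_fn, "data/cache")` and then parse the returned path. The file name, callback, and cache directory here are placeholders.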

Metadata

    Labels

    enhancement (New feature or request)
