A production-ready spam detection web application extending the research paper:

"Machine Learning-Based Email Spam Detection: Accuracy, Overfitting and Robustness Analysis", published in EJASET, Volume 3, Issue 6, 2025.

👉 Paper: https://doi.org/10.59324/ejaset.2025.3(6).06

Built by Ghulam Muhayyudin, Computer Science Undergraduate Researcher.

👉 Live demo: https://email-spam-detector-1003.streamlit.app

This system classifies emails as spam or ham (legitimate) using:
- TF-IDF vectorisation (5,000 features + bigrams)
- Random Forest — Best overall model (Accuracy: 98.21%, F1: 98.11%)
- Logistic Regression CV — Strong alternative (Accuracy: 97.84%, F1: 97.73%)
- Tesseract OCR — Detects text hidden inside image attachments (multimodal spam)
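
As a rough illustration of how the TF-IDF and Random Forest pieces listed above fit together, here is a minimal scikit-learn sketch. It is not the project's `trainer.py`: the toy corpus, the variable names, and `random_state` are assumptions made for the example. The tree count and `min_samples_leaf` are the values this README quotes for the Random Forest.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Toy corpus for illustration only; this is NOT the project's dataset.
texts = [
    "Win a free prize now, click here",
    "Free cash offer, claim your prize today",
    "Urgent: claim your free reward now",
    "Limited offer, win cash now",
    "Are we still meeting for lunch tomorrow?",
    "Please review the attached project report",
    "See you at the team meeting on Monday",
    "Thanks for sending the lecture notes",
]
labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

spam_pipeline = Pipeline([
    # Unigrams and bigrams, capped at 5,000 features.
    ("tfidf", TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
    # 300 trees with min_samples_leaf=2; random_state is an assumption
    # added here only to make the sketch reproducible.
    ("clf", RandomForestClassifier(n_estimators=300, min_samples_leaf=2,
                                   random_state=42)),
])
spam_pipeline.fit(texts, labels)

# Classify two unseen messages.
predictions = spam_pipeline.predict([
    "Claim your free prize now",
    "See you at the meeting tomorrow",
])
print(list(predictions))
```

Wrapping the vectoriser and classifier in one `Pipeline` keeps the fitted vocabulary and the model together, so a single object can be saved with joblib and reused for inference.
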
| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| ★ Random Forest | 98.21% | 98.14% | 98.08% | 98.11% |
| Logistic Regression CV | 97.84% | 97.34% | 98.12% | 97.73% |
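★ Best model on this dataset by accuracy, precision, and F1; Logistic Regression CV has marginally higher recall (98.12% vs 98.08%). Both models were trained on a 70/30 stratified train/test split with 5-fold cross-validation.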
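
For readers who want to check how the four columns relate, the sketch below computes accuracy, precision, recall, and F1 from confusion-matrix counts, then recomputes F1 from the precision and recall reported in the table. The function names and the counts are hypothetical and are not taken from `evaluator.py` or the paper. It also assumes the reported F1 is the harmonic mean of the reported precision and recall, which may not hold exactly if the paper averages metrics across classes.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units in and out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def metrics_from_counts(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall and F1 for the positive (spam) class."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1_from_precision_recall(precision, recall),
    }


# Made-up counts, purely to show the arithmetic.
example = metrics_from_counts(tp=90, fp=5, fn=10, tn=95)
print({name: round(value, 4) for name, value in example.items()})

# Recompute F1 from the precision/recall pairs in the table above.
for model, precision, recall in [
    ("Random Forest", 98.14, 98.08),
    ("Logistic Regression CV", 97.34, 98.12),
]:
    print(f"{model}: F1 = {f1_from_precision_recall(precision, recall):.2f}%")
```

Under that assumption, the recomputed values round to 98.11% and 97.73%, matching the F1 column.
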
Email-Spam-Detector/
├── app.py ← Streamlit web application
├── train.py ← Model training script
├── requirements.txt ← Python dependencies
├── packages.txt ← System packages (Tesseract OCR)
├── data/
│   └── dataset.csv ← Email dataset (spam/ham); see the dataset note below to download the full dataset
├── models/ ← Saved model files (generated after training)
└── src/
    ├── preprocessor.py ← Text cleaning pipeline
    ├── trainer.py ← Model definitions and persistence
    ├── evaluator.py ← Metrics and evaluation reports
    └── ocr_extractor.py ← Image OCR for multimodal spam
Note: The dataset included in this repository is a reduced version (5,572 messages from the UCI SMS Spam Collection) because of GitHub's 100 MB file size limit. The full research dataset used in the original paper contains significantly more samples and cannot be hosted directly on GitHub.

| Version | Samples | Source | Usage |
|---|---|---|---|
| ✅ GitHub (included) | 5,572 messages | UCI SMS Spam Collection | Auto-downloaded on first app launch |
| 📦 Full Dataset | Large-scale | Original research dataset | For full replication of paper results |
⬇️ Download Full Dataset:
👉 https://drive.google.com/file/d/1THnJXB7qnWkphDIc9z0L2TgMGX240XY1/view?usp=sharing
## ⚙️ How to Run Locally
### 1. Clone the repository
```bash
git clone https://github.com/ghulammuhayyudin1003/Email-Spam-Detector.git
cd Email-Spam-Detector
```
### 2. Install Python dependencies
```bash
pip install -r requirements.txt
```
### 3. Install Tesseract OCR
- Windows: Download installer
- macOS: `brew install tesseract`
- Linux: `sudo apt-get install tesseract-ocr`
### 4. Train the models
```bash
python train.py
```
### 5. Launch the app
```bash
streamlit run app.py
```
Open http://localhost:8501 in your browser.

✅ Live on Streamlit Community Cloud (free) 🔗 https://email-spam-detector-1003.streamlit.app

- `requirements.txt` → auto pip installs all Python packages
- `packages.txt` → auto installs the `tesseract-ocr` system binary
- Dataset auto-downloads on first launch (UCI SMS Spam Collection)
- Models auto-train on first launch (~2-3 minutes)
This project is the practical implementation of the paper's key findings:
- Why Random Forest? Ensemble of 300 trees with `min_samples_leaf=2` produces the lowest variance across folds (std=0.0058), confirming the paper's robustness analysis.
- Why Logistic Regression CV? Built-in cross-validated regularisation eliminates manual hyperparameter tuning. Strong alternative to Random Forest with 97.84% accuracy.
- Why TF-IDF over word embeddings? The paper confirmed TF-IDF achieves near-identical accuracy to more complex representations on this domain, with far lower inference cost.
| Tool | Purpose |
|---|---|
| Python 3.11 | Core language |
| scikit-learn | ML models + TF-IDF |
| NLTK | Text preprocessing |
| Tesseract + pytesseract | OCR for image spam |
| Streamlit | Web interface |
| joblib | Model serialisation |
| pandas | Data handling |
Ghulam Muhayyu Din, Computer Science Undergraduate

- GitHub: @ghulammuhayyudin1003
- Google Scholar: https://scholar.google.com/citations?user=2H5SwVkAAAAJ&hl=en&authuser=2&oi=ao