📨 Spam Detector — Weekend ML #2

This is the third step in my weekend ML “trilogy.” After building a K‑Drama Recommender and an F1 Winner Predictor, I wanted to understand how email spam filters work and build one myself.

Motivation & Story

I’m deeply interested in Data Science and Machine Learning, especially since starting the Oracle University courses. I had already built a K‑Drama Recommender (to find titles similar to the ones I love) and an F1 Winner Predictor (now being improved with more datasets). When I learned that email spam filters are powered by ML, I wanted to see how one works end‑to‑end and build my own. This repo is the result: a clean, reproducible text‑classification pipeline that turns raw emails into a binary decision: SPAM or Not SPAM.

Dataset: I used the Spam Mails Dataset by venky73 on Kaggle: https://www.kaggle.com/datasets/venky73/spam-mails-dataset
Please check the dataset page for license terms before (re)distributing the data.

What This Project Does

Cleans and preprocesses email text (lowercasing, punctuation and digit removal, stopword removal, stemming).
Converts text to numeric features using TF‑IDF.
Trains a Multinomial Naive Bayes classifier.
Evaluates with accuracy and classification_report (precision/recall/F1).
Persists artifacts with joblib.dump: spam_detector_model.pkl (model) and tfidf_vectorizer.pkl (vectorizer).
Provides a simple prediction function predict_new_email(text) to classify new emails using the saved artifacts.
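
As a rough illustration of the cleaning step in the first bullet, here is a minimal sketch. It is not the code from main.py: the function name preprocess_text_sketch, the regular expression, and the tiny inline stopword set are assumptions chosen so the example runs offline (the project itself uses NLTK's English stopword list).

```python
import re

from nltk.stem import PorterStemmer

# Assumption: a tiny inline stopword set so the sketch runs without downloads.
# The project itself uses NLTK's English stopword list.
STOPWORDS = {"a", "an", "and", "are", "have", "is", "the", "to", "you", "your"}

_stemmer = PorterStemmer()


def preprocess_text_sketch(text: str) -> str:
    """Lowercase, strip non-letters, drop stopwords, then stem each token."""
    text = text.lower()
    # Replace anything that is not a-z or whitespace (punctuation, digits) with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(_stemmer.stem(tok) for tok in tokens)


cleaned = preprocess_text_sketch("You have WON 1000 prizes!!! Claim your reward.")
print(cleaned)
```

The important property is that the same function runs at training time and at prediction time, so the vectorizer always sees text in the same normalized form.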

Project Structure

.
├── main.py                  # training + evaluation + quick tests
├── spam_ham_dataset.csv     # local dataset (see Kaggle link above for license)
├── spam_detector_model.pkl  # trained model (generated by main.py)
└── tfidf_vectorizer.pkl     # TF‑IDF vectorizer (generated by main.py)

Getting Started

Requirements

Python 3.10+ (also tested on 3.13)
pip

1) Clone & enter the project

git clone https://github.com/priscillalea/spam-detector.git
cd spam-detector

2) (Optional) Create a virtual environment

python -m venv .venv
# Windows
.venv\Scripts\activate
# Linux/macOS
source .venv/bin/activate

3) Install dependencies

pip install -U pandas scikit-learn nltk joblib

The script downloads NLTK stopwords on first run.

4) Place the dataset spam_ham_dataset.csv in the project root.

5) Train & evaluate

python main.py

This will print the TF‑IDF shape, accuracy and a classification report, and it will save two files: spam_detector_model.pkl and tfidf_vectorizer.pkl.

Usage

A) Use the helper function (recommended)

from main import predict_new_email

print(predict_new_email("Congratulations! You've won a free prize — click here!"))
print(predict_new_email("Hi team, please find the updated report attached. Thanks!"))

predict_new_email loads the saved model and vectorizer, applies the same preprocessing and returns "SPAM" or "Not SPAM".

B) Load artifacts manually

import joblib
from main import preprocess_text

model = joblib.load("spam_detector_model.pkl")
vectorizer = joblib.load("tfidf_vectorizer.pkl")

X = vectorizer.transform([preprocess_text("URGENT! Your account was compromised. Log in now!")])
print("SPAM" if model.predict(X)[0] == 1 else "Not SPAM")

How It Works

Label encoding
The dataset’s label column (spam/ham) is mapped to numeric label_num (1/0).
Preprocessing
- lowercasing
- removing punctuation and digits
- removing English stopwords (NLTK)
- stemming with PorterStemmer
Vectorization
TfidfVectorizer is fit on the preprocessed text and transforms each message into a sparse vector.
Train/test split
75/25 split with random_state=42 for reproducibility.
Model
MultinomialNB, a strong baseline for text classification.
Evaluation
accuracy_score and classification_report (per‑class precision/recall/F1).
Persistence
joblib.dump stores the trained model and vectorizer for later use; the prediction helper reloads them on demand.
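
The steps above can be sketched end to end as follows. This is an illustration under stated assumptions, not a copy of main.py: the toy DataFrame stands in for spam_ham_dataset.csv, its text column name is hypothetical, the text is assumed to be already preprocessed, and artifacts are written to a temporary directory instead of the project root.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for spam_ham_dataset.csv (already preprocessed).
df = pd.DataFrame({
    "label": ["spam", "ham", "spam", "ham", "spam", "ham", "spam", "ham"],
    "text": [
        "win free prize click now",
        "meeting agenda attached for monday",
        "urgent account compromised log in now",
        "please review the updated report",
        "free cash offer claim prize today",
        "lunch tomorrow with the project team",
        "click now to claim free reward",
        "thanks for sending the invoice report",
    ],
})

# Label encoding: spam -> 1, ham -> 0.
df["label_num"] = df["label"].map({"ham": 0, "spam": 1})

# Vectorization: fit TF-IDF on the text, following the order of the steps above.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df["text"])
y = df["label_num"]

# 75/25 split with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Model and evaluation.
model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("TF-IDF shape:", X.shape)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(
    y_test, y_pred, labels=[0, 1], target_names=["ham", "spam"], zero_division=0
))

# Persistence: a temporary directory is used here to avoid touching real artifacts.
out_dir = tempfile.mkdtemp()
model_path = os.path.join(out_dir, "spam_detector_model.pkl")
vectorizer_path = os.path.join(out_dir, "tfidf_vectorizer.pkl")
joblib.dump(model, model_path)
joblib.dump(vectorizer, vectorizer_path)

# Reload and classify one new message, as a prediction helper might.
reloaded_model = joblib.load(model_path)
reloaded_vectorizer = joblib.load(vectorizer_path)
new_X = reloaded_vectorizer.transform(["claim free prize now"])
print("SPAM" if reloaded_model.predict(new_X)[0] == 1 else "Not SPAM")
```

With only eight toy messages the printed metrics are meaningless; the sketch is only meant to show how the pieces connect.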

Results (Baseline)

Exact numbers depend on the train/test split and preprocessing choices, so they will shift if you change the seed or the cleaning steps. The goal here is a clean, explainable baseline that you can iterate on. Run python main.py and check the console output for the full classification report.

Roadmap

Replace stemming with lemmatization and compare.
Try n‑grams in TF‑IDF (e.g., (1,2) and (1,3)).
Address class imbalance (oversampling or class weights).
Compare models: Logistic Regression, Linear SVM, Complement NB.
Add cross‑validation and hyperparameter tuning.
Calibrate probabilities and adjust decision thresholds (optimize recall for the spam class).
Expose a small API (Flask/FastAPI) or CLI.
Add a notebook for EDA and error analysis.
Publish a short Model Card (limitations, risks, bias).
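
As one possible starting point for the n‑gram and decision-threshold items, here is a small sketch. Nothing in it exists in the repo yet: the toy data, the helper name predict_with_threshold, and the threshold values are all hypothetical, and the probabilities are not calibrated.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; 1 = spam, 0 = ham.
texts = [
    "win free prize click now",
    "meeting agenda attached for monday",
    "urgent account compromised log in now",
    "please review the updated report",
    "free cash offer claim prize today",
    "lunch tomorrow with the project team",
]
labels = [1, 0, 1, 0, 1, 0]

# Unigrams plus bigrams, one of the n-gram settings listed in the roadmap.
pipeline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
pipeline.fit(texts, labels)


def predict_with_threshold(model, messages, threshold=0.5):
    """Flag a message as spam (1) when its predicted P(spam) >= threshold."""
    spam_col = list(model.classes_).index(1)
    spam_proba = model.predict_proba(messages)[:, spam_col]
    return (spam_proba >= threshold).astype(int)


messages = ["free prize inside", "see you at the meeting"]
# A lower threshold flags more messages as spam (higher recall, lower precision).
print(predict_with_threshold(pipeline, messages, threshold=0.3))
print(predict_with_threshold(pipeline, messages, threshold=0.7))
```

Lowering the threshold can only keep or increase the number of messages flagged as spam, which is the lever for trading precision against spam recall.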

How This Fits My Learning Path

K‑Drama Recommender → text representation & semantic similarity.
F1 Winner Predictor → pipelines on tabular/time‑based data, metrics and validation.
Spam Detector → supervised NLP with a classic TF‑IDF + NB pipeline.

Together, these weekend projects show my progression in collecting, cleaning, modeling, evaluating, and — most importantly — telling the story behind the code.

FAQ

Is this production‑ready?
No — this is an educational baseline. A production system would require stronger evaluation, monitoring, security, privacy, and continuous retraining strategies.

Can I use another language (e.g., Portuguese emails)?
Yes, but you should adapt stopwords and potentially the preprocessing/tokenization accordingly.
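
For example, a Portuguese variant of the preprocessing could look roughly like the sketch below. It is an assumption about how the adaptation might be done, not code from this repo: preprocess_text_pt and the tiny inline stopword set are hypothetical (NLTK's Portuguese stopword list could replace the latter), and NLTK's SnowballStemmer is swapped in for PorterStemmer, which targets English.

```python
import re

from nltk.stem import SnowballStemmer

# Assumption: a tiny inline Portuguese stopword set so the sketch runs offline.
PT_STOPWORDS = {"a", "o", "e", "de", "que", "para", "um", "uma", "você", "aqui"}

_pt_stemmer = SnowballStemmer("portuguese")


def preprocess_text_pt(text: str) -> str:
    """Lowercase, keep letter-only tokens (accents included), drop stopwords, stem."""
    # [^\W\d_]+ matches runs of Unicode letters, so "prêmios" survives intact
    # while digits and punctuation are discarded.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    kept = [tok for tok in tokens if tok not in PT_STOPWORDS]
    return " ".join(_pt_stemmer.stem(tok) for tok in kept)


print(preprocess_text_pt("Você ganhou 1000 prêmios! Clique aqui"))
```

Note the tokenizer change: a pattern restricted to a-z would strip accented characters, which matters for Portuguese.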

Where do I ask questions or report issues?
Open an issue in the repository. I’m happy to improve this project with feedback!

Contributing

Issues and pull requests are welcome. If you plan a larger contribution, please open an issue first to discuss scope and direction.

License

Code is released under the MIT License.
The dataset is not included; please follow the Kaggle license and usage terms on the dataset page linked above.

Acknowledgments

Oracle University for sparking this learning path.
Kaggle and venky73 for the dataset.
The NLP/ML community for the countless references and ideas.

Contact

Priscilla Leandro — LinkedIn · GitHub
