priscillalea/spam-detector


📨 Spam Detector — Weekend ML #2

The third step in my weekend ML “trilogy.” After building a K‑Drama Recommender and an F1 Winner Predictor, I wanted to understand how email spam filters work and build one myself.



Motivation & Story

I’m deeply interested in Data Science and Machine Learning, especially since starting the Oracle University courses. So far I’ve built a K‑Drama Recommender (to find titles similar to the ones I love) and an F1 Winner Predictor (now being improved with more datasets). When I learned that email spam filters are powered by ML, I wanted to see how one works end‑to‑end and build my own. This repo is the result: a clean, reproducible text‑classification pipeline that turns raw emails into a binary decision, SPAM or Not SPAM.

Dataset: I used the Spam Mails Dataset by venky73 on Kaggle: https://www.kaggle.com/datasets/venky73/spam-mails-dataset
Please check the dataset page for license terms before (re)distributing the data.


What This Project Does

  • Cleans and preprocesses email text (lowercasing, removal of punctuation and digits, stopword removal, stemming).
  • Converts text to numeric features using TF‑IDF.
  • Trains a Multinomial Naive Bayes classifier.
  • Evaluates with accuracy and classification_report (precision/recall/F1).
  • Persists artifacts with joblib.dump:
    spam_detector_model.pkl (model) and tfidf_vectorizer.pkl (vectorizer).
  • Provides a simple prediction function predict_new_email(text) to classify new emails using the saved artifacts.
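The cleaning step in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code from main.py: the function name `preprocess_text_sketch` is hypothetical, the small `STOPWORDS` set is a stand-in for `nltk.corpus.stopwords.words("english")` so that nothing has to be downloaded, and the sketch falls back to no stemming if NLTK is not installed.

```python
import re

try:
    from nltk.stem import PorterStemmer  # the stemmer itself needs no corpus download
    stem = PorterStemmer().stem
except ImportError:
    def stem(word):
        # Fallback so the sketch still runs without NLTK: leave words unstemmed.
        return word

# Assumption: a tiny stand-in for nltk.corpus.stopwords.words("english").
STOPWORDS = {
    "the", "a", "an", "and", "or", "to", "of", "in", "is", "are",
    "you", "your", "for", "on", "this", "that", "it", "be", "was", "with",
}


def preprocess_text_sketch(text):
    """Lowercase, drop punctuation and digits, remove stopwords, stem."""
    text = text.lower()
    # Keep only ASCII letters and whitespace; everything else becomes a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [token for token in text.split() if token not in STOPWORDS]
    return " ".join(stem(token) for token in tokens)


print(preprocess_text_sketch("The WINNER gets 1000 dollars!!!"))
```

Note that the regular expression also strips accented letters, which is fine for English emails but would need adjusting for other languages.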

Project Structure

.
├── main.py                  # training + evaluation + quick tests
├── spam_ham_dataset.csv     # local dataset (see Kaggle link above for license)
├── spam_detector_model.pkl  # trained model (generated by main.py)
└── tfidf_vectorizer.pkl     # TF‑IDF vectorizer (generated by main.py)

Getting Started

Requirements

  • Python 3.10+ (also tested on 3.13)
  • pip

1) Clone & enter the project

git clone https://github.com/priscillalea/spam-detector.git
cd spam-detector

2) (Optional) Create a virtual environment

python -m venv .venv
# Windows
.venv\Scripts\activate
# Linux/macOS
source .venv/bin/activate

3) Install dependencies

pip install -U pandas scikit-learn nltk joblib

The script downloads NLTK stopwords on first run.

4) Place the dataset spam_ham_dataset.csv in the project root.

5) Train & evaluate

python main.py

This will print the TF‑IDF shape, accuracy and a classification report, and it will save two files: spam_detector_model.pkl and tfidf_vectorizer.pkl.


Usage

A) Use the helper function (recommended)

from main import predict_new_email

print(predict_new_email("Congratulations! You've won a free prize — click here!"))
print(predict_new_email("Hi team, please find the updated report attached. Thanks!"))

predict_new_email loads the saved model and vectorizer, applies the same preprocessing and returns "SPAM" or "Not SPAM".
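For readers curious what such a helper might look like, here is a self-contained sketch of the save/reload/predict cycle. It is an assumption about the general shape, not the actual code in main.py: the toy training texts are made up, `predict_new_email_sketch` is a hypothetical name, the artifacts go to a temporary directory, and the stopword/stemming step is left out for brevity.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data, made up for illustration; 1 = spam, 0 = ham.
texts = [
    "win a free prize now",
    "free money click here",
    "claim your free prize today",
    "meeting moved to monday",
    "please review the attached report",
    "lunch tomorrow with the team",
]
labels = [1, 1, 1, 0, 0, 0]

vectorizer = TfidfVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)

# Persist both artifacts; the vectorizer must be saved with the model,
# because the model only understands the vocabulary it was trained on.
artifact_dir = tempfile.mkdtemp()
model_path = os.path.join(artifact_dir, "spam_detector_model.pkl")
vectorizer_path = os.path.join(artifact_dir, "tfidf_vectorizer.pkl")
joblib.dump(model, model_path)
joblib.dump(vectorizer, vectorizer_path)


def predict_new_email_sketch(text):
    """Reload the saved artifacts and classify a single email."""
    loaded_model = joblib.load(model_path)
    loaded_vectorizer = joblib.load(vectorizer_path)
    # transform (not fit_transform): reuse the vocabulary learned in training.
    features = loaded_vectorizer.transform([text])
    return "SPAM" if loaded_model.predict(features)[0] == 1 else "Not SPAM"


print(predict_new_email_sketch("Claim your free prize now"))
print(predict_new_email_sketch("Please review the report"))
```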

B) Load artifacts manually

import joblib
from main import preprocess_text

model = joblib.load("spam_detector_model.pkl")
vectorizer = joblib.load("tfidf_vectorizer.pkl")

X = vectorizer.transform([preprocess_text("URGENT! Your account was compromised. Log in now!")])
print("SPAM" if model.predict(X)[0] == 1 else "Not SPAM")

How It Works

  1. Label encoding
    The dataset’s label column (spam/ham) is mapped to numeric label_num (1/0).

  2. Preprocessing

    • lowercasing
    • removing punctuation and digits
    • removing English stopwords (NLTK)
    • stemming with PorterStemmer
  3. Vectorization
    TfidfVectorizer is fit on the preprocessed text and transforms each message into a sparse vector.

  4. Train/test split
    75/25 split with random_state=42 for reproducibility.

  5. Model
    MultinomialNB, a strong baseline for text classification.

  6. Evaluation
    accuracy_score and classification_report (per‑class precision/recall/F1).

  7. Persistence
    joblib.dump stores the trained model and vectorizer for later use; the prediction helper reloads them on demand.
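Steps 1 and 3 to 6 above can be sketched on a toy DataFrame as shown below. This is an illustration under assumptions rather than the contents of main.py: the rows are made up, the `text` column name is assumed, and `stratify` is added only so that the two-row test split of this tiny dataset contains both classes.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for spam_ham_dataset.csv (rows invented for illustration).
df = pd.DataFrame({
    "label": ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"],
    "text": [
        "win a free prize now",
        "free money click here",
        "claim your free vacation today",
        "urgent offer click to win cash",
        "meeting moved to monday",
        "please review the attached report",
        "lunch tomorrow with the team",
        "notes from the project call",
    ],
})

# 1. Label encoding: spam -> 1, ham -> 0.
df["label_num"] = df["label"].map({"ham": 0, "spam": 1})

# 3. Vectorization, fitted on the full corpus to follow the step order above.
#    Fitting on the training split only would avoid leaking test vocabulary.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df["text"])
print("TF-IDF shape:", X.shape)

# 4. 75/25 split with a fixed seed (stratify is an addition for this toy data).
X_train, X_test, y_train, y_test = train_test_split(
    X, df["label_num"], test_size=0.25, random_state=42, stratify=df["label_num"]
)

# 5. Model.
model = MultinomialNB()
model.fit(X_train, y_train)

# 6. Evaluation.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print(classification_report(
    y_test, y_pred, labels=[0, 1], target_names=["ham", "spam"], zero_division=0
))
```

With only two test rows the printed metrics mean little; the point is the order of the calls.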


Results (Baseline)

Your exact numbers depend on the train/test split, the preprocessing, and library versions; with the fixed random_state=42 the split itself is reproducible. The goal here is a clean, explainable baseline that you can iterate on. Check the console output after python main.py for the full classification report.


Roadmap

  • Replace stemming with lemmatization and compare.
  • Try n‑grams in TF‑IDF (e.g., (1,2) and (1,3)).
  • Address class imbalance (oversampling or class weights).
  • Compare models: Logistic Regression, Linear SVM, Complement NB.
  • Add cross‑validation and hyperparameter tuning.
  • Calibrate probabilities and adjust decision thresholds (optimize Recall for the spam class).
  • Expose a small API (Flask/FastAPI) or CLI.
  • Add a notebook for EDA and error analysis.
  • Publish a short Model Card (limitations, risks, bias).
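As a starting point for the n‑gram, model comparison, and cross‑validation items, one possible sketch is shown below. It is not part of the repository: the toy texts are made up, and the two candidate configurations are arbitrary picks from the list above. Wrapping the vectorizer and classifier in a scikit-learn pipeline means TF‑IDF is refitted inside each fold, so no validation text leaks into the vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import ComplementNB, MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data, invented for illustration; 1 = spam, 0 = ham.
texts = [
    "win a free prize now",
    "free money click here",
    "claim your free vacation today",
    "urgent offer click to win cash",
    "cheap loans approved instantly",
    "you have won a free gift card",
    "meeting moved to monday",
    "please review the attached report",
    "lunch tomorrow with the team",
    "notes from the project call",
    "can you send the invoice again",
    "the schedule for next week is attached",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

candidates = {
    "MultinomialNB, unigrams": make_pipeline(
        TfidfVectorizer(), MultinomialNB()
    ),
    "ComplementNB, uni+bigrams": make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)), ComplementNB()
    ),
}

scores = {}
for name, pipeline in candidates.items():
    # scoring="recall" measures recall for the positive class (label 1, spam).
    fold_scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="recall")
    scores[name] = fold_scores.mean()
    print(f"{name}: mean spam recall = {scores[name]:.2f}")
```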

How This Fits My Learning Path

  • K‑Drama Recommender → text representation & semantic similarity.
  • F1 Winner Predictor → pipelines on tabular/time‑based data, metrics and validation.
  • Spam Detector → supervised NLP with a classic TF‑IDF + NB pipeline.

Together, these weekend projects show my progression in collecting, cleaning, modeling, evaluating, and — most importantly — telling the story behind the code.


FAQ

Is this production‑ready?
No — this is an educational baseline. A production system would require stronger evaluation, monitoring, security, privacy, and continuous retraining strategies.

Can I use another language (e.g., Portuguese emails)?
Yes, but you should adapt stopwords and potentially the preprocessing/tokenization accordingly.
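A rough sketch of that adaptation for Portuguese, assuming you let TfidfVectorizer handle stopwords and accents directly: the short `PT_STOPWORDS` list and the two sample emails are illustrative only. In practice you would more likely start from `nltk.corpus.stopwords.words("portuguese")` and swap PorterStemmer for a Portuguese stemmer such as NLTK's `SnowballStemmer("portuguese")`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumption: a tiny hand-picked stopword list, for illustration only.
PT_STOPWORDS = ["de", "que", "para", "um", "uma", "em", "da", "do", "os", "as"]

emails = [
    "Parabéns! Você ganhou um prêmio grátis",
    "Segue em anexo o relatório da reunião",
]

# strip_accents="unicode" folds "prêmio" into "premio" before tokenizing,
# so accented and unaccented spellings share one feature.
vectorizer = TfidfVectorizer(stop_words=PT_STOPWORDS, strip_accents="unicode")
X = vectorizer.fit_transform(emails)

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
```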

Where do I ask questions or report issues?
Open an issue in the repository. I’m happy to improve this project with feedback!


Contributing

Issues and pull requests are welcome. If you plan a larger contribution, please open an issue first to discuss scope and direction.


License

Code is released under the MIT License.
The dataset is not included; please follow the Kaggle license and usage terms on the dataset page linked above.


Acknowledgments

  • Oracle University for sparking this learning path.
  • Kaggle and venky73 for the dataset.
  • The NLP/ML community for the countless references and ideas.

Contact

Priscilla Leandro: LinkedIn · GitHub
