The third step in my weekend ML “trilogy.” After building a K‑Drama Recommender and an F1 Winner Predictor, I wanted to understand how email spam filters work and build one myself.
- Motivation & Story
- What This Project Does
- Project Structure
- Getting Started
- Usage
- How It Works
- Results (Baseline)
- Roadmap
- How This Fits My Learning Path
- FAQ
- Contributing
- License
- Acknowledgments
- Contact
I’m deeply interested in Data Science and Machine Learning, especially since starting the Oracle University courses. So far, I’ve built a K‑Drama Recommender (to find titles similar to the ones I love) and an F1 Winner Predictor (now being improved with more datasets). When I learned that email spam filters are powered by ML, I got excited to see how they work end‑to‑end and to build my own. This repo is the result: a clean, reproducible text‑classification pipeline that turns raw emails into a binary decision, SPAM or Not SPAM.
Dataset: I used the Spam Mails Dataset by venky73 on Kaggle: https://www.kaggle.com/datasets/venky73/spam-mails-dataset
Please check the dataset page for license terms before (re)distributing the data.
- Cleans and preprocesses email text (lowercase, symbols removal, stopwords, stemming).
- Converts text to numeric features using TF‑IDF.
- Trains a Multinomial Naive Bayes classifier.
- Evaluates with accuracy and classification_report (precision/recall/F1).
- Persists artifacts with joblib.dump: spam_detector_model.pkl (model) and tfidf_vectorizer.pkl (vectorizer).
- Provides a simple prediction function predict_new_email(text) to classify new emails using the saved artifacts.
.
├── main.py # training + evaluation + quick tests
├── spam_ham_dataset.csv # local dataset (see Kaggle link above for license)
├── spam_detector_model.pkl # trained model (generated by main.py)
└── tfidf_vectorizer.pkl # TF‑IDF vectorizer (generated by main.py)
Requirements
- Python 3.10+ (also tested on 3.13)
- pip
1) Clone & enter the project
git clone https://github.com/priscillalea/spam-detector.git
cd spam-detector
2) (Optional) Create a virtual environment
python -m venv .venv
# Windows
.venv\Scripts\activate
# Linux/macOS
source .venv/bin/activate
3) Install dependencies
pip install -U pandas scikit-learn nltk joblib
The script downloads NLTK stopwords on first run.
4) Place the dataset spam_ham_dataset.csv in the project root.
5) Train & evaluate
python main.py
This prints the TF‑IDF shape, the accuracy, and a classification report, and saves two files: spam_detector_model.pkl and tfidf_vectorizer.pkl.
from main import predict_new_email
print(predict_new_email("Congratulations! You've won a free prize — click here!"))
print(predict_new_email("Hi team, please find the updated report attached. Thanks!"))
predict_new_email loads the saved model and vectorizer, applies the same preprocessing, and returns "SPAM" or "Not SPAM".
import joblib
from main import preprocess_text
model = joblib.load("spam_detector_model.pkl")
vectorizer = joblib.load("tfidf_vectorizer.pkl")
X = vectorizer.transform([preprocess_text("URGENT! Your account was compromised. Log in now!")])
print("SPAM" if model.predict(X)[0] == 1 else "Not SPAM")
- Label encoding
  The dataset’s label column (spam/ham) is mapped to numeric label_num (1/0).
- Preprocessing
  - lowercasing
  - removing punctuation and digits
  - removing English stopwords (NLTK)
  - stemming with PorterStemmer
- Vectorization
  TfidfVectorizer is fit on the preprocessed text and transforms each message into a sparse vector.
- Train/test split
  75/25 split with random_state=42 for reproducibility.
- Model
  MultinomialNB, a strong baseline for text classification.
- Evaluation
  accuracy_score and classification_report (per‑class precision/recall/F1).
- Persistence
  joblib.dump stores the trained model and vectorizer for later use; the prediction helper reloads them on demand.
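The steps above can be sketched end to end on a tiny in-memory corpus. This is a minimal sketch, not the code in main.py: the toy emails, the preprocess_sketch helper, and the use of scikit-learn's built-in English stopword list (instead of NLTK's) are assumptions made so the example runs without any downloads, and stemming is left out.

```python
import re

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


def preprocess_sketch(text: str) -> str:
    """Hypothetical stand-in for preprocess_text: lowercase, keep letters, drop stopwords."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(word for word in text.split() if word not in ENGLISH_STOP_WORDS)


# Made-up toy corpus (NOT the Kaggle data): 1 = spam, 0 = ham.
emails = [
    "Congratulations! You've won a free prize, click here!",
    "URGENT: claim your free cash reward now",
    "Win money fast, limited offer, click now",
    "Free gift card waiting, claim your prize today",
    "Hi team, please find the updated report attached.",
    "Can we move the meeting to Thursday afternoon?",
    "Here are the notes from the project review.",
    "Thanks for the report, see you at the meeting.",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

cleaned = [preprocess_sketch(email) for email in emails]

# Vectorize first, then split 75/25 with a fixed seed, following the order above.
# stratify is an addition for this tiny corpus so both classes land in the test set.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(cleaned)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42, stratify=labels
)

model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("TF-IDF shape:", X.shape)
print("Accuracy:", accuracy)
print(classification_report(y_test, y_pred, zero_division=0))
```

With only eight emails the scores say nothing about real performance; the point is the shape of the pipeline.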
Exact numbers depend on the train/test split and preprocessing choices. The goal here is a clean, explainable baseline that you can iterate on. Check the console output after running python main.py for the full classification report.
- Replace stemming with lemmatization and compare.
- Try n‑grams in TF‑IDF (e.g., (1,2) and (1,3)).
- Address class imbalance (oversampling or class weights).
- Compare models: Logistic Regression, Linear SVM, Complement NB.
- Add cross‑validation and hyperparameter tuning.
- Calibrate probabilities and adjust decision thresholds (optimize Recall for the spam class).
- Expose a small API (Flask/FastAPI) or CLI.
- Add a notebook for EDA and error analysis.
- Publish a short Model Card (limitations, risks, bias).
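Several of these roadmap items (n‑grams, class weights, model comparison, cross‑validation) could be explored together. The following is only a sketch of one possible approach, not something already in this repo: the toy corpus is made up, and the candidate settings (bigrams, balanced class weights, 3 folds, recall on the spam class as the metric) are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import ComplementNB, MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up toy corpus (NOT the Kaggle data): 1 = spam, 0 = ham.
emails = [
    "win a free prize now",
    "claim your cash reward today",
    "free money click here",
    "limited offer win cash",
    "urgent claim your free gift",
    "click now for a free reward",
    "please find the report attached",
    "meeting moved to thursday afternoon",
    "notes from the project review",
    "thanks for the updated report",
    "see you at the team meeting",
    "can you review the attached notes",
]
labels = [1] * 6 + [0] * 6

candidates = {
    "MultinomialNB": MultinomialNB(),
    "ComplementNB": ComplementNB(),
    "LogisticRegression": LogisticRegression(class_weight="balanced", max_iter=1000),
    "LinearSVC": LinearSVC(class_weight="balanced"),
}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
results = {}
for name, clf in candidates.items():
    # The vectorizer sits inside the pipeline, so it is refit on each training fold.
    pipeline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
    scores = cross_val_score(pipeline, emails, labels, cv=cv, scoring="recall")
    results[name] = scores.mean()

for name, recall in results.items():
    print(f"{name}: mean spam recall = {recall:.2f}")
```

Keeping TfidfVectorizer inside the pipeline means each fold learns its vocabulary from training data only, which keeps the comparison fair.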
- K‑Drama Recommender → text representation & semantic similarity.
- F1 Winner Predictor → pipelines on tabular/time‑based data, metrics and validation.
- Spam Detector → supervised NLP with a classic TF‑IDF + NB pipeline.
Together, these weekend projects show my progression in collecting, cleaning, modeling, evaluating, and — most importantly — telling the story behind the code.
Is this production‑ready?
No — this is an educational baseline. A production system would require stronger evaluation, monitoring, security, privacy, and continuous retraining strategies.
Can I use another language (e.g., Portuguese emails)?
Yes, but you should adapt stopwords and potentially the preprocessing/tokenization accordingly.
Where do I ask questions or report issues?
Open an issue in the repository. I’m happy to improve this project with feedback!
Issues and pull requests are welcome. If you plan a larger contribution, please open an issue first to discuss scope and direction.
Code is released under the MIT License.
The dataset is not included; please follow the Kaggle license and usage terms on the dataset page linked above.
- Oracle University for sparking this learning path.
- Kaggle and venky73 for the dataset.
- The NLP/ML community for the countless references and ideas.