Skip to content

Theo2co/equity-return-prediction-ml

Repository files navigation

Machine-Learning in Finance

This project aims to forecast monthly U.S. stock returns using machine learning, leveraging financial characteristics observable at the time of prediction. The target variable is derived from the monthly CRSP dataset, while predictors come from quarterly Compustat fundamentals and JKP factor characteristics, both offering rich insights into firm behavior and asset pricing dynamics.

📁 Project Structure

├── main.py                 # Main script to run the full pipeline
├── data_preprocessing.py  # Raw-to-ML dataset construction
├── linear_models.py       # Linear regression and shrinkage models (e.g., Ridge, Lasso)
├── mlp_model.py           # Feedforward neural network (MLP)
├── xgboost_model.py       # Gradient boosting model
├── requirements.txt       # Required dependencies
├── data/
│   ├── Predictors/        # Merged predictors and processed X
│   │   ├── ccmxpf_linktable.csv
│   │   ├── CompFirmCharac.csv
│   │   ├── jkp_characteristic.plk
│   │   ├── merged.plk
│   │   └── X.pkl
│   └── Targets/
│       └── monthly_crsp.csv
│       └── y.pkl
├── plots/                 # Output directory for plots
├── project_report.pdf              # Full project report and results

✅ Getting Started

1. Install Dependencies

Ensure you are using Python 3.8+ and install required packages:

pip install -r requirements.txt

It's recommended to use a virtual environment (venv or conda).

2. Prepare the Data

Download the raw data and place raw data in the appropriate folders:

data/
├── Predictors/
│   ├── ccmxpf_linktable.csv
│   ├── CompFirmCharac.csv
│   ├── jkp_characteristic.plk
└── Targets/
    └── monthly_crsp.csv

Processed datasets (merged.pkl, X.pkl, y.pkl) will be automatically generated and saved in the same directories.

3. Run the Pipeline

To execute the full workflow from data preprocessing to model training and evaluation:

python main.py

This will train models, evaluate them, and save results and plots in the plots/ directory.

🧠 Models Implemented

  • OLS, Ridge, and Lasso Regression (linear_models.py)
  • Multi-Layer Perceptron (MLP) (mlp_model.py)
  • XGBoost Regressor (xgboost_model.py)

📊 Outputs

  • Plots of predicted vs. actual returns
  • IC Distribution, Rolling IC
  • All outputs saved in the plots/ folder

About

ML pipeline for monthly U.S. equity return prediction using CRSP / Compustat / JKP factor characteristics. Implements OLS, Ridge, Lasso, XGBoost, and MLP models with rolling-window evaluation and IC analysis.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages