Skip to content

santiguti/LLM-Sports-Quant

Repository files navigation

LLM Sports Quant

An educational project exploring how LLMs can be prompt-engineered into quantitative sports betting analysts — backed by Python-native mathematical governance to prevent hallucinated bets.

The system finds positive Expected Value (+EV) by comparing bookmaker odds against Python-computed true probabilities derived from real match data (form, H2H, Poisson-modeled expected goals). It uses Fractional Kelly Criterion to manage a virtual $1,000 bankroll.

Tech Stack

  • Language & Database: Python 3.13, PostgreSQL, SQLAlchemy ORM
  • AI Engine: Local or remote LLM inference via OpenAI-compatible API (Ollama, Claude, etc.)
  • Data Sources:
    • The Odds API — Live odds across bookmakers (h2h, totals) for Argentine Primera, Copa Libertadores
    • SofaScore — Real-time team form (last 10 matches, W/D/L, goals) and H2H history via cloudscraper
  • Key Dependencies: cloudscraper, openai, sqlalchemy, python-dotenv

How It Works

1. Market Sweeping

Pulls live odds for fixtures within 72 hours across configured leagues using a bulk API call per league (~3 API tokens per full run).

2. Data Gathering (SofaScore)

Python scrapes real match data directly from SofaScore:

  • Team Form: Last 10 completed matches with scores, W/D/L record, goals scored/conceded, and standard deviation
  • H2H: Cross-references both teams' recent events to find direct meetings

3. Python-Native True Probability (Authoritative)

All probability calculations happen in Python — the LLM never computes these:

Step 1: Form Cross-Reference
   Home Win = avg(home_win%, away_loss%)
   Draw     = avg(home_draw%, away_draw%)
   Away Win = avg(away_win%, home_loss%)

Step 2: H2H Adjustment (dynamic weight: 2 matches=15%, 3=20%, 5+=30%)
   Blended = form × (1-weight) + h2h × weight

Step 3: Home Advantage (+5% home, -5% away)

Step 4: Normalize to sum to 1.0

Step 5: Poisson Distribution for totals
   λ = expected_total_goals
   P(Under X.5) = Poisson CDF(floor(X.5), λ)

4. Python EV Scanner

Python scans every market outcome (home/draw/away/over/under) against its computed true probabilities:

  • Computes EV = true_prob - implied_prob for each outcome
  • Requires EV ≥ +5% to qualify
  • Computes Quarter-Kelly stake, capped at 5% of bankroll

5. LLM Advisory Role

The LLM receives all data and Python's computed probabilities. Its role is advisory — providing qualitative reasoning (injuries context, motivation, etc.). Python is the authority on:

  • ✅ Which outcome to bet
  • ✅ True probability values
  • ✅ Kelly stake sizing
  • ✅ EV threshold enforcement

6. Mathematical Validation & Logging

Every bet is fully traced:

📋 [BET DETAIL] Market: h2h | Odds: 1.73
✅ [+EV FOUND] Rosario Central (Home Win) @ 1.73 | True: 0.550 | Implied: 0.578 | EV: +7.2%
📊 [PYTHON STAKE] Quarter-Kelly 4.25% = $42.45
🎯 BET PLACED: Rosario Central (Home Win) @ odds 1.73 for $42.45 (EV: +7.2%)
💰 Bankroll: $957.55

7. Results Tracking

After matches finish, run check_results.py to:

  • Fetch final scores from SofaScore
  • Mark bets as win/loss
  • Calculate P&L per bet
  • Credit winnings back to bankroll

Running the Project

Prerequisites

  • Python 3.13+ with venv
  • PostgreSQL database
  • Ollama (or any OpenAI-compatible endpoint)

Setup

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Environment Variables (.env)

ODDS_API_KEY=your_key_here
LLM_BASE_URL=http://localhost:11434/v1  # Ollama default
LLM_API_KEY=ollama                       # or your API key
LLM_MODEL=qwen2.5:14b                   # or any model

Run

# Place bets on upcoming matches
python main.py

# Check results after matches finish
python check_results.py

# Test data sources
python tests/test_sources.py

Swap LLM engine

LLM_MODEL=llama3 python main.py

Architecture Safeguards

Gate Rule
Minimum Form Data Requires ≥5 recent matches per team
EV Threshold Rejects any bet with EV < +5%
Fractional Kelly (0.25×) Quarter-Kelly sizing to reduce variance
Bankroll Cap Max 5% of bankroll per bet
Implied Prob Sanity Validates implied_prob ≈ 1/odds
Python Authority True probability computed in Python, never by the LLM

Database Schema

Table Purpose
matches Fixtures with home/away teams, scores, status
market_odds Live bookmaker odds per outcome
bets Placed bets with stake, odds, EV%, true prob, result, P&L
bet_logs Bet lifecycle events (Placed → Resolved)
bankroll Current balance
bankroll_logs Transaction history (bets placed, winnings credited)
betting_reasoning LLM reasoning text per match

Project Structure

├── main.py              # Entry point — fetches odds, scrapes form, places bets
├── agent.py             # LLM prompt engineering and advisory analysis
├── search.py            # SofaScore scraping + Python true probability engine
├── sofa_scraper.py      # Low-level SofaScore API via cloudscraper
├── pipeline.py          # Odds API fetching and parsing
├── check_results.py     # Post-match results resolution and P&L tracking
├── models.py            # SQLAlchemy ORM models
├── database.py          # DB connection and session management
├── tests/
│   └── test_sources.py  # Data source validation tests
├── requirements.txt
└── .env                 # API keys (not committed)

Future Improvements

  • Backtesting: Simulate the pipeline over previous seasons to validate the +EV strategy's long-term ROI.
  • Line Movement Tracking: Store periodic odds snapshots to detect sharp market moves and bet before retail books adjust.
  • Multi-Bookmaker Arbitrage: Scan pricing gaps across bookmakers for risk-free arbitrage opportunities.
  • Enhanced xG: Integrate shot-quality data (Opta/StatsBomb) for more accurate Poisson modeling instead of goals-based averages.
  • Rate Limiting: Add time.sleep() delays or proxy rotation for SofaScore scraping at scale.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages