Text Predictor

A next-word prediction application powered by N-gram language models trained on WikiText-103.

Hey champ, appreciate you giving this work a star! 🌟

Overview

This project implements a real-time text prediction system that suggests the next word as you type. Predictions come from N-gram language models, served by a FastAPI backend to a React frontend. The models are trained on WikiText-103, a dataset of Wikipedia articles containing over 100 million tokens.

Project Structure

text-predictor/
│
├── server/                     # Python backend
│   ├── .venv/                  # Python virtual environment
│   ├── api/                    # FastAPI application
│   │   ├── main.py             # API endpoints
│   │   ├── schemas.py          # Request/response models
│   │   ├── models.py           # ML model loader
│   │   └── requirements.txt    # Python dependencies
│   ├── core/                   # Core ML logic
│   │   ├── train.py            # Model training script
│   │   ├── predict.py          # Prediction engine
│   │   ├── evaluate.py         # Model evaluation
│   │   └── cli.py              # Command-line interface
│   ├── models/                 # Trained models
│   │   ├── trigram_model.pkl
│   │   ├── bigram_model.pkl
│   │   ├── common_words.pkl
│   │   ├── sentence_starters.pkl
│   │   └── evaluation_results.pkl
│   └── data/                   # Training data
│       └── wikitext-103/
│           ├── wiki.train.tokens
│           ├── wiki.test.tokens
│           └── wiki.valid.tokens
│
└── client/                     # React frontend
    ├── node_modules/           # Node dependencies
    ├── src/
    │   ├── components/         # React components
    │   │   ├── Header.tsx
    │   │   ├── TextInput.tsx
    │   │   └── StatsPanel.tsx
    │   ├── services/           # API integration
    │   │   └── api.ts
    │   ├── types/              # TypeScript types
    │   │   └── index.ts
    │   ├── App.tsx             # Main app component
    │   ├── main.tsx            # Entry point
    │   └── index.css           # Global styles
    ├── public/
    ├── index.html
    ├── package.json
    ├── tsconfig.json
    ├── vite.config.ts
    └── tailwind.config.js

Installation

Prerequisites

Python 3.8+
Node.js 18+
pip & npm

Backend Setup

Navigate to server directory:

cd server

Create virtual environment:

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

Install Python dependencies:

cd api
pip install -r requirements.txt

Download WikiText-103 dataset:

Download from Hugging Face or Kaggle
Extract to server/data/wikitext-103/

Train the models (you can tune max_vocab and min_count as desired):

cd ../core
python train.py

This will generate 4 model files in server/models/:

trigram_model.pkl (~50-200 MB)
bigram_model.pkl (~20-50 MB)
common_words.pkl (~1 MB)
sentence_starters.pkl (~1 MB)
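
After training, it can be handy to confirm that all four files were actually written before starting the backend. The following is a minimal sketch, not part of the project: the helper name missing_models is hypothetical, and it only checks that the files exist, not that their contents are valid.

```python
from pathlib import Path
from typing import List

# File names taken from the list above.
EXPECTED_MODELS = [
    "trigram_model.pkl",
    "bigram_model.pkl",
    "common_words.pkl",
    "sentence_starters.pkl",
]


def missing_models(models_dir: str) -> List[str]:
    """Return the expected model files that are not present in models_dir."""
    directory = Path(models_dir)
    return [name for name in EXPECTED_MODELS if not (directory / name).is_file()]


# Example (run from the repository root):
# print(missing_models("server/models"))
```

An empty list means every expected file is present.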

Frontend Setup

Navigate to client directory:

cd ../../client

Install Node dependencies:

npm install

Usage

Running the Application

Terminal 1 - Start Backend:

cd server/api
python main.py

Backend runs on: http://localhost:8000

Terminal 2 - Start Frontend:

cd client
npm run dev

Frontend runs on: http://localhost:5173

Open your browser and navigate to http://localhost:5173

Command-Line Interface

Test predictions directly from terminal:

cd server/core
python cli.py

Model Evaluation

Evaluate model accuracy on test data:

cd server/core
python evaluate.py

Expected metrics:

Top-1 Accuracy: ~18% (first prediction correct)
Top-3 Accuracy: ~35% (correct word in top 3)
Top-5 Accuracy: ~42% (correct word in top 5)
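
To make the metric concrete, here is a small sketch of how top-k accuracy can be computed. It is an illustration under assumptions, not the contents of evaluate.py: the function name, the predict callback signature, and the toy predictor are all hypothetical.

```python
from typing import Callable, List, Sequence


def top_k_accuracy(
    tokens: Sequence[str],
    predict: Callable[[List[str]], List[str]],
    k: int,
) -> float:
    """Fraction of positions where the true next token is among the first k predictions."""
    hits = 0
    total = 0
    for i in range(1, len(tokens)):
        context = list(tokens[:i])           # everything typed so far
        predictions = predict(context)[:k]   # ranked suggestions, best first
        hits += tokens[i] in predictions
        total += 1
    return hits / total if total else 0.0


def toy_predict(context: List[str]) -> List[str]:
    """Hypothetical stand-in for the real model: always suggests the same words."""
    return ["the", "of", "and"]


sample = ["the", "cat", "of", "the", "hill"]
top1 = top_k_accuracy(sample, toy_predict, k=1)
top3 = top_k_accuracy(sample, toy_predict, k=3)
```

With this toy data, one of the four targets matches the first suggestion and two match within the first three, so top1 is 0.25 and top3 is 0.5.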

API documentation: http://localhost:8000/docs
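
FastAPI applications also expose a machine-readable schema, by default at /openapi.json, unless the app overrides that path. The sketch below lists the available routes from that schema without assuming any endpoint names; the helper names list_endpoints and fetch_schema are hypothetical, and fetch_schema only works while the backend is running.

```python
import json
from typing import Dict, List
from urllib.request import urlopen

HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options", "trace"}


def list_endpoints(schema: Dict) -> List[str]:
    """Return sorted 'METHOD /path' strings from an OpenAPI schema dictionary."""
    endpoints = []
    for path, operations in schema.get("paths", {}).items():
        for method in operations:
            # Path items may also hold non-method keys such as "parameters".
            if method.lower() in HTTP_METHODS:
                endpoints.append(f"{method.upper()} {path}")
    return sorted(endpoints)


def fetch_schema(base_url: str = "http://localhost:8000") -> Dict:
    """Download the OpenAPI schema (assumes FastAPI's default /openapi.json path)."""
    with urlopen(f"{base_url}/openapi.json") as response:
        return json.load(response)


# Example, with the backend running:
# for endpoint in list_endpoints(fetch_schema()):
#     print(endpoint)
```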

Model Architecture

Fallback Chain

Trigram - Uses last 2 words for prediction (highest accuracy)
Bigram - Falls back to last 1 word if trigram not found
Common Words - Returns most frequent words as last resort
Sentence Starters - Special case for empty input
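
The chain above can be sketched in a few lines of Python. This is an illustration only: the table layout (dictionaries of Counter objects), the function name predict_next, and the toy data are assumptions, and the real predict.py and .pkl files may be organized differently.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical in-memory layout; the real .pkl files may be structured differently.
TrigramTable = Dict[Tuple[str, str], Counter]
BigramTable = Dict[str, Counter]


def predict_next(
    text: str,
    trigrams: TrigramTable,
    bigrams: BigramTable,
    common_words: List[str],
    sentence_starters: List[str],
    k: int = 3,
) -> List[str]:
    """Return up to k suggestions using the trigram -> bigram -> common-words chain."""
    words = text.lower().split()

    # Special case: empty input gets sentence starters.
    if not words:
        return sentence_starters[:k]

    # 1. Trigram: condition on the last two words.
    if len(words) >= 2:
        candidates = trigrams.get((words[-2], words[-1]))
        if candidates:
            return [word for word, _ in candidates.most_common(k)]

    # 2. Bigram: condition on the last word only.
    candidates = bigrams.get(words[-1])
    if candidates:
        return [word for word, _ in candidates.most_common(k)]

    # 3. Last resort: globally frequent words.
    return common_words[:k]


# Toy data for illustration.
trigrams = {("the", "united"): Counter({"states": 9, "kingdom": 4})}
bigrams = {"united": Counter({"states": 12, "nations": 3})}
common = ["the", "of", "and"]
starters = ["The", "In", "It"]

print(predict_next("the united", trigrams, bigrams, common, starters))
```

Each level is tried only when the one before it has no entry for the current context, so the most specific match always wins.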

Training Details

Dataset: WikiText-103 (~100M tokens)
Vocabulary: 30,000 most common words
Min frequency: 5 occurrences
Training time: Depends on the chosen max_vocab and min_count
Model size: ~70-250 MB total
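
As a rough picture of how max_vocab and min_count shape the models, here is a minimal sketch of vocabulary selection and N-gram counting. It is not the code in train.py: the function names are hypothetical, and skipping N-grams that contain out-of-vocabulary words is an assumption (a real implementation might map them to an unknown-word token instead).

```python
from collections import Counter, defaultdict
from typing import DefaultDict, List, Set, Tuple


def build_vocab(tokens: List[str], max_vocab: int, min_count: int) -> Set[str]:
    """Keep the max_vocab most frequent words that occur at least min_count times."""
    counts = Counter(tokens)
    return {word for word, count in counts.most_common(max_vocab) if count >= min_count}


def count_ngrams(
    tokens: List[str], vocab: Set[str]
) -> Tuple[DefaultDict[str, Counter], DefaultDict[Tuple[str, str], Counter]]:
    """Count bigram and trigram continuations, skipping N-grams with unknown words."""
    bigrams: DefaultDict[str, Counter] = defaultdict(Counter)
    trigrams: DefaultDict[Tuple[str, str], Counter] = defaultdict(Counter)
    for i in range(1, len(tokens)):
        prev, cur = tokens[i - 1], tokens[i]
        if prev in vocab and cur in vocab:
            bigrams[prev][cur] += 1
            if i >= 2 and tokens[i - 2] in vocab:
                trigrams[(tokens[i - 2], prev)][cur] += 1
    return bigrams, trigrams


tokens = "the cat sat on the mat the cat ran".split()

# A tight vocabulary keeps only frequent words and shrinks the tables.
small_vocab = build_vocab(tokens, max_vocab=3, min_count=2)
small_bigrams, small_trigrams = count_ngrams(tokens, small_vocab)

# A loose vocabulary keeps everything and yields more N-grams.
full_vocab = build_vocab(tokens, max_vocab=10, min_count=1)
full_bigrams, full_trigrams = count_ngrams(tokens, full_vocab)
```

Raising min_count or lowering max_vocab drops more contexts, which is why both training time and model size depend on these two settings.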

Known Limitations

Limited to 30k vocabulary (rare words not recognized)
Formal writing style (trained on Wikipedia)
No slang or modern internet language

Future Improvements

Emoji predictions

License

This project is licensed under the MIT License

Made with love 💘!
