A next-word prediction application powered by N-gram language models trained on WikiText-103.
This project implements a real-time text prediction system that suggests the next word as you type. It uses N-gram language models served by a FastAPI backend, with a React frontend. The models are trained on WikiText-103, a high-quality dataset of Wikipedia articles containing over 100 million tokens.
text-predictor/
│
├── server/                        # Python backend
│   ├── .venv/                     # Python virtual environment
│   ├── api/                       # FastAPI application
│   │   ├── main.py                # API endpoints
│   │   ├── schemas.py             # Request/response models
│   │   ├── models.py              # ML model loader
│   │   └── requirements.txt       # Python dependencies
│   ├── core/                      # Core ML logic
│   │   ├── train.py               # Model training script
│   │   ├── predict.py             # Prediction engine
│   │   ├── evaluate.py            # Model evaluation
│   │   └── cli.py                 # Command-line interface
│   ├── models/                    # Trained models
│   │   ├── trigram_model.pkl
│   │   ├── bigram_model.pkl
│   │   ├── common_words.pkl
│   │   ├── sentence_starters.pkl
│   │   └── evaluation_results.pkl
│   └── data/                      # Training data
│       └── wikitext-103/
│           ├── wiki.train.tokens
│           ├── wiki.test.tokens
│           └── wiki.valid.tokens
│
└── client/                        # React frontend
    ├── node_modules/              # Node dependencies
    ├── src/
    │   ├── components/            # React components
    │   │   ├── Header.tsx
    │   │   ├── TextInput.tsx
    │   │   └── StatsPanel.tsx
    │   ├── services/              # API integration
    │   │   └── api.ts
    │   ├── types/                 # TypeScript types
    │   │   └── index.ts
    │   ├── App.tsx                # Main app component
    │   ├── main.tsx               # Entry point
    │   └── index.css              # Global styles
    ├── public/
    ├── index.html
    ├── package.json
    ├── tsconfig.json
    ├── vite.config.ts
    └── tailwind.config.js
- Python 3.8+
- Node.js 18+
- pip & npm
Navigate to server directory:
cd server
Create virtual environment:
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
Install Python dependencies:
cd api
pip install -r requirements.txt
Download WikiText-103 dataset:
- Download from Hugging Face or Kaggle
- Extract to server/data/wikitext-103/
Train the models (you can tune max_vocab and min_count as desired):
cd ../core
python train.py
This will generate 4 model files in server/models/:
- trigram_model.pkl (~50-200 MB)
- bigram_model.pkl (~20-50 MB)
- common_words.pkl (~1 MB)
- sentence_starters.pkl (~1 MB)
Navigate to client directory:
cd ../../client
Install Node dependencies:
npm install
Terminal 1 - Start Backend:
cd server/api
python main.py
Backend runs on: http://localhost:8000
Terminal 2 - Start Frontend:
cd client
npm run dev
Frontend runs on: http://localhost:5173
Open your browser and navigate to http://localhost:5173
Test predictions directly from terminal:
cd server/core
python cli.py
Evaluate model accuracy on test data:
cd server/core
python evaluate.py
Expected metrics:
- Top-1 Accuracy: ~18% (first prediction correct)
- Top-3 Accuracy: ~35% (correct word in top 3)
- Top-5 Accuracy: ~42% (correct word in top 5)
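To make the top-k metrics concrete, here is a minimal sketch of how such accuracies could be computed. It is not taken from evaluate.py: the function name `top_k_accuracy`, the `predict` callable, and the `(context, target)` sample format are all assumptions made for illustration.

```python
def top_k_accuracy(predict, samples, ks=(1, 3, 5)):
    """Fraction of samples whose target word appears in the top-k predictions.

    predict: hypothetical callable mapping a context string to a ranked
             list of candidate words (best first).
    samples: list of (context, target_word) pairs.
    """
    hits = {k: 0 for k in ks}
    for context, target in samples:
        ranked = predict(context)
        for k in ks:
            # A hit at k means the true next word is among the first k guesses.
            if target in ranked[:k]:
                hits[k] += 1
    total = len(samples)
    return {k: (hits[k] / total if total else 0.0) for k in ks}
```

By construction, Top-1 can never exceed Top-3, and Top-3 can never exceed Top-5, which matches the ordering of the expected metrics above.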
API documentation: http://localhost:8000/docs
- Trigram - Uses last 2 words for prediction (highest accuracy)
- Bigram - Falls back to last 1 word if trigram not found
- Common Words - Returns most frequent words as last resort
- Sentence Starters - Special case for empty input
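The fallback order above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code in predict.py: the function names, the use of `Counter` tables keyed by context, and the plain lists for common words and sentence starters are all hypothetical.

```python
from collections import Counter, defaultdict


def build_counts(tokens):
    """Count which words follow each 1-word and 2-word context."""
    bigrams = defaultdict(Counter)
    trigrams = defaultdict(Counter)
    for i in range(len(tokens) - 1):
        bigrams[tokens[i]][tokens[i + 1]] += 1
    for i in range(len(tokens) - 2):
        trigrams[(tokens[i], tokens[i + 1])][tokens[i + 2]] += 1
    return bigrams, trigrams


def predict_next(text, bigrams, trigrams, common_words, sentence_starters, k=3):
    """Return up to k suggestions, backing off from trigram to bigram to common words."""
    words = text.lower().split()
    if not words:
        # Empty input: suggest typical sentence openers.
        return sentence_starters[:k]
    if len(words) >= 2:
        hits = trigrams.get((words[-2], words[-1]))
        if hits:
            return [w for w, _ in hits.most_common(k)]
    hits = bigrams.get(words[-1])
    if hits:
        return [w for w, _ in hits.most_common(k)]
    # Unseen context: fall back to the globally most frequent words.
    return common_words[:k]
```

For example, after `bigrams, trigrams = build_counts("the cat sat on the mat".split())`, calling `predict_next("sat on", bigrams, trigrams, ["the"], ["The"])` is answered from the trigram table, while an unseen context such as `"zebra"` falls through to the common-words list.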
- Dataset: WikiText-103 (~100M tokens)
- Vocabulary: 30,000 most common words
- Min frequency: 5 occurrences
- Training time: Depends on the chosen max_vocab and min_count
- Model size: ~70-250 MB total
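As a rough illustration of how the vocabulary size and minimum frequency settings interact, the sketch below keeps at most `max_vocab` words and drops any that occur fewer than `min_count` times. The parameter names mirror the ones mentioned above, but the function names and the `<unk>` placeholder are assumptions; train.py may do this differently.

```python
from collections import Counter


def build_vocab(tokens, max_vocab=30_000, min_count=5):
    """Keep the max_vocab most frequent words that occur at least min_count times."""
    counts = Counter(tokens)
    return {w for w, c in counts.most_common(max_vocab) if c >= min_count}


def apply_vocab(tokens, vocab, unk="<unk>"):
    """Replace out-of-vocabulary tokens with a placeholder (assumed to be <unk>)."""
    return [t if t in vocab else unk for t in tokens]
```

Raising `max_vocab` or lowering `min_count` keeps more distinct words, which generally increases both the number of stored N-grams and the size of the resulting model files.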
- Limited to 30k vocabulary (rare words not recognized)
- Formal writing style (trained on Wikipedia)
- No slang or modern internet language
- Emoji predictions
This project is licensed under the MIT License
Made with love 💘!