This project investigates reward engineering, GRPO (Group Relative Policy Optimization) fine-tuning, and model merging for Turkish multiple-choice question answering using GPT-2-based language models.
The focus is not only on improving answer accuracy but also on analyzing model behavior under different reward strategies, in particular mode collapse, output diversity, and format compliance.
Large language models can follow answer-format instructions after supervised fine-tuning, but they may still suffer from limited answer diversity and mode collapse.
In this project, a Turkish multiple-choice question answering task was used to evaluate how different reward functions affect model behavior.
The workflow includes:
- Supervised Fine-Tuning (SFT)
- GRPO-based reward optimization
- Reward function design
- Diversity and entropy-based behavior analysis
- Model merging with MergeKit
- Comparative evaluation of output distributions
The project pipeline consists of the following stages:
- Dataset subset selection
- Prompt construction for A–E multiple-choice questions
- Data cleaning and answer-label formatting
- Supervised Fine-Tuning (SFT)
- GRPO fine-tuning with multiple reward functions
- Model merging using MergeKit
- Evaluation with accuracy, format rate, dominant answer share and prediction entropy
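
The prompt construction and answer-label formatting stages above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the Turkish template words (`Soru`, `Cevap`) and the helper names `build_prompt` and `normalize_label` are assumptions introduced here.

```python
from typing import List, Optional

CHOICE_LABELS = ["A", "B", "C", "D", "E"]


def build_prompt(question: str, choices: List[str]) -> str:
    """Format one question and its five options as an A-E prompt.

    The "Soru:" / "Cevap:" template is a hypothetical choice for illustration.
    """
    if len(choices) != len(CHOICE_LABELS):
        raise ValueError("expected exactly five answer choices (A-E)")
    lines = [f"Soru: {question.strip()}"]
    for label, choice in zip(CHOICE_LABELS, choices):
        lines.append(f"{label}) {choice.strip()}")
    lines.append("Cevap:")
    return "\n".join(lines)


def normalize_label(raw: str) -> Optional[str]:
    """Map a raw label such as ' b) ' to 'B'; return None if it is not A-E."""
    cleaned = raw.strip().upper().rstrip(").:")
    return cleaned if cleaned in CHOICE_LABELS else None
```

Rows whose label normalizes to `None` would be dropped during data cleaning under this sketch.
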
The following reward strategies were evaluated:
- Accuracy reward
- Format compliance reward
- A diversity-oriented reward that encourages different answer choices across multiple generations for the same prompt.
- An entropy-based reward that uses the entropy of the generated answer distribution to encourage balanced predictions.
- A format-aware reward that combines format compliance with diversity encouragement.
- A reward that reduces repetitive answer behavior without applying direct negative penalties.
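
A rough sketch of how three of these rewards might look in code. It follows the reward-function shape that TRL's `GRPOTrainer` accepts (a callable that takes `completions` plus keyword arguments and returns one float per completion), but the function names, the answer regex, and the exact scoring formulas are assumptions rather than the project's implementation. It also assumes that each call receives plain-string completions generated for a single prompt.

```python
import math
import re
from collections import Counter
from typing import List, Optional

# Assumption: a well-formatted completion starts with a single letter A-E.
ANSWER_PATTERN = re.compile(r"^\s*([A-E])\b")


def extract_answer(completion: str) -> Optional[str]:
    """Return the leading A-E label, or None if the format is violated."""
    match = ANSWER_PATTERN.match(completion)
    return match.group(1) if match else None


def format_reward(completions: List[str], **kwargs) -> List[float]:
    """1.0 for a parseable A-E answer, 0.0 otherwise."""
    return [1.0 if extract_answer(c) is not None else 0.0 for c in completions]


def diversity_reward(completions: List[str], **kwargs) -> List[float]:
    """Reward answers that are rare within the group of generations."""
    answers = [extract_answer(c) for c in completions]
    counts = Counter(a for a in answers if a is not None)
    n = len(completions)
    return [0.0 if a is None else 1.0 - counts[a] / n for a in answers]


def entropy_reward(completions: List[str], **kwargs) -> List[float]:
    """Per-completion surprisal under the group's empirical answer distribution.

    The mean of these values over valid answers equals the (normalized)
    entropy of that distribution, so balanced groups score higher on average.
    """
    answers = [extract_answer(c) for c in completions]
    counts = Counter(a for a in answers if a is not None)
    total = sum(counts.values())
    max_surprisal = math.log(5)  # five options, A-E
    return [
        0.0 if a is None else math.log(total / counts[a]) / max_surprisal
        for a in answers
    ]
```

The entropy-based variant assigns a per-completion value instead of one shared score because GRPO normalizes rewards within each group, so a reward that is identical for every generation of a prompt would contribute no learning signal.
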
After training multiple GRPO variants, the models were merged using MergeKit.
The merging strategy was designed to combine:
- baseline model stability,
- diversity-oriented behavior,
- entropy-based distributional balance,
- and format-aware reward behavior.
The merged model was used to analyze whether parameter-level merging could reduce mode collapse more effectively than reward engineering alone.
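To make "parameter-level merging" concrete, the sketch below averages the parameters of several models with normalized weights. It assumes a simple linear (weighted-average) merge; the method and weights actually configured in `merge_multi.yaml` are not stated here and may differ. Parameters are represented as plain lists of floats so the example stays dependency-free, and `linear_merge` is a hypothetical helper, not a MergeKit function.

```python
from typing import Dict, List

# Toy stand-in for a model state dict: parameter name -> flat list of values.
StateDict = Dict[str, List[float]]


def linear_merge(models: List[StateDict], weights: List[float]) -> StateDict:
    """Weighted average of same-shaped parameters across models."""
    if not models or len(models) != len(weights):
        raise ValueError("need one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    normalized = [w / total for w in weights]

    merged: StateDict = {}
    for name in models[0]:
        tensors = [m[name] for m in models]
        # zip(*tensors) walks the same position in every model's parameter.
        merged[name] = [
            sum(w * v for w, v in zip(normalized, values))
            for values in zip(*tensors)
        ]
    return merged
```

For example, merging a baseline and a diversity-oriented variant with weights `[3, 1]` keeps 75% of the baseline's value for every parameter.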
The experiments showed that SFT successfully taught the model the required output format.
However, GRPO reward optimization alone was not sufficient to fully resolve mode collapse.
The merged model reduced the dominant answer share and produced a more diverse prediction distribution than the individual GRPO models.
- SFT achieved strong format compliance.
- GRPO variants maintained a high format rate.
- Individual reward strategies could not fully eliminate mode collapse.
- MergeKit-based model merging improved prediction diversity.
- Output diversity improved more clearly than accuracy.
The models were evaluated using:
- Accuracy
- Format Rate
- Dominant Answer Share
- Prediction Entropy
- A–E Prediction Distribution
These metrics were selected to evaluate not only correctness, but also behavioral characteristics of the generated outputs.
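The metrics above could be computed along these lines. This is a sketch under stated assumptions: predictions are already parsed into A-E labels, with `None` marking outputs that violate the format; dominant answer share and entropy are computed over valid predictions only; and entropy is reported in bits. The function name `evaluate_predictions` is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Optional

CHOICE_LABELS = ["A", "B", "C", "D", "E"]


def evaluate_predictions(
    predictions: List[Optional[str]], gold: List[str]
) -> Dict[str, object]:
    """Compute accuracy, format rate, dominant share, entropy, distribution."""
    if not predictions or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and aligned")
    n = len(predictions)
    valid = [p for p in predictions if p in CHOICE_LABELS]
    counts = Counter(valid)
    total_valid = len(valid)

    # Share of each label among valid predictions (0.0 if none are valid).
    distribution = {
        label: (counts[label] / total_valid if total_valid else 0.0)
        for label in CHOICE_LABELS
    }
    # Shannon entropy in bits; the maximum for five options is log2(5).
    entropy = -sum(p * math.log2(p) for p in distribution.values() if p > 0)

    return {
        "accuracy": sum(p == g for p, g in zip(predictions, gold)) / n,
        "format_rate": total_valid / n,
        "dominant_answer_share": max(distribution.values()),
        "prediction_entropy": entropy,
        "distribution": distribution,
    }
```

A fully collapsed model that always answers the same letter would show a dominant answer share of 1.0 and an entropy of 0 bits, regardless of its accuracy.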
turkish-grpo-reward-ensembles/
│
├── README.md
├── requirements.txt
├── RewardEnsembles.ipynb
├── merge_multi.yaml
│
└── results/
- PyTorch
- Hugging Face Transformers
- TRL
- PEFT
- MergeKit
- Hugging Face Hub
- Pandas
- NumPy
- Matplotlib
- scikit-learn
- Large Language Models
- GRPO
- Reward Engineering
- Supervised Fine-Tuning
- Model Merging
- MergeKit
- Mode Collapse Analysis
- Turkish NLP
- Multiple-Choice Question Answering
Emir Kaan SAIT
Yıldız Technical University M.Sc. Computer Engineering