This project provides a comprehensive suite of tools to generate ground truth (GT) data for Tesseract OCR for the Tamil language. It includes scripts for text normalization, image and GT file generation, data verification, and OCR evaluation.
- Text Normalization: Pre-processes raw text files into a clean, trainable format.
- Ground Truth Generation: Creates TIFF images and corresponding .gt.txt files for each line of text.
- Font Flexibility: Uses a variety of Tamil fonts to generate diverse training data.
- Data Verification: Includes a sample script to verify the integrity of the generated dataset.
- OCR Evaluation: Provides tools to calculate Character Error Rate (CER) and Word Error Rate (WER).
- Frequency Analysis: Scripts to analyze character and word frequencies in the dataset.
The project follows a clear workflow:
- Data Preparation: Raw text files (from raw_data/ or other sources like JSON) are processed. json2text.py can be used to convert JSON data to text.
- Text Normalization: The normalize-gt.py script merges and normalizes the text data, creating data/training-data.txt.
- GT Generation: The generate-gt.py script takes the normalized text and generates .tif images and .gt.txt files in the gt/ directory.
- Verification: The verify.py script is a sample that can be adapted to check the consistency of the generated files.
- Evaluation: After training a Tesseract model with the generated data, the cer_wer_tamil.py script can be used to evaluate its performance.
- Analysis: find_cfr.py can be used to analyze the character and word frequencies of the dataset and generate a graph of the character frequencies.
- Clone the repository:
    git clone https://github.com/khaleeljageer/tesseract-gt-builder.git
    cd tesseract-gt-builder
- Install the required Python packages:
    pip install -r requirements.txt
- Place your raw .txt files in the raw_data/ directory.
- If you have JSON data, you can use json2text.py to convert it to text. You might need to modify the script to fit your JSON structure and specify the correct input file name. The script currently expects a file named tamil-articles-from-wikinews.json, which is not included in this repository.
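As a rough illustration of the conversion step, the sketch below reads a JSON file that holds a list of objects and writes one line of text per non-empty line of content. The field name "text", the function name json_to_text, and the paths are assumptions made for this example; the actual json2text.py and the Wikinews file may be structured differently.

```python
import json
from pathlib import Path


def json_to_text(json_path, out_path, field="text"):
    """Write the `field` value of every record to out_path, one line per line.

    Assumes the JSON file is a list of objects; `field` is a hypothetical key.
    Returns the number of lines written.
    """
    records = json.loads(Path(json_path).read_text(encoding="utf-8"))
    lines = []
    for record in records:
        value = record.get(field, "")
        # Split multi-line articles and drop blank lines.
        lines.extend(part.strip() for part in value.splitlines() if part.strip())
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return len(lines)
```

Writing the result into raw_data/ lets the normalization step pick it up together with any other raw .txt files.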
Run the normalize-gt.py script to create the training data file:
    python normalize-gt.py
This will create data/training-data.txt.
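The exact rules applied by normalize-gt.py are not described here. As a minimal sketch of what line-level normalization for Tamil text can look like, the example below applies Unicode NFC, removes zero-width spaces and byte-order marks, collapses whitespace, and drops empty or duplicate lines. The function names and the choice of rules are assumptions for illustration, not the script's actual behavior.

```python
import re
import unicodedata

_WHITESPACE = re.compile(r"\s+")
# Zero-width space and byte-order mark; ZWJ/ZWNJ are deliberately left alone.
_INVISIBLE = str.maketrans("", "", "\u200b\ufeff")


def normalize_line(line):
    """Return one cleaned line: NFC form, no invisible debris, single spaces."""
    line = unicodedata.normalize("NFC", line)
    line = line.translate(_INVISIBLE)
    return _WHITESPACE.sub(" ", line).strip()


def normalize_lines(lines):
    """Normalize every line, dropping empty lines and exact duplicates."""
    seen = set()
    result = []
    for raw in lines:
        line = normalize_line(raw)
        if line and line not in seen:
            seen.add(line)
            result.append(line)
    return result
```

NFC matters for Tamil because some vowel signs can be encoded either as one code point or as two; normalizing makes the ground truth consistent with itself.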
The generate-gt.py script uses data/sample.txt by default. To use the newly generated data/training-data.txt, you need to modify the TEXT_FILE variable in generate-gt.py.
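The change is a one-line edit. The snippet below shows its likely shape; how the variable is actually declared in generate-gt.py is an assumption, so check the script before editing.

```python
# In generate-gt.py (the surrounding code is assumed, not quoted):
# TEXT_FILE = "data/sample.txt"       # default described above
TEXT_FILE = "data/training-data.txt"  # file produced by normalize-gt.py
```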
Run the generate-gt.py script to generate the images and GT files:
    python generate-gt.py
The output will be saved in the gt/ directory.
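To make the shape of the output concrete, here is a minimal sketch of writing one image and ground-truth pair with Pillow. It is not the implementation in generate-gt.py: the function names, the file naming scheme (font name plus a zero-padded index), the font size, and the padding are all assumptions for illustration.

```python
from pathlib import Path


def render_line(text, font_path, out_tif, font_size=32, padding=10):
    """Render one line of text as black on white and save it as a TIFF."""
    # Imported here so the pairing logic below works without Pillow installed.
    from PIL import Image, ImageDraw, ImageFont

    font = ImageFont.truetype(str(font_path), font_size)
    left, top, right, bottom = font.getbbox(text)
    size = (right - left + 2 * padding, bottom - top + 2 * padding)
    image = Image.new("L", size, color=255)  # grayscale, white background
    ImageDraw.Draw(image).text((padding - left, padding - top), text, font=font, fill=0)
    image.save(out_tif, format="TIFF")


def write_gt_pair(text, font_path, index, out_dir="gt", render=render_line):
    """Write <font>_<index>.tif and the matching .gt.txt; return both paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    base = f"{Path(font_path).stem}_{index:06d}"  # hypothetical naming scheme
    tif_path = out_dir / f"{base}.tif"
    gt_path = out_dir / f"{base}.gt.txt"
    render(text, font_path, tif_path)
    gt_path.write_text(text + "\n", encoding="utf-8")
    return tif_path, gt_path
```

Tamil is a complex script, so correct glyph shaping generally depends on Pillow being built with Raqm layout support; without it, vowel signs may render in the wrong position.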
The verify.py script is a sample script designed for a specific directory structure. You will need to modify it significantly to work with the output of generate-gt.py.
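As a starting point for such an adaptation, the sketch below checks a flat directory in which every image X.tif is expected to have a non-empty X.gt.txt beside it. The flat layout and the function name verify_gt_dir are assumptions; adjust them to the layout your run actually produces.

```python
from pathlib import Path


def verify_gt_dir(gt_dir="gt"):
    """Return a list of problems found in gt_dir (empty when all is consistent)."""
    gt_dir = Path(gt_dir)
    problems = []
    images = {p.name[: -len(".tif")] for p in gt_dir.glob("*.tif")}
    texts = {p.name[: -len(".gt.txt")] for p in gt_dir.glob("*.gt.txt")}
    for base in sorted(images - texts):
        problems.append(f"{base}.tif has no matching .gt.txt")
    for base in sorted(texts - images):
        problems.append(f"{base}.gt.txt has no matching .tif")
    for base in sorted(images & texts):
        if (gt_dir / f"{base}.tif").stat().st_size == 0:
            problems.append(f"{base}.tif is empty")
        if not (gt_dir / f"{base}.gt.txt").read_text(encoding="utf-8").strip():
            problems.append(f"{base}.gt.txt is empty")
    return problems
```

An empty result means every image has a transcription and vice versa; it does not confirm that the image content matches the text.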
After training your model, you can evaluate it using cer_wer_tamil.py:
    python cer_wer_tamil.py --ground_truth /path/to/your/ground-truth.txt --prediction /path/to/your/ocr-output.txt
To analyze the character and word frequencies in your dataset, run:
    python find_cfr.py
This will also generate a char_freq_graph.png file with a graph of the top 10 character frequencies.
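For readers who want to see what the evaluation and analysis steps compute, the sketch below implements CER and WER as edit distance divided by reference length, plus a simple character frequency count. It is a pure-Python illustration written for this README, not the code in cer_wer_tamil.py or find_cfr.py, and all function names are assumptions. Characters are counted as Unicode code points, so a Tamil consonant and its vowel sign count as two characters; a grapheme-based count would give different numbers.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (r != h),  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference, prediction):
    """Character error rate; assumes a non-empty reference."""
    return edit_distance(reference, prediction) / len(reference)


def wer(reference, prediction):
    """Word error rate on whitespace-separated words; assumes a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, prediction.split()) / len(ref_words)


def top_characters(text, n=10):
    """Return the n most frequent non-whitespace characters with their counts."""
    return Counter(ch for ch in text if not ch.isspace()).most_common(n)
```

Both rates can exceed 1.0 when the prediction is much longer than the reference, because insertions count as errors.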
- config.py: Configuration for the data generation process.
- generate-gt.py: Main script to generate TIFF images and GT files.
- normalize-gt.py: Script to normalize the text data.
- verify.py: Sample script to verify the generated dataset.
- cer_wer_tamil.py: Script to calculate CER and WER.
- find_cfr.py: Script to find character and word frequencies and generate a frequency graph.
- json2text.py: Utility to convert JSON to text.
- requirements.txt: List of required Python packages.
- data/: Directory for training data.
- raw_data/: Directory for raw text files.
- fonts/: Directory containing TTF font files.
- gt/: Directory where the generated TIFF images and GT files are saved.
- cer_wer/: Directory containing sample files for CER/WER calculation.
- model/: Directory for trained models.
- Pillow
- tqdm
- open-tamil
- jiwer
- matplotlib
- mplcairo
- opencv-python
- numpy
This project is licensed under the GNU General Public License v3.0. See LICENSE.
If you use this repository or the associated dataset in your research, please cite:
Dataset:
@dataset{tamilocr_dataset_2025,
author = {Syedkhaleel Jageer},
title = {Synthetic OCR Dataset: 105,738 Tamil Text Lines Rendered in 27 Diverse Fonts with Corresponding Ground Truth Annotations},
year = {2025},
publisher = {Zenodo},
doi = {10.5281/zenodo.16881612},
url = {https://doi.org/10.5281/zenodo.16881612}
}
Code Repository:
@misc{jageer2025tesseractGTBuilder,
author = {Syedkhaleel Jageer},
title = {{Tesseract-GT-Builder: Tools to generate ground-truth data for Tesseract OCR (Tamil)}},
howpublished = {\url{https://github.com/khaleeljageer/tesseract-gt-builder}},
year = {2025},
note = {Accessed: August 15, 2025}
}