Tesseract GT Builder for Tamil OCR

This project provides tools for generating ground truth (GT) data to train Tesseract OCR on Tamil text. It includes scripts for text normalization, image and GT file generation, data verification, and OCR evaluation.

Features

Text Normalization: Pre-processes raw text files into a clean, trainable format.
Ground Truth Generation: Creates TIFF images and corresponding .gt.txt files for each line of text.
Font Flexibility: Uses a variety of Tamil fonts to generate diverse training data.
Data Verification: Includes a sample script to verify the integrity of the generated dataset.
OCR Evaluation: Provides tools to calculate Character Error Rate (CER) and Word Error Rate (WER).
Frequency Analysis: Scripts to analyze character and word frequencies in the dataset.

Workflow

The pipeline runs in six stages:

Data Preparation: Raw text files (from raw_data/ or other sources like JSON) are processed. json2text.py can be used to convert JSON data to text.
Text Normalization: The normalize-gt.py script merges and normalizes the text data, creating data/training-data.txt.
GT Generation: The generate-gt.py script takes the normalized text and generates .tif images and .gt.txt files in the gt/ directory.
Verification: The verify.py script is a sample that can be adapted to check the consistency of the generated files.
Evaluation: After training a Tesseract model with the generated data, the cer_wer_tamil.py script can be used to evaluate its performance.
Analysis: find_cfr.py can be used to analyze the character and word frequencies of the dataset and generate a graph of the character frequencies.

Installation

Clone the repository:

git clone https://github.com/khaleeljageer/tesseract-gt-builder.git
cd tesseract-gt-builder

Install the required Python packages:
```
pip install -r requirements.txt
```

Usage

1. Prepare Your Data

Place your raw .txt files in the raw_data/ directory.
If you have JSON data, you can use json2text.py to convert it to text. You may need to modify the script to match your JSON structure and to point it at the correct input file. The script currently expects a file named tamil-articles-from-wikinews.json, which is not included in this repository.
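
If your JSON has a different shape from what json2text.py expects, a small converter along the following lines may be a useful starting point. This is a minimal sketch, not the code in json2text.py: it assumes the file holds a list of objects and that the text sits under a key called "text". The function name, the key, and the file names in the usage note are all hypothetical.

```python
import json
from pathlib import Path


def json_to_text(json_path, out_path, text_key="text"):
    """Write the `text_key` field of each record to out_path and return the record count.

    Assumes the JSON file is a list of objects (an assumption, adjust as needed).
    """
    records = json.loads(Path(json_path).read_text(encoding="utf-8"))
    kept = []
    for record in records:
        value = record.get(text_key, "")
        # Skip records with missing, non-string, or blank text.
        if isinstance(value, str) and value.strip():
            kept.append(value.strip())
    Path(out_path).write_text("".join(item + "\n" for item in kept), encoding="utf-8")
    return len(kept)
```

For example, json_to_text("articles.json", "raw_data/articles.txt") would write one record's text per block into raw_data/, where the normalization step can pick it up.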

2. Normalize the Text

Run the normalize-gt.py script to create the training data file:

python normalize-gt.py

This will create data/training-data.txt.
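
To give a rough idea of what text normalization for OCR training typically involves, here is a minimal sketch. It is not the logic of normalize-gt.py, which may apply different or additional rules. It assumes three steps: Unicode NFC normalization (so that, for example, a Tamil vowel sign typed as two code points becomes the single composed form), whitespace collapsing, and dropping empty lines. The function names are hypothetical.

```python
import re
import unicodedata


def normalize_line(line):
    """Apply NFC normalization and collapse runs of whitespace to one space."""
    line = unicodedata.normalize("NFC", line)
    line = re.sub(r"\s+", " ", line)
    return line.strip()


def normalize_text(raw):
    """Normalize every line of `raw` and drop lines that end up empty."""
    cleaned = (normalize_line(line) for line in raw.splitlines())
    return [line for line in cleaned if line]
```

Normalizing to a single Unicode form matters because two visually identical strings with different code point sequences would otherwise count as different text during training and evaluation.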

3. Generate Ground Truth Data

By default, generate-gt.py reads data/sample.txt. To use the newly generated data/training-data.txt instead, change the TEXT_FILE variable in generate-gt.py.

Run the generate-gt.py script to generate the images and GT files:

python generate-gt.py

The output will be saved in the gt/ directory.

4. Verify the Generated Data

The verify.py script is a sample script designed for a specific directory structure. You will need to modify it significantly to work with the output of generate-gt.py.
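
As a starting point for such an adaptation, the sketch below checks a flat directory for unpaired or empty files. It is not verify.py: it assumes that each image is named <stem>.tif and its transcription <stem>.gt.txt in the same directory, and the function name is hypothetical.

```python
from pathlib import Path


def find_problems(gt_dir):
    """Return a list of human-readable problems found in a flat GT directory."""
    gt_dir = Path(gt_dir)
    images = {p.name[: -len(".tif")] for p in gt_dir.glob("*.tif")}
    texts = {p.name[: -len(".gt.txt")] for p in gt_dir.glob("*.gt.txt")}

    problems = []
    for stem in sorted(images - texts):
        problems.append(f"{stem}.tif has no matching .gt.txt")
    for stem in sorted(texts - images):
        problems.append(f"{stem}.gt.txt has no matching .tif")
    for stem in sorted(images & texts):
        content = (gt_dir / f"{stem}.gt.txt").read_text(encoding="utf-8")
        if not content.strip():
            problems.append(f"{stem}.gt.txt is empty")
    return problems
```

Calling find_problems("gt") returns an empty list when every image has a non-empty transcription and vice versa.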

5. Evaluate Your OCR Model

After training your model, you can evaluate it using cer_wer_tamil.py:

python cer_wer_tamil.py --ground_truth /path/to/your/ground-truth.txt --prediction /path/to/your/ocr-output.txt
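
For readers unfamiliar with the two metrics: both are an edit distance (insertions, deletions, and substitutions) divided by the length of the reference, counted over characters for CER and over whitespace-separated words for WER. The sketch below shows that definition in plain Python. It is not the code in cer_wer_tamil.py, and it counts characters as Unicode code points, which is an assumption; a Tamil-aware script may count whole letters instead, which would give different CER values.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, prediction):
    """Character error rate over code points."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, prediction) / len(reference)


def wer(reference, prediction):
    """Word error rate over whitespace-separated tokens."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, prediction.split()) / len(ref_words)
```

Because the divisor is the reference length, either rate can exceed 1.0 when the prediction is much longer than the reference.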

6. Analyze the Dataset

To analyze the character and word frequencies in your dataset, run:

python find_cfr.py

This will also generate a char_freq_graph.png file with a graph of the top 10 character frequencies.
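
The counting behind such an analysis can be sketched with the standard library alone. This is an illustration rather than the code in find_cfr.py: it counts individual Unicode code points (so a Tamil consonant and its vowel sign are tallied separately, which is an assumption), skips whitespace, and leaves out the plotting step. The function name is hypothetical.

```python
from collections import Counter


def frequencies(lines):
    """Return (character_counts, word_counts) for an iterable of text lines."""
    chars = Counter()
    words = Counter()
    for line in lines:
        words.update(line.split())
        # Count every non-whitespace code point.
        chars.update(ch for ch in line if not ch.isspace())
    return chars, words
```

The ten most frequent characters are then available as chars.most_common(10), which is the kind of data a top-10 frequency graph would plot.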

Project Structure

config.py: Configuration for the data generation process.
generate-gt.py: Main script to generate TIFF images and GT files.
normalize-gt.py: Script to normalize the text data.
verify.py: Sample script to verify the generated dataset.
cer_wer_tamil.py: Script to calculate CER and WER.
find_cfr.py: Script to find character and word frequencies and generate a frequency graph.
json2text.py: Utility to convert JSON to text.
requirements.txt: List of required Python packages.
data/: Directory for training data.
raw_data/: Directory for raw text files.
fonts/: Directory containing TTF font files.
gt/: Directory where the generated TIFF images and GT files are saved.
cer_wer/: Directory containing sample files for CER/WER calculation.
model/: Directory for trained models.

Dependencies

Pillow
tqdm
open-tamil
jiwer
matplotlib
mplcairo
opencv-python
numpy

License

This project is licensed under the GNU General Public License v3.0. See LICENSE.

Citation

If you use this repository or the associated dataset in your research, please cite:

Dataset:

@dataset{tamilocr_dataset_2025,
author = {Syedkhaleel Jageer},
title = {Synthetic OCR Dataset: 105,738 Tamil Text Lines Rendered in 27 Diverse Fonts with Corresponding Ground Truth Annotations},
year = {2025},
publisher = {Zenodo},
doi = {10.5281/zenodo.16881612},
url = {https://doi.org/10.5281/zenodo.16881612}
}

Code Repository:

@misc{jageer2025tesseractGTBuilder,
author = {Syedkhaleel Jageer},
title = {{Tesseract-GT-Builder: Tools to generate ground-truth data for Tesseract OCR (Tamil)}},
howpublished = {\url{https://github.com/khaleeljageer/tesseract-gt-builder}},
year = {2025},
note = {Accessed: August 15, 2025}
}
