Belarusian NLP and Speech Processing resources

A curated list of Belarusian Natural Language and Speech Processing resources, datasets, and models.

Inspired by egorsmkv/speech-recognition-uk.

🤝 Contributing

Contributions are very welcome! Please open a pull request to add missing resources, update entries, or fix any of the TODOs below. Help keep this list current.

🏡 Communities and platforms

say.by — Telegram
corpus.by
ssrlab.by
bnkorpus.info
Belarus — GitHub organization
nlproc.by — GitHub community

📚 Datasets

Name	Type	Notes
Common Voice	STT	Mozilla multilingual speech corpus
google/fleurs	STT
knihi.com — Корпус беларускага маўлення (Belarusian Speech Corpus)	STT	Corpus for neural network training. TODO: confirm dataset type
ssrlab	STT	TODO
maaxap/BelarusianGLUE	NLP	Paper
oscar-corpus/oscar	NLP
allenai/c4	NLP
poritski/YABC	NLP	Experimental Corpus of the Belarusian Language (Эксперыментальны корпус беларускай мовы, ЭКБМ)
Belarus/GrammarDB	NLP	Grammar Database of Belarusian language
tsimafeip/Translator	NLP	Russian-Belarusian translation pairs
UniversalDependencies/UD_Belarusian-HSE	NLP	Universal Dependencies treebank. Project page
Tatoeba — Belarusian	NLP	Belarusian sentences in Tatoeba
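
Universal Dependencies treebanks such as UD_Belarusian-HSE are distributed as CoNLL-U files: one token per line with ten tab-separated columns, `#` comment lines, and a blank line after each sentence. Below is a minimal sketch of reading that format with the standard library only. The sample sentence is hand-written for illustration and is not taken from the treebank, and `parse_conllu` is a hypothetical helper name, not part of any UD tooling.

```python
from typing import Dict, List

# The ten CoNLL-U columns, in their standard order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

# Hand-written illustrative sentence ("I read a book."), NOT from the treebank.
SAMPLE = (
    "# text = Я чытаю кнігу.\n"
    "1\tЯ\tя\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tчытаю\tчытаць\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tкнігу\tкніга\tNOUN\t_\t_\t2\tobj\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comment (e.g. "# text = ..."); skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        current.append(dict(zip(FIELDS, cols)))
    if current:
        # Tolerate a missing trailing blank line.
        sentences.append(current)
    return sentences


sentences = parse_conllu(SAMPLE)
pairs = [(tok["form"], tok["upos"]) for tok in sentences[0]]
print(pairs)
```

All values are kept as strings, so multiword-token ranges (IDs like `1-2`) and empty nodes (IDs like `1.1`) pass through unchanged; a real pipeline would likely want to handle them explicitly.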

🎤 Speech-to-Text Models

Name	Metrics	Notes
ales/wav2vec2-cv-be	WER `12.4` on test split of CV8	wav2vec2 + kenlm language model, trained on Common Voice 8. Demo, Code
ales/whisper-small-belarusian	WER `6.79` on test split of CV11	Whisper Small fine-tuned on Common Voice 11. Demo, Code
ales/whisper-base-belarusian	WER `12.207` on validation split of CV11	Whisper Base fine-tuned on Common Voice 11. Code
openai/whisper	-	Original Whisper models from OpenAI
nvidia/stt_be_conformer_ctc_large	WER `4.8` on test split of CV10	NVIDIA NeMo Conformer-CTC
nvidia/stt_be_conformer_transducer_large	WER `3.8` on test split of CV10	NVIDIA NeMo Conformer-Transducer
nvidia/stt_be_fastconformer_hybrid_large_pc	WER `2.72` on test split of CV12; WER P&C `3.87` on test split of CV12	NVIDIA NeMo FastConformer Hybrid
espnet/belarusian_commonvoice_blstm	-	ESPnet BLSTM trained on Common Voice
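
The Metrics column above reports word error rate (WER). As a reminder of what that number means, here is a minimal sketch that computes WER as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, expressed as a percentage. It assumes the figures in the table are percentages and uses plain whitespace tokenization; the model authors may have applied different text normalization, so this is not necessarily how the listed scores were produced. The function names and the sample transcript are illustrative only.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two word sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref to hyp[:j]
    for i, r in enumerate(ref, start=1):
        cur = [i]  # distance from ref[:i] to empty hyp
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(
                prev[j] + 1,         # deletion of a reference word
                cur[j - 1] + 1,      # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent, using whitespace tokenization."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# One of four reference words is missing from the hypothesis.
score = wer("добры дзень як справы", "добры дзень справы")
print(score)  # 25.0
```

Scores in the table are measured on different Common Voice releases and splits, so they are not directly comparable with one another.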

🤖 Text-to-Speech Models

Name	Metrics	Notes
coqui-ai/TTS	-	Official CoquiAI Belarusian recipe
jhlfrfufyfn/bel-tts	-	GlowTTS + HifiGAN. Model, HF demo, web server source code
alex73/belarusian-tts	-	CoquiAI implementation by Yurii Paniv (@robinhad). Original repo and models deleted — only fork remains

📝 NLP Models and Tools

Name	Topic	Notes
alesdrobysh/belmorph	Morphological analysis	Fast TypeScript morphological analyzer, inflection, and lexeme generator. Demo, a really beautiful one!
pkasila/bel-sklony	Declension (locative case detection)	Web page for Belarusian noun declension (locative case, месны склон, only). Demo
poritski/YABC_Tagger	POS-tagging, Lemmatization	Rule-based. Perl. Uses poritski/YABC as grammar base
volchek/beltagger	POS-tagging, Lemmatization	Improved C++ port of poritski/YABC_Tagger. Cross-platform. Known issues: requires Windows-1251 input (no UTF-8), tagset not fully compatible with BNKorpus, incomplete grammar base (Belarus/GrammarDB not yet incorporated), suffix table calculation not ported from Perl, depends on Boost
stanfordnlp/stanza-be	POS-tagging	Stanza framework
KoichiYasuoka/roberta-small-belarusian-upos	POS-tagging	RoBERTa model
KoichiYasuoka/roberta-small-belarusian	Masked Language Modeling	RoBERTa model
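
The volchek/beltagger entry notes that the tool requires Windows-1251 input rather than UTF-8. The sketch below shows one way to convert text to that encoding in Python before handing it to such a tool. Python's `cp1251` codec covers the Belarusian letters `ў` and `і`, but it has no slot for the modifier-letter apostrophe (U+02BC), so the sketch maps it to an ASCII apostrophe first. That substitution is an assumption about what the tagger accepts, not something documented in the entry above, and the helper names are hypothetical.

```python
def to_cp1251(text: str) -> bytes:
    """Encode text as Windows-1251, normalizing the U+02BC apostrophe."""
    # Assumption: an ASCII apostrophe is an acceptable stand-in for U+02BC,
    # which Windows-1251 cannot represent.
    normalized = text.replace("\u02bc", "'")
    # errors="strict" raises UnicodeEncodeError on any other unsupported
    # character instead of silently dropping it.
    return normalized.encode("cp1251", errors="strict")


def from_cp1251(data: bytes) -> str:
    """Decode Windows-1251 bytes (e.g. tool output) back to a Python string."""
    return data.decode("cp1251")


sample = "Мова ўсіх і кожнага"
encoded = to_cp1251(sample)
print(len(sample), len(encoded))      # same length: one byte per character
print(from_cp1251(encoded) == sample)  # True
```

How the converted bytes are passed to the tagger (file, standard input, or otherwise) depends on the tool itself and is not shown here.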
