A curated list of Belarusian Natural Language and Speech Processing resources, datasets, and models.
Inspired by egorsmkv/speech-recognition-uk.
- 🤝 Contributing
- 🏡 Communities and platforms
- 📚 Datasets
- 🎤 Speech-to-Text Models
- 🤖 Text-to-Speech Models
- 📝 NLP Models and Tools
Contributions are very welcome! Please open a pull request to add missing resources, update entries, or fix any of the TODOs below. Help keep this list current.
- say.by — Telegram
- corpus.by
- ssrlab.by
- bnkorpus.info
- Belarus — GitHub organization
- nlproc.by — GitHub community
| Name | Type | Notes |
|---|---|---|
| Common Voice | STT | Mozilla multilingual speech corpus |
| google/fleurs | STT | |
| knihi.com — Корпус беларускага маўлення | STT | Corpus for neural network training. TODO: confirm dataset type |
| ssrlab | STT | TODO |
| maaxap/BelarusianGLUE | NLP | Paper |
| oscar-corpus/oscar | NLP | |
| allenai/c4 | NLP | |
| poritski/YABC | NLP | Experimental Corpus of the Belarusian Language (Эксперыментальны корпус беларускай мовы, ЭКБМ) |
| Belarus/GrammarDB | NLP | Grammar database of the Belarusian language |
| tsimafeip/Translator | NLP | Russian-Belarusian translation pairs |
| UniversalDependencies/UD_Belarusian-HSE | NLP | Universal Dependencies treebank. Project page |
| Tatoeba — Belarusian | NLP | Belarusian sentences in Tatoeba |
| Name | Metrics | Notes |
|---|---|---|
| ales/wav2vec2-cv-be | WER 12.4 on test split of CV8 | wav2vec2 + kenlm language model, trained on Common Voice 8. Demo, Code |
| ales/whisper-small-belarusian | WER 6.79 on test split of CV11 | Whisper Small fine-tuned on Common Voice 11. Demo, Code |
| ales/whisper-base-belarusian | WER 12.207 on validation split of CV11 | Whisper Base fine-tuned on Common Voice 11. Code |
| openai/whisper | - | Original Whisper models from OpenAI |
| nvidia/stt_be_conformer_ctc_large | WER 4.8 on test split of CV10 | NVIDIA NeMo Conformer-CTC |
| nvidia/stt_be_conformer_transducer_large | WER 3.8 on test split of CV10 | NVIDIA NeMo Conformer-Transducer |
| nvidia/stt_be_fastconformer_hybrid_large_pc | WER 2.72 on test split of CV12; WER P&C 3.87 on test split of CV12 | NVIDIA NeMo FastConformer Hybrid |
| espnet/belarusian_commonvoice_blstm | - | ESPnet BLSTM trained on Common Voice |
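
The Metrics column above reports WER (word error rate): word-level substitutions, deletions, and insertions divided by the number of reference words, shown as a percentage. The sketch below is a minimal illustration of that definition, not the evaluation script used by any of the listed models. The function name `word_error_rate` is hypothetical, and the whitespace-only tokenization is an assumption; published figures usually depend on each project's own text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage, using word-level Levenshtein distance.

    Assumption: words are split on whitespace with no further normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of three reference words.
    print(word_error_rate("добры дзень сябры", "добры дзень сябра"))
```

Because the result depends on tokenization, casing, and punctuation handling, scores computed this way are only comparable across models when the same normalization and the same Common Voice split are used.
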
| Name | Metrics | Notes |
|---|---|---|
| coqui-ai/TTS | - | Official CoquiAI Belarusian recipe |
| jhlfrfufyfn/bel-tts | - | GlowTTS + HifiGAN. Model, HF demo, web server source code |
| alex73/belarusian-tts | - | CoquiAI implementation by Yurii Paniv (@robinhad). Original repo and models deleted — only fork remains |
| Name | Topic | Notes |
|---|---|---|
| alesdrobysh/belmorph | Morphological analysis | Fast TypeScript morphological analyzer, inflection, and lexeme generator. Demo, a really beautiful one! |
| pkasila/bel-sklony | Declension (determining the locative case) | Web page for declining Belarusian nouns (locative case, месны склон, only). Demo |
| poritski/YABC_Tagger | POS-tagging, Lemmatization | Rule-based. Perl. Uses poritski/YABC as grammar base |
| volchek/beltagger | POS-tagging, Lemmatization | Improved C++ port of poritski/YABC_Tagger. Cross-platform. Known issues: requires Windows-1251 input (no UTF-8), tagset not fully compatible with BNKorpus, incomplete grammar base (Belarus/GrammarDB not yet incorporated), suffix table calculation not ported from Perl, depends on Boost |
| stanfordnlp/stanza-be | POS-tagging | Stanza framework |
| KoichiYasuoka/roberta-small-belarusian-upos | POS-tagging | RoBERTa model |
| KoichiYasuoka/roberta-small-belarusian | Masked Language Modeling | RoBERTa model |
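
The volchek/beltagger entry above notes that the tool requires Windows-1251 input rather than UTF-8. A minimal sketch of preparing such input with the Python standard library follows. The helper names `utf8_to_cp1251` and `convert_file` are hypothetical and not part of beltagger, and how the converted file is passed to the tagger is not covered here.

```python
from pathlib import Path


def utf8_to_cp1251(text: str) -> bytes:
    """Encode text as Windows-1251.

    Strict error handling is used, so any character that has no
    Windows-1251 code point raises UnicodeEncodeError instead of
    being silently replaced.
    """
    return text.encode("cp1251", errors="strict")


def convert_file(src: str, dst: str) -> None:
    """Read a UTF-8 text file and write a Windows-1251 copy."""
    text = Path(src).read_text(encoding="utf-8")
    Path(dst).write_bytes(utf8_to_cp1251(text))


if __name__ == "__main__":
    sample = "Прывітанне, свет"
    encoded = utf8_to_cp1251(sample)
    # Decoding with the same codec restores the original text.
    print(encoded.decode("cp1251") == sample)
```

Windows-1251 covers the Belarusian Cyrillic letters, including ў and і, so ordinary Belarusian text converts without loss. Characters outside the code page will raise an error under the strict setting used here.
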