navalnica/be_nlp_speech_resources

Belarusian NLP and Speech Processing resources

A curated list of Belarusian Natural Language and Speech Processing resources, datasets, and models.

Inspired by egorsmkv/speech-recognition-uk.

🤝 Contributing

Contributions are very welcome! Please open a pull request to add missing resources, update entries, or fix any of the TODOs below. Help keep this list current.

🏡 Communities and platforms

📚 Datasets

| Name | Type | Notes |
| --- | --- | --- |
| Common Voice | STT | Mozilla multilingual speech corpus |
| google/fleurs | STT | |
| knihi.com - Корпус беларускага маўлення (Belarusian Speech Corpus) | STT | Corpus for neural network training. TODO: confirm dataset type |
| ssrlab | STT | TODO |
| maaxap/BelarusianGLUE | NLP | Paper |
| oscar-corpus/oscar | NLP | |
| allenai/c4 | NLP | |
| poritski/YABC | NLP | Эксперыментальны корпус беларускай мовы (ЭКБМ), the Experimental Corpus of the Belarusian Language |
| Belarus/GrammarDB | NLP | Grammar database of the Belarusian language |
| tsimafeip/Translator | NLP | Russian-Belarusian translation pairs |
| UniversalDependencies/UD_Belarusian-HSE | NLP | Universal Dependencies treebank. Project page |
| Tatoeba - Belarusian | NLP | Belarusian sentences in Tatoeba |

🎤 Speech-to-Text Models

| Name | Metrics | Notes |
| --- | --- | --- |
| ales/wav2vec2-cv-be | WER 12.4 on test split of CV8 | wav2vec2 + kenlm language model, trained on Common Voice 8. Demo, Code |
| ales/whisper-small-belarusian | WER 6.79 on test split of CV11 | Whisper Small fine-tuned on Common Voice 11. Demo, Code |
| ales/whisper-base-belarusian | WER 12.207 on validation split of CV11 | Whisper Base fine-tuned on Common Voice 11. Code |
| openai/whisper | - | Original Whisper models from OpenAI |
| nvidia/stt_be_conformer_ctc_large | WER 4.8 on test split of CV10 | NVIDIA NeMo Conformer-CTC |
| nvidia/stt_be_conformer_transducer_large | WER 3.8 on test split of CV10 | NVIDIA NeMo Conformer-Transducer |
| nvidia/stt_be_fastconformer_hybrid_large_pc | WER 2.72 on test split of CV12; WER P&C 3.87 on test split of CV12 | NVIDIA NeMo FastConformer Hybrid |
| espnet/belarusian_commonvoice_blstm | - | ESPnet BLSTM trained on Common Voice |
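
The Metrics column above reports word error rate (WER): word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below shows one way to compute it, and why a "P&C" score (assumed here to mean punctuation and capitalization are kept) is usually higher than a score on normalized text. The `normalize` rule and the sample sentences are illustrative assumptions; each model card may apply its own text normalization, so the numbers in the table are not necessarily comparable to what this sketch produces.

```python
import re


def word_errors(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, 1):
        cur = [i]
        for j, h in enumerate(hyp_words, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1]


def normalize(text):
    """Illustrative normalization: lowercase, drop punctuation except apostrophes."""
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def wer(reference, hypothesis, normalize_text=True):
    """Return WER in percent. normalize_text=False scores punctuation and case too."""
    ref = normalize(reference) if normalize_text else reference.split()
    hyp = normalize(hypothesis) if normalize_text else hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return 100.0 * word_errors(ref, hyp) / len(ref)


if __name__ == "__main__":
    reference = "Добры дзень, сябры!"
    hypothesis = "добры дзень сябра"
    print(round(wer(reference, hypothesis), 2))             # 33.33
    print(wer(reference, hypothesis, normalize_text=False))  # 100.0
```

With normalization, only the last word differs (1 error in 3 words). Without it, the capital letter and both punctuation marks also count as errors, which is why scores measured on different splits or with different normalization should be compared with care.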

🤖 Text-to-Speech Models

| Name | Metrics | Notes |
| --- | --- | --- |
| coqui-ai/TTS | - | Official CoquiAI Belarusian recipe |
| jhlfrfufyfn/bel-tts | - | GlowTTS + HifiGAN. Model, HF demo, web server source code |
| alex73/belarusian-tts | - | CoquiAI implementation by Yurii Paniv (@robinhad). The original repo and models were deleted; only this fork remains |

📝 NLP Models and Tools

| Name | Topic | Notes |
| --- | --- | --- |
| alesdrobysh/belmorph | Morphological analysis | Fast TypeScript morphological analyzer, inflection, and lexeme generator. Demo, a really beautiful one! |
| pkasila/bel-sklony | Declension (вызначэнне меснага склону, determining the locative case) | Web page for Belarusian noun declension (месны склон, the locative case, only). Demo |
| poritski/YABC_Tagger | POS-tagging, Lemmatization | Rule-based. Perl. Uses poritski/YABC as grammar base |
| volchek/beltagger | POS-tagging, Lemmatization | Improved C++ port of poritski/YABC_Tagger. Cross-platform. Known issues: requires Windows-1251 input (no UTF-8), tagset not fully compatible with BNKorpus, incomplete grammar base (Belarus/GrammarDB not yet incorporated), suffix table calculation not ported from Perl, depends on Boost |
| stanfordnlp/stanza-be | POS-tagging | Stanza framework |
| KoichiYasuoka/roberta-small-belarusian-upos | POS-tagging | RoBERTa model |
| KoichiYasuoka/roberta-small-belarusian | Masked Language Modeling | RoBERTa model |
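
Since volchek/beltagger is listed as requiring Windows-1251 input, UTF-8 text has to be re-encoded before it is passed to the tagger. A minimal sketch of that conversion follows. How beltagger actually reads its input is not described in this list, so the function name and the file names in the comment are hypothetical.

```python
def utf8_to_cp1251(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as Windows-1251 (Python codec name: cp1251).

    Raises UnicodeEncodeError if the text contains a character that
    Windows-1251 cannot represent, instead of silently dropping it.
    """
    return data.decode("utf-8").encode("cp1251", errors="strict")


if __name__ == "__main__":
    sample = "Прывітанне, ўсім!"
    converted = utf8_to_cp1251(sample.encode("utf-8"))
    # Windows-1251 is a single-byte encoding: one byte per character.
    print(len(sample), len(converted))

    # Hypothetical file-based usage (paths are placeholders):
    # with open("input_utf8.txt", "rb") as src, open("input_cp1251.txt", "wb") as dst:
    #     dst.write(utf8_to_cp1251(src.read()))
```

The Belarusian letters і and ў are both present in Windows-1251, so ordinary Belarusian Cyrillic text converts without loss. Typographic characters outside the code page raise an error under `errors="strict"` and need to be replaced beforehand.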
