Skip to content

Latest commit

 

History

History
329 lines (246 loc) · 10.7 KB

File metadata and controls

329 lines (246 loc) · 10.7 KB

MWA — Multilingual Word Aligner

Python 3.11+ PyTorch 2.4 HuggingFace Models arXiv License: CC BY-NC 4.0


State-of-the-art open-source speech–text word alignment for 1000+ languages.


Citation

If you use MWA in your research, please cite:

Multilingual Word-Level Forced Alignment with Self-Supervised Representations and Learned Dynamic Programming
Roy Weber, Meidan Zehavi, Rotem Rousso, Joseph Keshet
The 27th Annual Conference of the International Speech Communication Association (Interspeech), 2026
https://arxiv.org/abs/2606.10675

@inproceedings{weber2026multilingual,
  title     = {Multilingual Word-Level Forced Alignment with Self-Supervised Representations and Learned Dynamic Programming},
  author    = {Weber, Roy and Zehavi, Meidan and Rousso, Rotem and Keshet, Joseph},
  booktitle = {Proceedings of the 27th Annual Conference of the International Speech Communication Association (Interspeech)},
  year      = {2026},
  url       = {https://arxiv.org/abs/2606.10675}
}

What is MWA?

MWA maps spoken audio to its transcript at the word level, producing a precise start and end timestamp for every word. It works for read speech, conversational speech, and languages never seen during training.

MWA outperforms all leading speech–text aligners on the TIMIT and Buckeye corpora. It also generalises out of the box to all 1000+ languages supported by Meta's MMS model, including Hebrew, Dutch, German, Arabic, and many more.

Model performance comparison


How it works

MWA is a three-stage ensemble pipeline:

Audio + Transcript
       │
       ├─── MMS-FA ──────────────────┐  character-level emission probs
       │                             │
       └─── UnsupSeg CNN ────────────┤  boundary representations
                                     │
                              Conformer (16 blocks)
                                     │  frame-wise boundary probs @ 10 ms
                                     │
                           Dynamic Programming
                                     │  penalised optimisation over
                                     │  model probs + MMS emissions
                                     │  + acoustic boundary distances
                                     │
                         Word timestamps (CSV + TextGrid)
  1. MMS-FA (Pratap et al., 2023) — Meta's massively multilingual forced aligner provides character-level emission probabilities.
  2. UnsupSeg (Kreuk et al., 2020) — A self-supervised CNN encoder that learns acoustic boundary cues without any labels.
  3. Conformer — A 16-block conformer trained on top of the concatenated features, outputting per-frame boundary probabilities at 10 ms resolution.
  4. DP alignment — A penalty-aware dynamic programming pass combines all signals to place word boundaries at the globally optimal positions.

Models

Name HuggingFace Trained on Best suited for
timit MLSpeech/mwa-timit TIMIT corpus Read / formal speech
buckeye MLSpeech/mwa-buckeye Buckeye corpus Conversational / fluent speech

Both models were trained on American English corpora but leverage MMS-FA features, enabling alignment for all 1000+ languages supported by MMS.

Weights are downloaded automatically from HuggingFace on first use.


Supported Languages

MWA supports all 1000+ languages covered by Meta's MMS model. For the full list of languages and their ISO 639-3 codes, see the official MMS language list:

MMS supported languages and codes

Note: MWA only requires MMS's Language Identification (LID) support, not full ASR support. This means any language listed under LID in the MMS coverage table is supported — a much broader set than the ASR-only languages.

MWA uses uroman to romanize non-Latin scripts automatically. Pass the ISO 639-3 code via --language.

Example languages: English (eng), Spanish (spa), French (fra), German (deu), Arabic (ara), Hindi (hin), Mandarin Chinese (cmn), Japanese (jpn), Russian (rus), Portuguese (por), Italian (ita), Dutch (nld), Korean (kor), Turkish (tur), Polish (pol), Swedish (swe), Hebrew (heb), Persian (fas), Vietnamese (vie), Swahili (swh), ...


Installation

Requirements: Python 3.11, 16 kHz audio input.

Conda (recommended)

git clone https://github.com/MLSpeech/Multilingual-Word-Aligner.git
cd Multilingual-Word-Aligner

conda env create -f environment.yml
conda activate Mwa_venv
pip install -e .

pip / venv

git clone https://github.com/MLSpeech/Multilingual-Word-Aligner.git
cd Multilingual-Word-Aligner

python3.11 -m venv Mwa_venv
source Mwa_venv/bin/activate        # Linux / macOS
# Mwa_venv\Scripts\activate         # Windows

pip install -r requirements.txt
pip install -e .

Usage

Simple CLI

mwa align <model_name> [language] --input_dir <path> --output_dir <path>
# Align all files in a directory (English, conversational)
mwa align buckeye eng --input_dir ./data/ --output_dir ./results/

# Language defaults to 'eng' when omitted
mwa align timit --input_dir ./data/ --output_dir ./results/

Full align_wav.py interface

python align_wav.py \
    --wav_input        /path/to/audio/       \
    --transcript_input /path/to/transcripts/ \
    --language         eng                   \
    --model_name       buckeye               \
    --device           cuda:0                \
    --output_folder    ./results/

All arguments

Argument Type Default Description
--wav_input path .wav/.flac file or directory of audio files
--transcript_input path .txt/.TextGrid file or directory of transcript files
--language str eng ISO 639-3 language code (see Supported Languages)
--model_name str timit Pretrained model: timit or buckeye
--device str cpu PyTorch device: cpu, cuda:0, cuda:1, …
--output_folder path Output directory (created automatically if missing)
--no_graph flag off Suppress PNG visualisation output
--no_csv flag off Suppress CSV and TextGrid output

Data Preparation

Organise your files so that each audio file has a matching transcript with the same base name:

dataset/
├── interview_01.wav
├── interview_01.txt        ← plain text, one utterance per line
├── lecture_02.flac
├── lecture_02.TextGrid     ← Praat TextGrid with a "sentence" tier
└── ...

Transcript formats

Format Rules
.txt One sentence per file; words separated by spaces
.TextGrid Praat format; must contain a tier named sentence

Audio requirements: .wav or .flac, 16 kHz sample rate.


Output Reference

File Contents
<name>.csv Word, Start_Time, End_Time — one row per word, times in seconds
<name>.TextGrid Praat TextGrid with a words interval tier
<name>_graph1.png Waveform (top) and frame-level boundary probabilities (bottom) with DP boundaries overlaid in blue and Conformer predictions in green

GPU Acceleration

Pass --device cuda:0 to move all models to GPU. This is strongly recommended for large batches. All three models (MMS, UnsupSeg, Conformer) are loaded once at startup and shared across every file in the batch, so per-file cost is pure inference with no reload overhead.

python align_wav.py \
    --wav_input        /data/corpus/ \
    --transcript_input /data/corpus/ \
    --language         eng           \
    --model_name       buckeye       \
    --device           cuda:0        \
    --output_folder    ./results/

Running the Bundled Examples

The repository ships with two ready-to-run examples inside inference/examples/:

inference/examples/
├── english.wav          # "The car is going too fast"
├── english.txt          # transcript (.txt format)
├── english.TextGrid     # same transcript (.TextGrid format)
├── german.wav           # "wer möchte keinen Kuchen"
└── german.txt           # transcript

Example 1 — English (.txt transcript)

python align_wav.py \
    --wav_input        inference/examples/english.wav \
    --transcript_input inference/examples/english.txt \
    --language         eng \
    --model_name       timit \
    --output_folder    results/

Example 2 — English (.TextGrid transcript)

python align_wav.py \
    --wav_input        inference/examples/english.wav \
    --transcript_input inference/examples/english.TextGrid \
    --language         eng \
    --model_name       buckeye \
    --output_folder    results/

Example 3 — German

python align_wav.py \
    --wav_input        inference/examples/german.wav \
    --transcript_input inference/examples/german.txt \
    --language         deu \
    --model_name       timit \
    --output_folder    results/

Expected output

After running any example, the results/ folder will contain:

results/
├── english.csv           # word-level timestamps
├── english.TextGrid      # Praat TextGrid with a "words" tier
└── english_graph1.png    # waveform + probability visualisation

english.csv looks like:

Word,Start_Time,End_Time
THE,0.0,0.12
CAR,0.12,0.31
IS,0.31,0.45
GOING,0.45,0.67
TOO,0.67,0.84
FAST,0.84,1.07

english.TextGrid can be opened directly in Praat and contains a words tier with one labelled interval per word.


Acknowledgements

This work was supported by NSF DRL Grant No. 2219843 and BSF Grant No. 2022618. We also thank Rob van Son for his guidance and support with the IFA Corpus.