Vocal Analysis - Bioacoustic M1/M2 Analysis

Portugues | English

Vocal Analysis - Bioacoustic M1/M2 Analysis

Computational analysis of laryngeal mechanisms (M1/M2) in Brazilian Choro recordings, focusing on the voice of Ademilde Fonseca.

Goal

Challenge the traditional "Fach" vocal classification system through a physiological analysis of laryngeal mechanisms. The pipeline extracts explainable bioacoustic features (f0, HNR, CPPS, spectral energy ratios) and classifies vocal mechanisms using four complementary methods: frequency threshold, GMM clustering, XGBoost with pseudo-labels, and the VMI (Vocal Mechanism Index) — a tessitura-agnostic continuous metric based on spectral features.

Stack

torchcrepe: SOTA f0 (pitch) extraction via CNN
parselmouth (Praat): HNR, CPPS, Jitter, Shimmer, Formants
numpy/scipy: Spectral features (Alpha Ratio, H1-H2, Spectral Tilt)
xgboost: Tabular M1/M2 classification with pseudo-labels
scikit-learn: GMM clustering for unsupervised mechanism detection
seaborn/matplotlib: Academic visualizations
google-generativeai: Multimodal narrative report generation (Gemini 2.0 Flash)

Setup

Requirements

Python 3.10+
UV (package manager)
(Optional) Google Gemini API Key for AI-powered reports

Installation

# Clone the repository
git clone <repo-url>
cd vocal-analysis

# Install dependencies
uv sync

# Install development dependencies (ruff, pytest, jupyter)
uv sync --extra dev

Configure Gemini (optional)

To generate AI-powered narrative reports:

Go to Google AI Studio
Click "Create API Key"
Set the environment variable:

export GEMINI_API_KEY=your_key_here

Or add it to your .bashrc/.zshrc to persist.

Using a .env file:

Copy the template:
```
cp .env.example .env
```
Edit .env with your settings
Load into environment:
```
source .env
```

Configure Report Language

The language of generated reports (analysis_report.md, vmi_analysis.md, llm_report.md) is controlled by the REPORT_LANG variable:

# English reports (default)
REPORT_LANG=en

# Portuguese reports
REPORT_LANG=pt-BR

Configure Excerpts (optional)

You can define specific time intervals for each song for analysis and validation plots. Useful for focusing on vocal passages without instrumental introductions.

In .env, use the format EXCERPT_<NAME>=MMSS-MMSS:

# From second 22 to 1:03
EXCERPT_DELICADO="0022-0103"

# From second 33 to 1:04
EXCERPT_BRASILEIRINHO="0033-0104"

# From second 7 to 23
EXCERPT_APANHEITE_CAVAQUINHO="0007-0023"

Excerpts are automatically used in validation plots (--validate-separation).

Project Structure

vocal-analysis/
├── src/vocal_analysis/
│   ├── preprocessing/
│   │   ├── audio.py              # load_audio(), normalize_audio()
│   │   ├── separation.py         # Source separation (HTDemucs)
│   │   └── process_ademilde.py   # Feature extraction script
│   ├── features/
│   │   ├── extraction.py         # Hybrid Crepe + Praat pipeline
│   │   ├── spectral.py           # Spectral features (Alpha Ratio, H1-H2, etc.)
│   │   ├── vmi.py                # Vocal Mechanism Index computation
│   │   └── articulation.py       # Articulatory agility features
│   ├── analysis/
│   │   ├── exploratory.py        # M1/M2 analysis, clustering
│   │   ├── run_analysis.py       # Full analysis script
│   │   └── llm_report.py         # Gemini-powered report generation
│   ├── modeling/
│   │   └── classifier.py         # XGBoost for M1/M2 classification
│   ├── scripts/
│   │   └── regenerate_validation_plot.py  # Regenerate plots without reprocessing
│   ├── visualization/
│   │   └── plots.py              # Academic plots
│   └── utils/
│       └── pitch.py              # Hz <-> Note conversion (A4, C5, etc.)
├── data/
│   ├── raw/                      # Original audio files (.mp3)
│   └── processed/                # CSVs, JSONs, logs
├── docs/
│   ├── en/                       # English documentation
│   └── pt-BR/                    # Portuguese documentation
├── outputs/
│   ├── plots/                    # Generated charts
│   └── models/                   # Trained models
└── tests/

Platform Usage

macOS

Quick validation: Use --use-praat-f0 to test the pipeline
Limitation: CREPE full may crash due to memory (32GB+ recommended)
Recommendation: Use macOS for validation only, process with CREPE on Colab/Windows

# Quick validation on macOS (Praat F0)
uv run python -m vocal_analysis.preprocessing.process_ademilde --use-praat-f0

Windows/Linux (32GB+ RAM)

Full processing: CREPE full with all features
GPU (NVIDIA): Use --device cuda for acceleration (~10x faster)

# Windows/Linux with CPU
uv run python -m vocal_analysis.preprocessing.process_ademilde

# Windows/Linux with NVIDIA GPU
uv run python -m vocal_analysis.preprocessing.process_ademilde --device cuda

Google Colab (Recommended!)

Free T4 GPU: ~12-15h/day of usage
Fast processing: 3 songs (~7min each) in ~10 minutes
Zero configuration: Environment ready to go

Full guide: docs/en/COLAB_SETUP.md

Quick start:

# In Colab with T4 GPU enabled
!git clone https://github.com/arthurcornelio88/vocal-analysis.git
%cd vocal-analysis
!pip install uv && uv pip install --system -e .

# Verify installation
!python -c "import vocal_analysis; print('Installed!')"

# Process with CREPE + GPU
!python src/vocal_analysis/preprocessing/process_ademilde.py --device cuda

Usage

1. Add audio files

Place MP3 files in data/raw/:

data/raw/
├── Apanhei-te Cavaquinho.mp3
├── brasileirinho.mp3
└── delicado.mp3

2. Process audio (extract features)

# Full processing with CREPE (requires GPU or 32GB+ RAM)
uv run python -m vocal_analysis.preprocessing.process_ademilde

# With GPU (Google Colab, Windows/Linux with NVIDIA)
uv run python -m vocal_analysis.preprocessing.process_ademilde --device cuda

# Fast mode with Praat (macOS, validation)
uv run python -m vocal_analysis.preprocessing.process_ademilde --use-praat-f0

Available options:

--device cuda: Use GPU (requires CUDA)
--use-praat-f0: Use Praat instead of CREPE (faster, less accurate)
--crepe-model {tiny,small,full}: Choose CREPE model (default: full)
--skip-formants: Skip formant extraction (~30% faster)
--skip-jitter-shimmer: Skip jitter/shimmer (~20% faster)
--skip-cpps: Skip CPPS (avoids macOS hang)
--skip-plots: Don't generate F0 plots
--limit N: Process only N files (useful for testing)
--fast: Enable all optimizations (Praat + no formants/jitter/shimmer/cpps/plots)
--no-separate-vocals: Disable source separation (HTDemucs). By default, vocal separation is enabled to improve pitch detection in complex arrangements

Generated outputs:

data/processed/ademilde_features.csv - Per-frame features
data/processed/ademilde_metadata.json - Structured metadata
data/processed/processing_log.md - Human-readable log
outputs/plots/*_f0.png - Pitch contours

2.1. Regenerate validation plots (without reprocessing)

If you've already processed the audio and just want to regenerate validation plots:

# List songs with available cache
uv run python -m vocal_analysis.scripts.regenerate_validation_plot

# Regenerate plot for a specific song
uv run python -m vocal_analysis.scripts.regenerate_validation_plot --song "Apanhei-te Cavaquinho"

# Regenerate all plots
uv run python -m vocal_analysis.scripts.regenerate_validation_plot --all

# Use CREPE instead of Praat (slower, more accurate)
uv run python -m vocal_analysis.scripts.regenerate_validation_plot --all --use-crepe

The script uses separated vocal data cached in data/cache/separated/ and excerpt intervals defined in .env.

3. Run exploratory analysis

uv run python -m vocal_analysis.analysis.run_analysis

The analysis script runs four classification methods in sequence:

Frequency threshold (400 Hz / G4) — simple binary split
GMM clustering — unsupervised discovery of M1/M2 clusters in f0 x HNR space
XGBoost — supervised classifier trained on GMM pseudo-labels with additional features
VMI (Vocal Mechanism Index) — continuous 0-1 metric based on spectral features (Alpha Ratio, H1-H2, CPPS, Spectral Tilt), independent of fixed frequency thresholds

Generated outputs:

outputs/plots/mechanism_analysis.png - Threshold-based M1/M2 analysis
outputs/plots/mechanism_clusters.png - GMM clustering visualization
outputs/plots/vmi_scatter.png - VMI distribution by spectral features
outputs/plots/xgb_mechanism_timeline.png - Temporal contour colored by XGBoost prediction
outputs/xgb_predictions.csv - Per-frame predictions (all methods)
outputs/analysis_report.md - Structured report with classification metrics
outputs/vmi_analysis.md - VMI report with per-song breakdown
outputs/llm_report.md - Narrative report with Gemini (if API configured)

Multimodal LLM Report

If GEMINI_API_KEY is configured, the script generates a narrative report using Gemini 2.0 Flash with:

Multimodal analysis: The LLM receives plots alongside numerical data
Clickable links: Graph references include markdown links (e.g., [brasileirinho_f0](plots/brasileirinho_f0.png))
Figure index: Complete list of visualizations at the end of the report

4. Generate LLM report only (optional)

To generate just the narrative report with Gemini (without re-running the full analysis):

uv run python -m vocal_analysis.analysis.llm_report

Optional parameters:

--metadata: Path to metadata file (default: data/processed/ademilde_metadata.json)
--stats: Path to M1/M2 statistics JSON (optional)
--output: Output path for the report (default: outputs/llm_report.md)
--plots-dir: Directory with PNG plots for multimodal analysis (default: outputs/plots/)
--lang: Report language (en or pt-BR, default: en)

Prerequisites:

Gemini API key configured: export GEMINI_API_KEY=your_key_here
Processed data files available (after running steps 1-3)
Generated plots (optional, for multimodal analysis)

Common errors:

"API key not valid": Check that the API key is correct and active at Google AI Studio
"quota exceeded": The free Gemini tier has usage limits. Wait for quota reset or upgrade.
Deprecation warning: The google-generativeai package is deprecated but still functional. Ignore the warning.

5. Extracted features

Core features (per-frame, from process_ademilde):

Column	Description
`time`	Timestamp in seconds
`f0`	Fundamental frequency (Hz)
`confidence`	Pitch estimation confidence (0-1)
`hnr`	Harmonic-to-Noise Ratio (dB)
`energy`	RMS energy
`f1, f2, f3, f4`	Formants 1-4 (Hz)
`song`	Song name
`cpps_global`	Cepstral Peak Prominence (global per song)
`jitter`	Jitter ppq5 - period instability (%)
`shimmer`	Shimmer apq11 - amplitude instability (%)

Spectral features (added by run_analysis if spectral data available):

Column	Description
`alpha_ratio`	Energy ratio 0-1 kHz vs 1-5 kHz (dB)
`h1_h2`	Difference between 1st and 2nd harmonic (glottal slope, dB)
`spectral_tilt`	Power spectrum slope (dB/octave)
`cpps_per_frame`	Cepstral Peak Prominence per frame

Classification and VMI (added by run_analysis):

Column	Description
`mechanism`	GMM label (M1/M2)
`xgb_mechanism`	XGBoost prediction (M1/M2)
`vmi`	Vocal Mechanism Index (0.0 - 1.0)
`vmi_label`	VMI category (M1_HEAVY, M1_LIGHT, MIX_PASSAGGIO, M2_REINFORCED, M2_LIGHT)
`f0_velocity`	Pitch change rate (Hz/s)
`f0_acceleration`	Pitch acceleration (Hz/s^2)
`syllable_rate`	Syllabic rate (syllables/s)

Utilities

Hz <-> Note Conversion

from vocal_analysis.utils.pitch import hz_to_note, note_to_hz

hz_to_note(440.0)        # "A4"
hz_to_note(261.63)       # "C4"
note_to_hz("G5")         # 783.99

Development

Linting

uv run ruff check src/
uv run ruff format src/

Tests

uv run pytest -v

Documentation

Bioacoustic Glossary — Accessible introduction to f0, HNR, formants, jitter, shimmer, VMI and why each matters
Methodology — Full technical reference: preprocessing, feature extraction, 4 classification methods, spectral features, VMI computation, data structure, limitations, and academic references
Colab Setup — Step-by-step guide for running on Google Colab with free GPU

Name		Name	Last commit message	Last commit date
Latest commit History 52 Commits
.github/workflows		.github/workflows
data		data
docs		docs
outputs		outputs
src/vocal_analysis		src/vocal_analysis
tests		tests
.env.example		.env.example
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
.python-version		.python-version
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Vocal Analysis - Bioacoustic M1/M2 Analysis

Goal

Stack

Setup

Requirements

Installation

Configure Gemini (optional)

Configure Report Language

Configure Excerpts (optional)

Project Structure

Platform Usage

macOS

Windows/Linux (32GB+ RAM)

Google Colab (Recommended!)

Usage

1. Add audio files

2. Process audio (extract features)

2.1. Regenerate validation plots (without reprocessing)

3. Run exploratory analysis

Multimodal LLM Report

4. Generate LLM report only (optional)

5. Extracted features

Utilities

Hz <-> Note Conversion

Development

Linting

Tests

Documentation

About

Uh oh!

Releases 4

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Vocal Analysis - Bioacoustic M1/M2 Analysis

Goal

Stack

Setup

Requirements

Installation

Configure Gemini (optional)

Configure Report Language

Configure Excerpts (optional)

Project Structure

Platform Usage

macOS

Windows/Linux (32GB+ RAM)

Google Colab (Recommended!)

Usage

1. Add audio files

2. Process audio (extract features)

2.1. Regenerate validation plots (without reprocessing)

3. Run exploratory analysis

Multimodal LLM Report

4. Generate LLM report only (optional)

5. Extracted features

Utilities

Hz <-> Note Conversion

Development

Linting

Tests

Documentation

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases 4

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages