🚀 Chatterbox-TTS-Extended — All Features & Technical Explanations

Chatterbox-TTS-Extended is a power-user TTS pipeline for advanced single and batch speech synthesis, voice conversion, and artifact-reduced audio generation. It is based on Chatterbox-TTS, but adds:

Multi-file input & batch output
Custom candidate generation & validation
Rich audio post-processing
Whisper/faster-whisper validation
Voice conversion (VC) tab
Full-featured persistent UI with parallelism and artifact reduction
Optional audio denoising with pyrnnoise - removes most artifacts

📋 Table of Contents

Feature Summary Table
Text Input & File Handling
Reference Audio
Voice/Emotion/Synthesis Controls
Batching, Chunking & Grouping
Text Preprocessing
Audio Post-Processing
Export & Output Options
Generation Logic & Quality Control
Whisper Sync & Validation
Parallel Processing & Performance
Persistent Settings & UI
🎙️ Voice Conversion (VC) Tab
Tips & Troubleshooting
Installation
Feedback & Contributions
Known Bugs

Feature Summary Table

Feature	UI Exposed?	Script Logic
Text input (box + multi-file upload)	✔	Yes
Reference audio (conditioning)	✔	Yes
Separate/merge file output	✔	Yes
Emotion, CFG, temperature, seed	✔	Yes
Batch/smart-append/split (sentences)	✔	Yes
Sound word remove/replace	✔	Yes
Inline reference number removal	✔	Yes
Dot-letter ("J.R.R.") correction	✔	Yes
Lowercase & whitespace normalization	✔	Yes
Auto-Editor post-processing	✔	Yes
pyrnnoise denoising (RNNoise)	✔	Yes
FFmpeg normalization (EBU/peak)	✔	Yes
WAV/MP3/FLAC export	✔	Yes
Candidates per chunk, retries, fallback	✔	Yes
Parallelism (workers)	✔	Yes
Whisper/faster-whisper backend	✔	Yes
Persistent settings (JSON/CSV per output)	✔	Yes
Settings load/save in UI	✔	Yes
Audio preview & download	✔	Yes
Help/Instructions	✔ (Accordion)	Yes
Voice Conversion (VC tab)	✔	Yes

Text Input & File Handling

Text box: For direct text entry (single or multi-line).
Multi-file upload: Drag-and-drop any number of .txt files.
- Choose to merge them into one audio or process each as a separate output (toggle in the UI).
- Outputs are named for sorting and reproducibility.
Reference audio input: Upload or record a sample to condition the generated voice.
Settings file support: Load or save all UI settings as JSON for easy workflow repeatability.

Reference Audio

Voice Prompt (Conditioning):
- Upload or record an audio reference.
- The TTS engine mimics the style, timbre, or emotion from the provided sample.
- Handles missing/invalid reference audio gracefully.

Voice/Emotion/Synthesis Controls

Emotion exaggeration: Slider (0 = flat/neutral, 1 = normal, 2 = exaggerated emotion).
CFG Weight/Pace: Controls strictness and speech pacing. High = literal, monotone. Low = expressive, dynamic.
Temperature: Controls voice randomness/variety.
Random seed: 0 = new random each run. Any number = repeatable generations.
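
The seed rule above can be sketched in a few lines. This is a minimal illustration, not the project's code: the function name `resolve_base_seed` and the 32-bit range are assumptions.

```python
import random


def resolve_base_seed(seed: int) -> int:
    """Return the seed for this run; 0 means 'pick a fresh random seed'."""
    if seed == 0:
        # Assumed range: any non-zero 32-bit value, so the result can be
        # written to the settings file and reused later for a repeat run.
        return random.randint(1, 2**32 - 1)
    return seed
```

Logging the resolved value is what makes a "random" run repeatable afterwards: feed the logged number back in as the seed.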

Batching, Chunking & Grouping

Sentence batching: Groups sentences up to ~300 characters per chunk (adjustable in code).
Smart-append short sentences: When batching is off, merges very short sentences for smoother prosody.
Recursive long sentence splitting: Automatically splits long sentences at ; : - , or by character count.
Parallel chunk processing: Multiple chunks are generated at once for speed (user control).
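
As a rough sketch of how sentence batching and recursive splitting could fit together (the function names, the naive sentence splitter, and the "split nearest the middle" choice are all assumptions for illustration, not the script's actual logic):

```python
import re

MAX_CHARS = 300  # approximate chunk limit mentioned above (hypothetical name)


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [p for p in re.split(r"(?<=[.!?])\s+", text.strip()) if p]


def split_long(sentence: str, limit: int = MAX_CHARS) -> list[str]:
    """Recursively split an over-long sentence at ; : - , then by length."""
    if len(sentence) <= limit:
        return [sentence]
    for sep in (";", ":", "-", ","):
        positions = [i for i, ch in enumerate(sentence)
                     if ch == sep and 0 < i < len(sentence) - 1]
        if positions:
            # Cut at the separator closest to the middle to keep halves balanced.
            mid = len(sentence) // 2
            cut = min(positions, key=lambda i: abs(i - mid)) + 1
            left, right = sentence[:cut].strip(), sentence[cut:].strip()
            if left and right:
                return split_long(left, limit) + split_long(right, limit)
    # No usable separator: fall back to a hard cut by character count.
    head, tail = sentence[:limit].strip(), sentence[limit:].strip()
    return [head] + (split_long(tail, limit) if tail else [])


def batch_sentences(sentences: list[str], limit: int = MAX_CHARS) -> list[str]:
    """Greedily pack sentences into chunks of at most `limit` characters."""
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        for piece in split_long(sentence, limit):
            candidate = f"{current} {piece}".strip()
            if current and len(candidate) > limit:
                chunks.append(current)
                current = piece
            else:
                current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Note that splitting on `-` in this sketch would also break hyphenated words; a real implementation would likely be more careful about that.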

Text Preprocessing

Lowercase conversion: Makes all text lowercase (optional).
Whitespace normalization: Strips extra spaces/newlines.
Dot-letter fix: Converts "J.R.R." to "J R R" to improve initialisms and names.
Inline reference number removal: Automatically removes numbers after sentence-ending punctuation (e.g., .188 or .”3).
Sound word removal/replacement: Configurable list for unwanted noises or phrases, e.g., um, ahh, or mappings like zzz=>sigh.
- Handles standalone words, possessives, quoted patterns, and dash/punctuation-only removals.
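
The preprocessing steps above map naturally onto a handful of regular expressions. The sketch below is illustrative only: the function names, the comma/newline list format, and the guard that leaves decimals like `3.14` untouched are assumptions, and it covers standalone words only (not the possessive, quoted, or dash-only cases).

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces/newlines into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def fix_dot_letters(text: str) -> str:
    """Turn initialisms such as 'J.R.R.' into 'J R R'."""
    def repl(match: re.Match) -> str:
        return " ".join(match.group(0).replace(".", " ").split())
    return re.sub(r"\b(?:[A-Za-z]\.){2,}", repl, text)


def remove_inline_reference_numbers(text: str) -> str:
    """Drop digits glued to sentence-ending punctuation ('.188', '.”3').

    The (?<!\\d) guard is an assumption added here so that decimals
    such as 3.14 are left alone.
    """
    return re.sub(r"(?<!\d)([.!?][\"'”’]?)\d+", r"\1", text)


def parse_sound_words(spec: str) -> dict[str, str]:
    """Parse 'um, ahh, zzz=>sigh' into {word: replacement} ('' = remove)."""
    mapping: dict[str, str] = {}
    for entry in re.split(r"[,\n]", spec):
        entry = entry.strip()
        if not entry:
            continue
        word, _, replacement = entry.partition("=>")
        mapping[word.strip()] = replacement.strip()
    return mapping


def apply_sound_words(text: str, mapping: dict[str, str]) -> str:
    """Remove or replace each listed word where it appears as a whole word."""
    for word, replacement in mapping.items():
        pattern = r"\b" + re.escape(word) + r"\b"
        # A function replacement avoids backslash interpretation in `replacement`.
        text = re.sub(pattern, lambda _m, r=replacement: r, text, flags=re.IGNORECASE)
    return normalize_whitespace(text)
```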

Audio Post-Processing

pyrnnoise Denoising (RNNoise):
- Optional toggle that removes most generation artifacts.
- Runs before Auto-Editor and normalization.
- Uses the denoise CLI if available; otherwise falls back to the Python API.
- Temporary conversion to 48 kHz mono s16 for best compatibility; output is restored to original sample rate.
Auto-Editor integration:
- Trims silences/stutters/artifacts after generation.
- Threshold and margin are adjustable in UI.
- Option to keep original WAV before cleanup.
FFmpeg normalization:
- EBU R128: Target loudness, true peak, dynamic range.
- Peak: Quick normalization to prevent clipping.
- All normalization parameters are user-adjustable.
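
For orientation, the two FFmpeg steps mentioned above (the temporary 48 kHz mono s16 conversion and EBU R128 loudness normalization) could be expressed as the command lines below. This is a sketch under assumptions: the helper names are made up, the default values shown are FFmpeg's own `loudnorm` defaults rather than this project's, and the script may build its commands differently.

```python
def to_rnnoise_format_cmd(src: str, dst: str) -> list[str]:
    """FFmpeg arguments converting `src` to 48 kHz, mono, signed 16-bit PCM."""
    return ["ffmpeg", "-y", "-i", src,
            "-ar", "48000", "-ac", "1", "-sample_fmt", "s16", dst]


def loudnorm_cmd(src: str, dst: str, target_i: float = -24.0,
                 true_peak: float = -2.0, lra: float = 7.0) -> list[str]:
    """FFmpeg arguments for EBU R128 normalization via the loudnorm filter.

    I = integrated loudness target (LUFS), TP = true peak (dBTP),
    LRA = loudness range.
    """
    audio_filter = f"loudnorm=I={target_i}:TP={true_peak}:LRA={lra}"
    return ["ffmpeg", "-y", "-i", src, "-af", audio_filter, dst]
```

Either list can be passed to `subprocess.run(cmd, check=True)` once FFmpeg is on PATH.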

Export & Output Options

Multiple audio formats: WAV (uncompressed), MP3 (320k), FLAC (lossless). Any/all selectable in UI.
Output file naming: Each output includes base name, timestamp, generation, and seed for tracking.
Batch export: If “separate files” is checked, each uploaded text file gets its own processed output.
Disable watermarking: Optional toggle to disable watermarking during generation.
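
To illustrate the naming scheme (base name, timestamp, generation, seed), here is a hypothetical helper. The exact pattern the script uses is not documented here, so the separator and field order below are assumptions.

```python
from datetime import datetime
from typing import Optional


def build_output_name(base: str, generation: int, seed: int,
                      ext: str = "wav", now: Optional[datetime] = None) -> str:
    """Compose a sortable file name carrying everything needed to trace a take."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return f"{base}_{stamp}_gen{generation}_seed{seed}.{ext}"
```

Putting the timestamp right after the base name keeps takes from the same input grouped together and in chronological order when sorted by name.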

Generation Logic & Quality Control

Number of generations: Produce multiple different outputs at once (“takes”).
Candidates per chunk: For each chunk, generate multiple variants.
Max attempts per candidate: If validation fails, retry up to N times.
Deterministic seeding: A per-chunk/per-candidate/per-attempt seed is derived from the base seed for reproducibility.
Fallback strategies: If all candidates fail validation, use the longest transcript or highest similarity score.
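
The interplay of candidates, retries, derived seeds, and fallback can be sketched as follows. Everything here is an assumption made for illustration: the hash-based seed derivation, the `generate` and `score` callables, the 0.85 threshold, and the choice to stop at the first passing candidate. The longest-transcript fallback is left out for brevity.

```python
import hashlib
from typing import Callable, Tuple


def derive_seed(base_seed: int, chunk: int, candidate: int, attempt: int) -> int:
    """Deterministically mix the base seed with chunk/candidate/attempt indices."""
    key = f"{base_seed}:{chunk}:{candidate}:{attempt}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big")


def pick_candidate(chunk_index: int, text: str, base_seed: int,
                   generate: Callable[[str, int], object],
                   score: Callable[[object, str], float],
                   num_candidates: int = 3, max_attempts: int = 2,
                   threshold: float = 0.85) -> Tuple[object, int, bool]:
    """Return (audio, seed, passed) for one chunk.

    Tries each candidate up to `max_attempts` times; if nothing reaches
    `threshold`, falls back to the highest-scoring attempt.
    """
    if num_candidates < 1 or max_attempts < 1:
        raise ValueError("need at least one candidate and one attempt")
    best = None  # (score, audio, seed) of the best failed attempt so far
    for cand in range(num_candidates):
        for attempt in range(max_attempts):
            seed = derive_seed(base_seed, chunk_index, cand, attempt)
            audio = generate(text, seed)
            value = score(audio, text)
            if value >= threshold:
                return audio, seed, True
            if best is None or value > best[0]:
                best = (value, audio, seed)
    return best[1], best[2], False
```

Because every seed is a pure function of the base seed and the three indices, rerunning with the same base seed regenerates the same attempts in the same order.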

Whisper Sync & Validation

Backends: Choose OpenAI Whisper or faster-whisper (SYSTRAN), with multiple model sizes (VRAM vs. speed tradeoff).
Per-chunk validation: Each audio chunk is transcribed and compared to its intended text.
Bypass option: Skip Whisper entirely (faster, but may allow more TTS errors).
Use-longest-transcript-on-fail: Optional fallback if no candidate passes validation.
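
One plausible way to compare a transcript with its intended text is a normalized string-similarity ratio. The sketch below uses Python's standard `difflib`; the normalization rules, the function names, and the 0.85 threshold are assumptions, and the transcript itself would come from whichever Whisper backend is selected.

```python
import re
from difflib import SequenceMatcher


def normalize_for_compare(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def similarity(intended: str, transcript: str) -> float:
    """Similarity in [0, 1] between the intended text and the transcript."""
    return SequenceMatcher(None,
                           normalize_for_compare(intended),
                           normalize_for_compare(transcript)).ratio()


def passes_validation(intended: str, transcript: str,
                      threshold: float = 0.85) -> bool:
    """True when the transcript is close enough to the intended text."""
    return similarity(intended, transcript) >= threshold
```

Normalizing first matters because Whisper's punctuation and casing rarely match the input text exactly, even when the speech itself is correct.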

Parallel Processing & Performance

Full parallelism: User-configurable worker count (default 4).
Worker control: Set to 1 for low-memory or debugging, higher for speed.
VRAM management: Clears Whisper model and GPU cache after validation to avoid leaks.
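
A minimal sketch of worker-controlled parallelism, assuming a thread pool and a per-chunk `worker` callable (both are illustrative choices, not necessarily what the script does):

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def process_chunks(chunks: Sequence[T], worker: Callable[[T], R],
                   max_workers: int = 4) -> List[R]:
    """Run `worker` over all chunks with up to `max_workers` threads.

    Executor.map yields results in input order, so the audio pieces can be
    concatenated directly afterwards.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, chunks))
```

Setting `max_workers=1` makes execution effectively sequential, which is the low-memory/debugging mode described above. For the VRAM cleanup step, a common PyTorch pattern is to drop the model reference, call `gc.collect()`, and then `torch.cuda.empty_cache()`.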

Persistent Settings & UI

JSON settings: UI choices are saved/restored automatically; import/export supported.
Per-output settings artifacts: Each output also writes a .settings.json and .settings.csv capturing all parameters and output filenames.
Complete Gradio UI: All options available as toggles, sliders, dropdowns, checkboxes, and file pickers.
Audio preview & download: Listen to or download any generated output from the UI.
Help/Instructions: Accordion with detailed guidance for each setting.
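
The per-output settings artifacts could be written along these lines. This is a hedged sketch: the helper name, the `output_files` key, and the two-column CSV layout are assumptions about a format that is not specified here.

```python
import csv
import json
from pathlib import Path
from typing import Iterable, Tuple


def write_settings_artifacts(output_path: str, settings: dict,
                             output_files: Iterable[str]) -> Tuple[Path, Path]:
    """Write <output>.settings.json and <output>.settings.csv next to an output."""
    base = Path(output_path).with_suffix("")  # drop the audio extension
    record = dict(settings, output_files=list(output_files))

    json_path = Path(f"{base}.settings.json")
    json_path.write_text(json.dumps(record, indent=2), encoding="utf-8")

    csv_path = Path(f"{base}.settings.csv")
    with csv_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["parameter", "value"])
        for key, value in record.items():
            # Nested values are JSON-encoded so each parameter stays on one row.
            cell = json.dumps(value) if isinstance(value, (list, dict)) else value
            writer.writerow([key, cell])
    return json_path, csv_path
```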

🎙️ Voice Conversion (VC) Tab

Convert any voice to sound like another!

The Voice Conversion tab lets you:

Upload or record the input audio (the voice to convert).
Upload or record the target/reference voice (the voice to match).
Adjust pitch (optional).
Click Run Voice Conversion to get a new audio file with the same words in the target voice!

Technical highlights:

Handles long audio by splitting into overlapping chunks and recombining with crossfades.
Output matches the model’s sample rate and fidelity.
Automatic chunking and processing—no manual intervention needed.
Pitch shift control.
Option to disable watermarking.
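
The overlapping-chunk approach can be illustrated with plain Python lists of samples. This is a simplified sketch, assuming a linear crossfade and hypothetical function names; in the real pipeline each chunk would be run through the conversion model between the split and the join.

```python
from typing import List, Sequence


def split_overlapping(samples: Sequence[float], chunk_len: int,
                      overlap: int) -> List[List[float]]:
    """Cut `samples` into chunks of `chunk_len` that share `overlap` samples."""
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must be >= 0 and smaller than chunk_len")
    step = chunk_len - overlap
    chunks: List[List[float]] = []
    start = 0
    while True:
        chunks.append(list(samples[start:start + chunk_len]))
        if start + chunk_len >= len(samples):
            break
        start += step
    return chunks


def crossfade_join(chunks: Sequence[Sequence[float]], overlap: int) -> List[float]:
    """Recombine chunks, blending each overlap region with a linear crossfade."""
    out = list(chunks[0])
    for chunk in chunks[1:]:
        n = min(overlap, len(out), len(chunk))
        tail_start = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # fade-in weight for the incoming chunk
            out[tail_start + i] = out[tail_start + i] * (1 - w) + chunk[i] * w
        out.extend(chunk[n:])
    return out
```

Blending the shared region avoids the audible click that a hard cut between independently converted chunks tends to produce.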

Tips & Troubleshooting

Background noise in output?
- Enable pyrnnoise denoising in the UI to clean up artifacts.
- Denoising runs before Auto-Editor and normalization for best results.
Out of VRAM or slow?
- Lower parallel workers, pick a smaller/faster Whisper model, reduce candidates.
Artifacts/Errors?
- Increase candidates/retries, adjust Auto-Editor threshold/margin, refine sound word replacements.
Choppy audio?
- Increase Auto-Editor margin; lower threshold.
Need reproducible output?
- Set a fixed random seed.

📝 Installation

Requires Python 3.10.x and FFmpeg (on PATH).

Clone the repo:

git clone https://github.com/petermg/Chatterbox-TTS-Extended

Install requirements:

pip install --force-reinstall -r requirements.txt
# If needed, try requirements.base.with.versions.txt or requirements_frozen.txt

Run:

# Use your repo's main file. For example:
python Chatter.py
# or, if your file is named like this branch:
python zChatter.py

If FFmpeg isn’t in your PATH, place the executable alongside the script or add it to PATH.

📣 Feedback & Contributions

Open an issue or pull request for suggestions, bug reports, or improvements!

Known Bugs:

When faster-whisper is used for validation, the process sometimes crashes silently. This appears to come from the faster-whisper model itself rather than from the Python code. If you experience this, switch back to the original Whisper (WhisperSync) model. UPDATE: the latest update may have resolved this bug.
