Orpheus TTS Docker Deployment

English | 简体中文 | 繁體中文 | 日本語

Production-ready Docker deployment for Orpheus TTS with GPU management, multi-access modes, and optimized performance.

✨ Features

🐳 Docker Containerization: One-command deployment with CUDA 12.1 support
🎯 Intelligent GPU Management: Lazy loading + automatic unloading (1-hour timeout)
🌐 Three Access Modes: Web UI, REST API, and MCP (Model Context Protocol)
🚀 Optimized Performance: ~1.4s per request on v2.0.0 (~2.5s on v1.5.0) once the model is loaded
🔒 Production Ready: Nginx reverse proxy with SSL support
🔐 Privacy Protection: Audio files are saved to /tmp/orpheus-tts on the host; none are retained in the container
🎨 Modern Web UI: Dark theme with Chinese/English toggle
📊 API Documentation: Built-in Swagger UI
🎤 8 Voice Options: tara, leah, jess, leo, dan, mia, zac, zoe

🎯 Model Information

v2.0.0 (Current - AWQ 4-bit Quantized)

Model: Hariprasath28/orpheus-3b-4bit-AWQ
Quantization: AWQ 4-bit
Precision: float16
Parameters: 3B (3 billion)
Model Weights: 2.30GB (62% reduction from bfloat16)
VRAM Usage: ~31.5GB (model 2.30GB + KV cache 27.42GB)
Performance:
- Model preload: ~50s (on startup)
- Generation: ~1.4s per request
- Streaming latency: ~200ms
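
The size and VRAM figures above can be cross-checked with a little arithmetic. The sketch below uses only the numbers quoted in this README; the helper names are illustrative. The weight reduction works out to about 62.8%, which the README rounds down to 62%, and roughly 1.8GB of the reported VRAM lies outside the two listed components (the README does not say what occupies it).

```python
# Figures quoted in this README, in GB.
AWQ_WEIGHTS_GB = 2.30
BF16_WEIGHTS_GB = 6.18
KV_CACHE_GB = 27.42
REPORTED_VRAM_GB = 31.5


def weight_reduction_pct(quantized_gb, full_gb):
    """Percentage by which the quantized weights are smaller."""
    return (1 - quantized_gb / full_gb) * 100


def unaccounted_vram_gb(total_gb, *parts_gb):
    """VRAM not explained by the listed components."""
    return total_gb - sum(parts_gb)


reduction = weight_reduction_pct(AWQ_WEIGHTS_GB, BF16_WEIGHTS_GB)
remainder = unaccounted_vram_gb(REPORTED_VRAM_GB, AWQ_WEIGHTS_GB, KV_CACHE_GB)
print(f"weight reduction: {reduction:.1f}%")
print(f"VRAM outside weights + KV cache: {remainder:.2f} GB")
```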

v1.5.0 (bfloat16 Full Precision)

Model: canopylabs/orpheus-3b-0.1-ft
Precision: bfloat16 (full precision)
Parameters: 3B (3 billion)
Model Weights: 6.18GB
VRAM Usage: ~29.8GB (with preloading)
Performance:
- Model preload: ~47s (on startup)
- Generation: ~2.5s per request

🚀 Quick Start

Prerequisites

Docker 20.10+ with nvidia-docker2
NVIDIA GPU with 40GB+ VRAM (e.g., L40S, A100)
CUDA 12.1+ compatible driver
HuggingFace account with access to orpheus-3b-0.1-ft

Method 1: Docker Run (Fastest)

# Set your HuggingFace token
export HF_TOKEN=your_huggingface_token

# Pull and run (v2.0.0 with AWQ 4-bit quantization)
docker pull neosun/orpheus-tts:v2.0.0-allinone

docker run -d \
  --name orpheus-tts \
  --gpus '"device=0"' \
  -p 8899:8899 \
  -e HF_TOKEN=$HF_TOKEN \
  -v /tmp/orpheus-tts:/app/outputs \
  --restart unless-stopped \
  neosun/orpheus-tts:v2.0.0-allinone

# Wait for the model to preload (~50 seconds on v2.0.0)
sleep 60

# Check health
curl http://localhost:8899/health
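
Because startup time depends on model preloading, polling the health endpoint can be more reliable than a fixed sleep. The following is a minimal sketch, assuming only that /health answers with HTTP 200 once the service is ready; the function names, the 180-second deadline, and the 2-second interval are illustrative choices, not part of the project.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP errors all mean "not ready yet".
        return False


def wait_for_health(url, deadline_s=180.0, interval_s=2.0,
                    probe=http_ok, sleep=time.sleep, clock=time.monotonic):
    """Poll `url` until `probe` succeeds or `deadline_s` elapses."""
    start = clock()
    while True:
        if probe(url):
            return True
        if clock() - start >= deadline_s:
            return False
        sleep(interval_s)


# Typical use:
#   ready = wait_for_health("http://localhost:8899/health")
```

The probe, sleep, and clock are injectable so the loop can be exercised without a running container.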

Method 2: Docker Compose (Recommended)

Clone the repository:

git clone https://github.com/neosun100/orpheus-tts-docker.git
cd orpheus-tts-docker

Create .env file:

cp .env.example .env
# Edit .env and set your HF_TOKEN

Start the service:

docker compose up -d

Verify:

# Check container status
docker compose ps

# Check health
curl http://localhost:8899/health

📖 Usage

Web UI

Open your browser and navigate to:

http://localhost:8899

Features:

Text input with voice selection
Real-time audio generation
Download generated audio
Dark theme with language toggle
API Documentation link (📖 API Docs in header)

REST API

Interactive API Documentation

Swagger UI (recommended for testing):

http://localhost:8899/apidocs/

OpenAPI Specification:

http://localhost:8899/apispec_1.json

Complete API Guide: See docs/API_GUIDE.md

Generate Speech

curl -X POST http://localhost:8899/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello world, this is a test.",
    "voice": "tara",
    "model_size": "medium"
  }' \
  --output output.wav
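
The same request can be issued from Python with the standard library. This is a sketch that mirrors the curl call above; it assumes the response body is the WAV audio itself (as the --output flag suggests), and the function names and 120-second timeout are illustrative.

```python
import json
import urllib.request


def build_generate_request(text, voice="tara", model_size="medium",
                           base_url="http://localhost:8899"):
    """Build the POST request for /api/generate with a JSON body."""
    payload = {"text": text, "voice": voice, "model_size": model_size}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_speech(text, out_path="output.wav", **kwargs):
    """Send the request and write the response body to `out_path`."""
    req = build_generate_request(text, **kwargs)
    with urllib.request.urlopen(req, timeout=120) as resp:
        audio = resp.read()
    with open(out_path, "wb") as fh:
        fh.write(audio)
    return out_path


# Typical use:
#   generate_speech("Hello world, this is a test.", voice="tara")
```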

API Documentation

The interactive Swagger UI is served at the address listed under Interactive API Documentation above:

http://localhost:8899/apidocs/

Available Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check |
| `/api/generate` | POST | Generate speech |
| `/api/voices` | GET | List available voices |
| `/api/models` | GET | List available models |
| `/gpu/status` | GET | GPU status |
| `/gpu/offload` | POST | Offload model from GPU |
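
The non-generating endpoints in the table can be wrapped in a small client. The sketch below copies the methods and paths from the table; the response format is not documented here, so `call` returns parsed JSON when the body is JSON and the raw text otherwise. The names `ENDPOINTS`, `build_request`, and `call` are illustrative.

```python
import json
import urllib.request

# (method, path) pairs copied from the endpoint table.
ENDPOINTS = {
    "health": ("GET", "/health"),
    "voices": ("GET", "/api/voices"),
    "models": ("GET", "/api/models"),
    "gpu_status": ("GET", "/gpu/status"),
    "gpu_offload": ("POST", "/gpu/offload"),
}


def build_request(name, base_url="http://localhost:8899"):
    """Build a body-less request for one of the named endpoints."""
    method, path = ENDPOINTS[name]
    return urllib.request.Request(base_url + path, method=method)


def call(name, base_url="http://localhost:8899", timeout=10.0):
    """Call an endpoint and return (status, parsed JSON or raw text)."""
    req = build_request(name, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = resp.read().decode("utf-8", errors="replace")
        try:
            return resp.status, json.loads(body)
        except json.JSONDecodeError:
            return resp.status, body


# Typical use:
#   status, info = call("gpu_status")
#   call("gpu_offload")  # free GPU memory between batches
```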

MCP (Model Context Protocol)

For AI assistants and automation tools:

{
  "mcpServers": {
    "orpheus-tts": {
      "command": "docker",
      "args": ["exec", "-i", "orpheus-tts", "python", "/app/mcp_server.py"]
    }
  }
}

Available MCP tools:

generate_speech: Generate speech from text
get_gpu_status: Check GPU memory usage
offload_gpu: Free GPU memory
list_models: List available models
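
MCP clients invoke these tools with JSON-RPC 2.0 `tools/call` requests, sent over the stdio transport as one JSON message per line after the usual `initialize` handshake. The sketch below only builds such a message; the argument names `text` and `voice` are assumptions borrowed from the REST API, not confirmed against mcp_server.py.

```python
import itertools
import json

_ids = itertools.count(1)  # JSON-RPC request ids


def tool_call_message(tool_name, arguments=None):
    """Build a JSON-RPC 2.0 `tools/call` request as used by MCP."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments or {}},
    }


# Argument names here are assumed, not taken from mcp_server.py.
msg = tool_call_message("generate_speech", {"text": "Hello world", "voice": "tara"})
line = json.dumps(msg)  # one message per line on the stdio transport
print(line)
```

In practice an MCP-aware assistant builds these messages itself from the configuration above; the sketch is only meant to show what travels over the `docker exec` pipe.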

⚙️ Configuration

Environment Variables

| Variable | Default | Description |
|---|---|---|
| `PORT` | 8899 | Service port |
| `GPU_IDLE_TIMEOUT` | 3600 | Model unload timeout (seconds) |
| `NVIDIA_VISIBLE_DEVICES` | 0 | GPU device ID |
| `HF_TOKEN` | - | HuggingFace token (required) |
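
To illustrate what `GPU_IDLE_TIMEOUT` controls, here is a minimal sketch of lazy loading with idle unloading. It is not the project's gpu_manager.py; the class and function names are hypothetical, and a real server would also release GPU memory when dropping the model.

```python
import threading
import time


def idle_timeout_from_env(env, default=3600):
    """Read GPU_IDLE_TIMEOUT (seconds) from a mapping such as os.environ."""
    return int(env.get("GPU_IDLE_TIMEOUT", default))


class IdleModelManager:
    """Load a model on first use and drop it after an idle period."""

    def __init__(self, loader, idle_timeout_s, clock=time.monotonic):
        self._loader = loader
        self._idle_timeout_s = idle_timeout_s
        self._clock = clock
        self._lock = threading.Lock()
        self._model = None
        self._last_used = None

    @property
    def loaded(self):
        return self._model is not None

    def get(self):
        """Return the model, loading it if needed, and refresh the idle timer."""
        with self._lock:
            if self._model is None:
                self._model = self._loader()
            self._last_used = self._clock()
            return self._model

    def unload_if_idle(self):
        """Drop the model if it has been idle for the full timeout."""
        with self._lock:
            if self._model is None:
                return False
            if self._clock() - self._last_used < self._idle_timeout_s:
                return False
            self._model = None  # a real server would also free GPU memory here
            return True
```

A background thread or scheduler would call `unload_if_idle()` periodically; with the default of 3600 seconds the model stays resident for an hour after the last request.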

docker-compose.yml

version: '3.8'

services:
  orpheus-tts:
    image: neosun/orpheus-tts:v2.0.0-allinone
    container_name: orpheus-tts
    environment:
      - PORT=${PORT:-8899}
      - GPU_IDLE_TIMEOUT=${GPU_IDLE_TIMEOUT:-3600}
      - HF_TOKEN=${HF_TOKEN}
    ports:
      - "0.0.0.0:${PORT:-8899}:${PORT:-8899}"
    volumes:
      - /tmp/orpheus-tts:/app/outputs
      - huggingface_cache:/root/.cache/huggingface
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['${NVIDIA_VISIBLE_DEVICES:-0}']
              capabilities: [gpu]

volumes:
  huggingface_cache:

📁 Project Structure

orpheus-tts-docker/
├── Dockerfile              # Container definition
├── docker-compose.yml      # Orchestration config
├── server.py              # Flask web server
├── mcp_server.py          # MCP interface
├── gpu_manager.py         # GPU management
├── requirements.txt       # Python dependencies
├── .env.example           # Environment template
├── outputs/               # Generated audio files
└── docs/                  # Documentation
    ├── ARCHITECTURE.md
    ├── DOCKER_DEPLOYMENT.md
    ├── MCP_GUIDE.md
    └── QUANTIZED_MODELS.md

🛠️ Tech Stack

Base: Python 3.10, CUDA 12.1
ML Framework: PyTorch 2.5.1, vLLM 0.7.3
Web Framework: Flask 3.0.0
Model: Orpheus TTS (Hariprasath28/orpheus-3b-4bit-AWQ in v2.0.0; canopylabs/orpheus-3b-0.1-ft in v1.5.0)
Container: Docker, Docker Compose
GPU: NVIDIA CUDA with nvidia-docker2

🔧 Advanced Usage

Custom GPU Selection

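# Use GPU 2 (current v2.0.0 image)
docker run -d \
  --name orpheus-tts \
  --gpus '"device=2"' \
  -p 8899:8899 \
  -e HF_TOKEN=$HF_TOKEN \
  -e NVIDIA_VISIBLE_DEVICES=2 \
  -v /tmp/orpheus-tts:/app/outputs \
  neosun/orpheus-tts:v2.0.0-allinone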

Adjust Memory Usage

Edit server.py to change gpu_memory_utilization:

def load_model(model_name):
    return OrpheusModel(
        model_name=MODEL_CONFIGS[model_name], 
        max_model_len=2048,
        gpu_memory_utilization=0.6  # Reduce from 0.7 to 0.6
    )
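
In vLLM, `gpu_memory_utilization` is the fraction of the card's memory the engine may claim, so lowering it shrinks the space left for the KV cache. The sketch below estimates that effect; the 48GB card size is an assumption for illustration, the 2.30GB weight figure comes from Model Information, and the function names are hypothetical. Real usage will differ because of activation memory and other overhead.

```python
def vram_budget_gb(total_vram_gb, gpu_memory_utilization):
    """VRAM the engine may claim: a fraction of the card's total memory."""
    if not 0 < gpu_memory_utilization <= 1:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return total_vram_gb * gpu_memory_utilization


def kv_cache_room_gb(total_vram_gb, gpu_memory_utilization,
                     weights_gb, overhead_gb=0.0):
    """Rough space left for the KV cache after weights and other overhead."""
    budget = vram_budget_gb(total_vram_gb, gpu_memory_utilization)
    return budget - weights_gb - overhead_gb


# Assumed 48GB card; weights from the v2.0.0 figures in this README.
for util in (0.7, 0.6, 0.5):
    budget = vram_budget_gb(48, util)
    room = kv_cache_room_gb(48, util, weights_gb=2.30)
    print(f"util={util}: budget {budget:.1f} GB, ~{room:.1f} GB for KV cache")
```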

Production Deployment with Nginx

See DOCKER_DEPLOYMENT.md for Nginx reverse proxy setup with SSL.

📊 Performance Benchmarks

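| Metric | Value |
|---|---|
| First request | ~48 seconds |
| Subsequent requests | ~2.5 seconds |
| Streaming latency | ~200ms |
| Max concurrency at 2048 tokens per request | 148.42x |
| VRAM usage | ~39GB |
| Model loading time | ~15 seconds |

The 48-second first request and 39GB VRAM figures match the v1.0.0 values cited in the Changelog. For v2.0.0 and v1.5.0 numbers, see Model Information.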

🐛 Troubleshooting

CUDA Out of Memory

Check GPU availability:

nvidia-smi

Reduce memory usage:

Lower gpu_memory_utilization to 0.6 or 0.5
Reduce max_model_len to 1024

HuggingFace Access Denied

Request access at: https://huggingface.co/canopylabs/orpheus-3b-0.1-ft
Verify your token: https://huggingface.co/settings/tokens
Ensure token has read permissions

Container Won't Start

# Check logs
docker logs orpheus-tts

# Check GPU access
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Fork the repository
Create your feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

📝 Changelog

v2.0.0 (2025-12-14)

✅ AWQ 4-bit quantization (62% model weight reduction)
✅ Model preloading on startup (~50s load time)
✅ Fast generation: 1.4s per request
✅ Model weights: 2.30GB (vs 6.18GB bfloat16)
✅ VRAM usage: 31.5GB (model 2.30GB + KV cache 27.42GB)
✅ Privacy protection: host volume mount /tmp/orpheus-tts
✅ Docker Hub image: neosun/orpheus-tts:v2.0.0-allinone
✅ Digest: sha256:686a55ef49a607bad0ba2bda472cb54cb5846af3609b2b8f2bfd2a251546f077

v1.5.0 (2025-12-14)

✅ Model preloading on startup (26x faster first request)
✅ Zero-shot voice cloning UI with file upload
✅ Generation timing display (model load, generation, total)
✅ Privacy protection: host volume mount /tmp/orpheus-tts
✅ Performance: 3.7s generation (was 48s in v1.0)
✅ Memory optimization: 29.8GB VRAM (was 39GB)
✅ Docker Hub image: neosun/orpheus-tts:v1.5.0-allinone

v1.0.0 (2025-12-13)

✅ Initial Docker deployment
✅ GPU management with lazy loading
✅ Three access modes (Web UI, REST API, MCP)
✅ Nginx reverse proxy support
✅ Performance optimization (gpu_memory_utilization=0.7)
✅ Docker Hub image: neosun/orpheus-tts:v1.0.0-allinone

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.
