Production-ready Docker deployment for Orpheus TTS with GPU management, multi-access modes, and optimized performance.
- 🐳 Docker Containerization: One-command deployment with CUDA 12.1 support
- 🎯 Intelligent GPU Management: Lazy loading + automatic unloading (1-hour timeout)
- 🌐 Three Access Modes: Web UI, REST API, and MCP (Model Context Protocol)
- 🚀 Optimized Performance: ~2.5s inference after model loading
- 🔒 Production Ready: Nginx reverse proxy with SSL support
- 🔐 Privacy Protection: All audio files saved to host /tmp/orpheus-tts, no data retained in container
- 🎨 Modern Web UI: Dark theme with Chinese/English toggle
- 📊 API Documentation: Built-in Swagger UI
- 🎤 8 Voice Options: tara, leah, jess, leo, dan, mia, zac, zoe
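
The lazy loading and idle unloading described in the feature list can be pictured with a small sketch. This is an illustration only: `IdleModelManager`, its methods, and the injectable `loader` and `clock` are hypothetical names and are not taken from gpu_manager.py. The only value borrowed from this README is the 3600-second default timeout.

```python
import time
from typing import Any, Callable, Optional


class IdleModelManager:
    """Lazy-load a model on first use and drop it after an idle period.

    Hypothetical sketch: names are illustrative, not the project's API.
    """

    def __init__(
        self,
        loader: Callable[[], Any],
        idle_timeout: float = 3600.0,  # 1 hour, the documented default
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._idle_timeout = idle_timeout
        self._clock = clock
        self._model: Optional[Any] = None
        self._last_used = 0.0

    @property
    def loaded(self) -> bool:
        return self._model is not None

    def get(self) -> Any:
        """Return the model, loading it on the first call (lazy loading)."""
        if self._model is None:
            self._model = self._loader()
        self._last_used = self._clock()
        return self._model

    def unload_if_idle(self) -> bool:
        """Release the model if it has been unused for at least the timeout."""
        if self._model is None:
            return False
        if self._clock() - self._last_used < self._idle_timeout:
            return False
        # Dropping the reference is only the Python side; a real server would
        # also have to release framework-held GPU memory at this point.
        self._model = None
        return True
```

In a server, `unload_if_idle()` would typically run from a background timer, and both methods would need a lock if requests are handled on several threads.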
- Model: Hariprasath28/orpheus-3b-4bit-AWQ
- Quantization: AWQ 4-bit
- Precision: float16
- Parameters: 3B (3 billion)
- Model Weights: 2.30GB (62% reduction from bfloat16)
- VRAM Usage: ~31.5GB (model 2.30GB + KV cache 27.42GB)
- Performance:
- Model preload: ~50s (on startup)
- Generation: ~1.4s per request
- Streaming latency: ~200ms
- Model: canopylabs/orpheus-3b-0.1-ft
- Precision: bfloat16 (full precision)
- Parameters: 3B (3 billion)
- Model Weights: 6.18GB
- VRAM Usage: ~29.8GB (with preloading)
- Performance:
- Model preload: ~47s (on startup)
- Generation: ~2.5s per request
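
The figures in the two model profiles above can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted in this README. Attributing the remainder to activations and runtime overhead is an assumption, not something the README states.

```python
# Figures quoted in the model profiles above (GB).
BF16_WEIGHTS_GB = 6.18
AWQ_WEIGHTS_GB = 2.30
AWQ_KV_CACHE_GB = 27.42
AWQ_TOTAL_VRAM_GB = 31.5


def reduction_percent(before: float, after: float) -> float:
    """Percentage saved when going from `before` to `after`."""
    return (before - after) / before * 100.0


weight_saving = reduction_percent(BF16_WEIGHTS_GB, AWQ_WEIGHTS_GB)
accounted = AWQ_WEIGHTS_GB + AWQ_KV_CACHE_GB
# Assumption: whatever is left is activations and runtime overhead.
unaccounted = AWQ_TOTAL_VRAM_GB - accounted

print(f"weight reduction: {weight_saving:.1f}%")
print(f"weights + KV cache: {accounted:.2f} GB")
print(f"remainder of reported VRAM: {unaccounted:.2f} GB")
```

The weights and KV cache sum to slightly less than the reported total, so the "~31.5GB" figure includes a small amount of memory beyond those two items.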
- Docker 20.10+ with nvidia-docker2
- NVIDIA GPU with 40GB+ VRAM (e.g., L40S, A100)
- CUDA 12.1+ compatible driver
- HuggingFace account with access to orpheus-3b-0.1-ft
# Set your HuggingFace token
export HF_TOKEN=your_huggingface_token
# Pull and run (v2.0.0 with AWQ 4-bit quantization)
docker pull neosun/orpheus-tts:v2.0.0-allinone
docker run -d \
--name orpheus-tts \
--gpus '"device=0"' \
-p 8899:8899 \
-e HF_TOKEN=$HF_TOKEN \
-v /tmp/orpheus-tts:/app/outputs \
--restart unless-stopped \
neosun/orpheus-tts:v2.0.0-allinone
# Wait for service to start (~30 seconds)
sleep 30
# Check health
curl http://localhost:8899/health

- Clone the repository:
git clone https://github.com/neosun100/orpheus-tts-docker.git
cd orpheus-tts-docker
- Create .env file:
cp .env.example .env
# Edit .env and set your HF_TOKEN
- Start the service:
docker compose up -d
- Verify:
# Check container status
docker compose ps
# Check health
curl http://localhost:8899/health

Open your browser and navigate to:
http://localhost:8899
Features:
- Text input with voice selection
- Real-time audio generation
- Download generated audio
- Dark theme with language toggle
- API Documentation link (📖 API Docs in header)
Swagger UI (recommended for testing):
http://localhost:8899/apidocs/
OpenAPI Specification:
http://localhost:8899/apispec_1.json
Complete API Guide: See docs/API_GUIDE.md
curl -X POST http://localhost:8899/api/generate \
-H "Content-Type: application/json" \
-d '{
"text": "Hello world, this is a test.",
"voice": "tara",
"model_size": "medium"
}' \
--output output.wav

Interactive Swagger UI available at:
http://localhost:8899/docs
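
For callers that prefer Python over curl, the request to /api/generate shown above could be sent with the standard library alone. This is a minimal sketch: the function names are made up for illustration, and it assumes the response body is the WAV audio itself, as the curl `--output output.wav` usage suggests.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8899"
# Voice names as listed in this README.
VOICES = ("tara", "leah", "jess", "leo", "dan", "mia", "zac", "zoe")


def build_generate_request(text: str, voice: str = "tara",
                           model_size: str = "medium",
                           base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST request for /api/generate without sending it."""
    if voice not in VOICES:
        raise ValueError(f"unknown voice {voice!r}; expected one of {VOICES}")
    payload = {"text": text, "voice": voice, "model_size": model_size}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_to_file(text: str, path: str = "output.wav", **kwargs) -> str:
    """Send the request and write the response body to `path`.

    Assumption: the body is raw audio; the README does not document it.
    """
    request = build_generate_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        audio = response.read()
    with open(path, "wb") as fh:
        fh.write(audio)
    return path


# Requires the service to be running:
# generate_to_file("Hello world, this is a test.", voice="tara")
```

Validating the voice name on the client side avoids a round trip for an obvious typo; the server may apply its own checks as well.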
| Endpoint | Method | Description |
|---|---|---|
| /health | GET | Health check |
| /api/generate | POST | Generate speech |
| /api/voices | GET | List available voices |
| /api/models | GET | List available models |
| /gpu/status | GET | GPU status |
| /gpu/offload | POST | Offload model from GPU |
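
The /health endpoint in the table above can replace a fixed `sleep 30` in deployment scripts. The sketch below polls it until it answers or a deadline passes. It assumes the endpoint returns HTTP 200 once the service is ready; the README does not document the response body, so the body is ignored. All function names and default timings are illustrative.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_healthy(
    url: str = "http://localhost:8899/health",
    deadline_s: float = 120.0,
    interval_s: float = 2.0,
    probe: Callable[[str], bool] = http_ok,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `url` until the probe succeeds or the deadline passes."""
    give_up_at = clock() + deadline_s
    while True:
        if probe(url):
            return True
        if clock() >= give_up_at:
            return False
        sleep(interval_s)
```

The `probe`, `sleep`, and `clock` parameters exist only so the loop can be exercised without a running container.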
For AI assistants and automation tools:
{
"mcpServers": {
"orpheus-tts": {
"command": "docker",
"args": ["exec", "-i", "orpheus-tts", "python", "/app/mcp_server.py"]
}
}
}

Available MCP tools:
- generate_speech: Generate speech from text
- get_gpu_status: Check GPU memory usage
- offload_gpu: Free GPU memory
- list_models: List available models
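
If the container was started under a different `--name`, the MCP client entry has to point at that name. A small helper can generate the same JSON shown above for any container name. The helper is hypothetical; only the structure and default values come from this README.

```python
import json


def mcp_server_config(container: str = "orpheus-tts",
                      script: str = "/app/mcp_server.py") -> dict:
    """Return the MCP client entry that runs the server inside the container."""
    return {
        "mcpServers": {
            "orpheus-tts": {
                "command": "docker",
                "args": ["exec", "-i", container, "python", script],
            }
        }
    }


# Print the default entry, ready to paste into an MCP client config.
print(json.dumps(mcp_server_config(), indent=2))
```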
| Variable | Default | Description |
|---|---|---|
| PORT | 8899 | Service port |
| GPU_IDLE_TIMEOUT | 3600 | Model unload timeout (seconds) |
| NVIDIA_VISIBLE_DEVICES | 0 | GPU device ID |
| HF_TOKEN | - | HuggingFace token (required) |
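
To make the defaults in the table concrete, here is one way a Python process could resolve these variables. It is a sketch under the assumption that the values are read from the process environment; `Settings` and `load_settings` are hypothetical names and do not describe how server.py actually reads its configuration.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    port: int
    gpu_idle_timeout: int  # seconds
    nvidia_visible_devices: str
    hf_token: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Resolve the documented variables, applying the documented defaults."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        # HF_TOKEN is the only variable without a default.
        raise RuntimeError("HF_TOKEN is required but not set")
    return Settings(
        port=int(env.get("PORT", "8899")),
        gpu_idle_timeout=int(env.get("GPU_IDLE_TIMEOUT", "3600")),
        nvidia_visible_devices=env.get("NVIDIA_VISIBLE_DEVICES", "0"),
        hf_token=token,
    )
```

Failing fast on a missing token surfaces the problem at startup rather than at the first model download.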
version: '3.8'
services:
  orpheus-tts:
    image: neosun/orpheus-tts:v2.0.0-allinone
    container_name: orpheus-tts
    environment:
      - PORT=${PORT:-8899}
      - GPU_IDLE_TIMEOUT=${GPU_IDLE_TIMEOUT:-3600}
      - HF_TOKEN=${HF_TOKEN}
    ports:
      - "0.0.0.0:${PORT:-8899}:${PORT:-8899}"
    volumes:
      - /tmp/orpheus-tts:/app/outputs
      - huggingface_cache:/root/.cache/huggingface
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['${NVIDIA_VISIBLE_DEVICES:-0}']
              capabilities: [gpu]
volumes:
  huggingface_cache:

orpheus-tts-docker/
├── Dockerfile # Container definition
├── docker-compose.yml # Orchestration config
├── server.py # Flask web server
├── mcp_server.py # MCP interface
├── gpu_manager.py # GPU management
├── requirements.txt # Python dependencies
├── .env.example # Environment template
├── outputs/ # Generated audio files
└── docs/ # Documentation
    ├── ARCHITECTURE.md
    ├── DOCKER_DEPLOYMENT.md
    ├── MCP_GUIDE.md
    └── QUANTIZED_MODELS.md
- Base: Python 3.10, CUDA 12.1
- ML Framework: PyTorch 2.5.1, vLLM 0.7.3
- Web Framework: Flask 3.0.0
- Model: Orpheus TTS (canopylabs/orpheus-3b-0.1-ft)
- Container: Docker, Docker Compose
- GPU: NVIDIA CUDA with nvidia-docker2
# Use GPU 2
docker run -d \
--gpus '"device=2"' \
-e NVIDIA_VISIBLE_DEVICES=2 \
neosun/orpheus-tts:v1.0.0-allinone

Edit server.py to change gpu_memory_utilization:
def load_model(model_name):
    return OrpheusModel(
        model_name=MODEL_CONFIGS[model_name],
        max_model_len=2048,
        gpu_memory_utilization=0.6  # Reduce from 0.7 to 0.6
    )

See DOCKER_DEPLOYMENT.md for Nginx reverse proxy setup with SSL.
| Metric | Value |
|---|---|
| First Request | ~48 seconds |
| Subsequent Requests | ~2.5 seconds |
| Streaming Latency | ~200ms |
| Maximum Concurrency | 148.42x (at 2048 tokens per request) |
| VRAM Usage | ~39GB |
| Model Loading Time | ~15 seconds |
- Check GPU availability:
nvidia-smi
- Reduce memory usage:
  - Lower gpu_memory_utilization to 0.6 or 0.5
  - Reduce max_model_len to 1024
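
To gauge how far to lower gpu_memory_utilization, it helps to remember that vLLM treats it as the fraction of total GPU memory it may claim. The sketch below turns that into a rough budget. The 48 GB card is a made-up example, and the KV-cache estimate simply subtracts the weights from the budget and ignores activation and runtime overhead, so treat the numbers as an upper bound rather than a prediction.

```python
def vllm_memory_budget_gb(total_vram_gb: float,
                          gpu_memory_utilization: float) -> float:
    """VRAM that vLLM may claim: a fixed fraction of the GPU's total memory."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return total_vram_gb * gpu_memory_utilization


def kv_cache_upper_bound_gb(budget_gb: float, weights_gb: float) -> float:
    """Budget left for KV cache after weights (overhead ignored: assumption)."""
    return max(budget_gb - weights_gb, 0.0)


TOTAL_VRAM_GB = 48.0   # hypothetical card size, adjust to your GPU
AWQ_WEIGHTS_GB = 2.30  # weight size quoted in this README

for utilization in (0.7, 0.6, 0.5):
    budget = vllm_memory_budget_gb(TOTAL_VRAM_GB, utilization)
    kv = kv_cache_upper_bound_gb(budget, AWQ_WEIGHTS_GB)
    print(f"utilization={utilization}: budget {budget:.1f} GB, KV cache <= {kv:.1f} GB")
```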
- Request access at: https://huggingface.co/canopylabs/orpheus-3b-0.1-ft
- Verify your token: https://huggingface.co/settings/tokens
- Ensure token has read permissions
# Check logs
docker logs orpheus-tts
# Check GPU access
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
- ✅ AWQ 4-bit quantization (62% model weight reduction)
- ✅ Model preloading on startup (~50s load time)
- ✅ Fast generation: 1.4s per request
- ✅ Model weights: 2.30GB (vs 6.18GB bfloat16)
- ✅ VRAM usage: 31.5GB (model 2.30GB + KV cache 27.42GB)
- ✅ Privacy protection: host volume mount /tmp/orpheus-tts
- ✅ Docker Hub image: neosun/orpheus-tts:v2.0.0-allinone
- ✅ Digest: sha256:686a55ef49a607bad0ba2bda472cb54cb5846af3609b2b8f2bfd2a251546f077
- ✅ Model preloading on startup (26x faster first request)
- ✅ Zero-shot voice cloning UI with file upload
- ✅ Generation timing display (model load, generation, total)
- ✅ Privacy protection: host volume mount /tmp/orpheus-tts
- ✅ Performance: 3.7s generation (was 48s in v1.0)
- ✅ Memory optimization: 29.8GB VRAM (was 39GB)
- ✅ Docker Hub image: neosun/orpheus-tts:v1.5.0-allinone
- ✅ Initial Docker deployment
- ✅ GPU management with lazy loading
- ✅ Three access modes (Web UI, REST API, MCP)
- ✅ Nginx reverse proxy support
- ✅ Performance optimization (gpu_memory_utilization=0.7)
- ✅ Docker Hub image: neosun/orpheus-tts:v1.0.0-allinone
This project is licensed under the MIT License - see the LICENSE file for details.
- Canopy Labs for the amazing Orpheus TTS model
- vLLM for efficient inference
- Original Orpheus TTS: https://github.com/canopyai/Orpheus-TTS
Made with ❤️ by the community
