Music Video Automator 🎵🎬

An agentic pipeline that transforms a simple text prompt into a fully synchronized, structurally intelligent music video using high-performance open-source AI models.

🌟 Features

Agentic Orchestration: Uses Qwen 3.6-35B as a "Director" to plan lyrics and scenes.
Smart Music Generation: ACE-Step 1.5 generates the master audio track.
Perfect Synchronization: Qwen3-ASR extracts timestamps to align visuals with the rhythm.
Dynamic Video Production: LTX-2.3 generates video clips conditioned on audio for perfect beat-matching.
Interactive Web UI: A stylish, row-based "vertical timeline" that allows you to edit and regenerate from any stage.
High Efficiency: GGUF-based models allow the entire pipeline to run on consumer-grade hardware (e.g., RTX 3090/4090).

🚀 Quick Start

1. Prerequisites

NVIDIA GPU with 24GB+ VRAM recommended.
uv installed for fast Python management.

2. Setup

# Clone the repository and install dependencies
uv venv
source .venv/bin/activate
uv pip install -r requirements.txt

3. Launch the API

The backend handles model loading, persistence, and the execution pipeline.

python -m uvicorn app.main:app --host 0.0.0.0 --port 8000

4. Access the UI

Open your browser to: http://127.0.0.1:8000/static/index.html

🛠 Usage

Using the Web Interface

Enter Prompt: Provide a Topic and a Musical Style.
Monitor: Watch the vertical timeline as the AI "Brain," "Ear," and "Eye" complete each stage.
Iterate:
- Don't like the lyrics? Click Edit, change them, and hit Regenerate From Here.
- The system will only restart the necessary downstream steps (Song -> Video).

Using the CLI

You can also trigger projects from the command line:

./.venv/bin/python cli.py start "A space odyssey" "Cinematic synthwave" --quality low

🏗 Pipeline Stages

Lyrics Generation: Qwen 3.6-35B writes themed lyrics with structural markers.
Music Composition: ACE-Step 1.5 plans and synthesizes a master WAV file.
Timestamp Extraction: Qwen3-ASR aligns the song and lyrics with millisecond precision.
Scene Breakdown: The Director groups timestamps into descriptive visual chapters.
Video Production: LTX-2.3 produces clips of varying lengths (8n+1 frames) conditioned on the audio's rhythm.
Final Mastering: FFmpeg merges scenes and muxes the high-fidelity master audio.

📦 Model Stack (GGUF)

All models are automatically downloaded from Hugging Face on first run:

Lyrics/Director: unsloth/Qwen3.6-35B-A3B-GGUF
Music: Serveurperso/ACE-Step-1.5-GGUF
Video: unsloth/LTX-2.3-GGUF
ASR: Qwen/Qwen3-ASR-1.7B

📝 Documentation

status.md: Project overview and completed features.
problems.md: Current limitations and known issues.
improvements.md: Future roadmap and optimizations.

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
app		app
static		static
.gitignore		.gitignore
README.md		README.md
cli.py		cli.py
improvements.md		improvements.md
install_cuda.sh		install_cuda.sh
install_rocm.sh		install_rocm.sh
plan.md		plan.md
problems.md		problems.md
requirements.txt		requirements.txt
start_server.sh		start_server.sh
status.md		status.md
web-plan.md		web-plan.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Music Video Automator 🎵🎬

🌟 Features

🚀 Quick Start

1. Prerequisites

2. Setup

3. Launch the API

4. Access the UI

🛠 Usage

Using the Web Interface

Using the CLI

🏗 Pipeline Stages

📦 Model Stack (GGUF)

📝 Documentation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Music Video Automator 🎵🎬

🌟 Features

🚀 Quick Start

1. Prerequisites

2. Setup

3. Launch the API

4. Access the UI

🛠 Usage

Using the Web Interface

Using the CLI

🏗 Pipeline Stages

📦 Model Stack (GGUF)

📝 Documentation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages