An agentic pipeline that transforms a simple text prompt into a fully synchronized, structurally intelligent music video using high-performance open-source AI models.
- Agentic Orchestration: Uses
Qwen 3.6-35Bas a "Director" to plan lyrics and scenes. - Smart Music Generation:
ACE-Step 1.5generates the master audio track. - Perfect Synchronization:
Qwen3-ASRextracts timestamps to align visuals with the rhythm. - Dynamic Video Production:
LTX-2.3generates video clips conditioned on audio for perfect beat-matching. - Interactive Web UI: A stylish, row-based "vertical timeline" that allows you to edit and regenerate from any stage.
- High Efficiency: GGUF-based models allow the entire pipeline to run on consumer-grade hardware (e.g., RTX 3090/4090).
- NVIDIA GPU with 24GB+ VRAM recommended.
- uv installed for fast Python management.
# Clone the repository and install dependencies
uv venv
source .venv/bin/activate
uv pip install -r requirements.txtThe backend handles model loading, persistence, and the execution pipeline.
python -m uvicorn app.main:app --host 0.0.0.0 --port 8000Open your browser to:
http://127.0.0.1:8000/static/index.html
- Enter Prompt: Provide a Topic and a Musical Style.
- Monitor: Watch the vertical timeline as the AI "Brain," "Ear," and "Eye" complete each stage.
- Iterate:
- Don't like the lyrics? Click Edit, change them, and hit Regenerate From Here.
- The system will only restart the necessary downstream steps (Song -> Video).
You can also trigger projects from the command line:
./.venv/bin/python cli.py start "A space odyssey" "Cinematic synthwave" --quality low- Lyrics Generation:
Qwen 3.6-35Bwrites themed lyrics with structural markers. - Music Composition:
ACE-Step 1.5plans and synthesizes a master WAV file. - Timestamp Extraction:
Qwen3-ASRaligns the song and lyrics with millisecond precision. - Scene Breakdown: The Director groups timestamps into descriptive visual chapters.
- Video Production:
LTX-2.3produces clips of varying lengths(8n+1 frames)conditioned on the audio's rhythm. - Final Mastering:
FFmpegmerges scenes and muxes the high-fidelity master audio.
All models are automatically downloaded from Hugging Face on first run:
- Lyrics/Director:
unsloth/Qwen3.6-35B-A3B-GGUF - Music:
Serveurperso/ACE-Step-1.5-GGUF - Video:
unsloth/LTX-2.3-GGUF - ASR:
Qwen/Qwen3-ASR-1.7B
- status.md: Project overview and completed features.
- problems.md: Current limitations and known issues.
- improvements.md: Future roadmap and optimizations.