In this video, I introduce V2 of my "Ultimate Qwen3 TTS Workflow" for ComfyUI. The major update is full integration with RVC (Retrieval-based Voice Conversion) to create what I call "Director's Mode".
Standard voice cloning often has a problem: it either copies the reference audio's accent or produces a monotone delivery. With this V2 workflow, you can separate the "Acting" from the "Timbre". You use Qwen3 to direct the performance (emotion, pacing, whispers) and RVC to apply the specific character voice on top.
🚀 What's New in V2:
Director's Mode: Full control over emotion and intonation while keeping the target voice character.
RVC Integration: Load models, .index files, and tweak Pitch/Index directly in ComfyUI.
Low VRAM Optimized: Runs smoothly on a GTX 1060 (6GB).
Bypass Group: Easily disable the RVC module to save resources when designing the base voice.
Smart Settings: How to use LLMs (like ChatGPT/Gemini) to find the perfect RVC settings for any model.
📂 Download the Workflow:
🔗 CivitAI: Link in the pinned comment
🛠️ Tools Used:
ComfyUI
Qwen2-Audio-7B-Instruct (via Qwen3 nodes)
RVC (Retrieval-based Voice Conversion)
⏱️ Timestamps:
0:00 - Intro: V1 vs. V2 & The "Sad/Aggressive" Problem
0:33 - The Solution: How "Director's Mode" Works
1:10 - Demo: The Flaw of Standard Cloning (Accent/Monotone)
1:50 - Step 1: Directing the "Acting" with Qwen3
3:05 - Step 2: Setting up RVC (Pro Tip using ChatGPT)
4:49 - The Result: Perfect Acting + Target Voice
6:00 - Final Settings & Low VRAM Tips
If you enjoyed this workflow, please leave a review ⚡ on CivitAI and subscribe for more updates! 👍