Multi-GradSpeech: Towards Diffusion-based Multi-Speaker Text-to-speech Using Consistent Diffusion Models
Heyang Xue, Shuai Guo, Pengcheng Zhu, Mengxiao Bi

Fuxi AI Lab, NetEase Inc., Hangzhou, China
Although imperfect score matching causes the training and sampling distributions of diffusion models to drift apart, recent advances in diffusion-based acoustic models, with Grad-TTS as a prime example, have revolutionized data-sufficient single-speaker Text-to-Speech (TTS). In practice, however, this sampling drift makes these approaches struggle in multi-speaker scenarios, where the target data distribution is more complex than in the single-speaker case. In this paper, we present Multi-GradSpeech, a multi-speaker diffusion-based acoustic model that adopts the Consistent Diffusion Model (CDM) as its generative modeling approach. We enforce the consistency property of CDM during training to alleviate sampling drift at inference, resulting in significant improvements in multi-speaker TTS performance. Our experimental results show that the proposed approach improves performance across the different speakers involved in multi-speaker TTS compared to Grad-TTS, and even outperforms the fine-tuning approach.
Experimental results on an internal multi-speaker Mandarin speech corpus, where Speaker-S and Speaker-I denote the data-sufficient and data-insufficient speaker, respectively.
| Models | Results |
|---|---|
| (0) Recording |  |
| (1) Grad-TTS Single-Speaker Speaker-S |  |
| (2) Grad-TTS Multi-Speaker Speaker-S |  |
| (3) Grad-TTS Fine-tune Speaker-S |  |
| (4) Multi-GradSpeech Single-Speaker Speaker-S |  |
| (5) Multi-GradSpeech Multi-Speaker Speaker-S |  |
| (6) Multi-GradSpeech Fine-tune Speaker-S |  |
| (7) Grad-TTS Single-Speaker Speaker-I |  |
| (8) Grad-TTS Multi-Speaker Speaker-I |  |
| (9) Grad-TTS Fine-tune Speaker-I |  |
| (10) Multi-GradSpeech Single-Speaker Speaker-I |  |
| (11) Multi-GradSpeech Multi-Speaker Speaker-I |  |
| (12) Multi-GradSpeech Fine-tune Speaker-I |  |
Email: geminiwelkin@gmail.com, xueheyang@corp.netease.com