MoE-TTS: Enhancing Out-of-Domain Text Understanding for Description-based TTS via Mixture-of-Experts

Paper Information

Authors

Heyang Xue, Xuchen Song†, Yu Tang, Jianyu Chen, Yanru Chen, Yang Li, Yahui Zhou

Kunlun Inc.

Email: {heyang.xue, xuchen.song}@kunlun-inc.com

† Corresponding author

Abstract

Description-based text-to-speech (TTS) models exhibit strong performance on in-domain text descriptions, i.e., those encountered during training. However, in real-world applications, the diverse range of user-generated descriptions inevitably introduces numerous out-of-domain inputs that challenge the text understanding capabilities of these systems. To address this issue, we propose MoE-TTS, a description-based TTS model designed to enhance the understanding of out-of-domain text descriptions. MoE-TTS employs a modality-based mixture-of-experts (MoE) approach to augment a pre-trained textual large language model (LLM) with a set of specialized weights adapted to the speech modality while maintaining the original LLM forzen during training. This approach allows MoE-TTS to effectively leverage the pre-trained knowledge and text understanding abilities of textual LLMs. Our experimental results indicate that: first, even the most advanced closed-source commercial products can be challenged by carefully designed out-of-domain description test sets; second, MoE-TTS achieves superior performance in generating speech that more accurately reflects the descriptions.

Out-of-domain Descriptions Showcases

Note: As of early August 2025, the ElevenLabs multilingual_v2 API was used, as the alpha v3 remained inaccessible.

Case 1: Middle-aged Puppet Marriage Counselor

Description: Middle-aged puppet marriage counselor, unusually calm, with a silky smooth yet hollow contralto voice, speaking at a suffocating pace.

Chinese translated description: 中年傀儡婚姻顾问，异常平静，丝绸般平滑却空洞的女低音，语速慢得令人窒息。

Text to be synthesized: "Relationships erode over centuries like mountains shaping valleys but sometimes you must weather the same storm together to grow stronger."

MoE-TTS

ElevenLabs

MiniMax

Comments: The contralto voice implies that the gender of the generated result should be female. MoE-TTS understood this implication, but ElevenLabs and MiniMax did not.

Case 2: US Actor with New York Accent

Description: US actor with a New York accent, versatile, articulate, with a dynamic pace, full of charm and charisma, attracting the attention of the audience.

Chinese translated description: 美国男演员，带有纽约口音，多才多艺，表达能力强，语速富有活力，充满魅力与感染力，吸引着听众的注意。

Text to be synthesized: "Ay! Macbeth's soliloquy isn't words, it's blood and thunder in the mouth!"

MoE-TTS

ElevenLabs

MiniMax

Comments: MoE-TTS shows off its strongest expressiveness and aligns more closely with the descriptions.

Case 3: Iron-Willed Strategist

Description: Iron-Willed Strategist, Female, mature (35-50), deep and resonant voice, commanding tone with biting wit no-nonsense, formidable, and effortlessly dominant.

Chinese translated description: 钢铁意志战略家，女性，成熟（35-50岁），声音低沉洪亮，语气强势，略带尖刻的智慧——严肃认真、令人敬畏，且不费吹灰之力便能掌控全局。

Text to be synthesized: "Victory isn't debated, it's seized. Every second spent doubting is a gift to your enemies. Do I make myself clear?"

MoE-TTS

ElevenLabs

MiniMax

Comments: MoE-TTS emphasizes key words in sentences (gift!). Overall, it is more in line with a deep, resonant voice, a strong tone, and a serious, awe-inspiring demeanor.

Case 4: Talking Taser

Description: Talking Taser, Female, 20-35, chipmunk-on-jet-fuel energy. Words fire like a machine gun, punctuated by dolphin-like yips. Occasionally short-circuits into gibberish.

Chinese translated description: 会说话的泰瑟枪，女，20-35岁，精力充沛得像花栗鼠一样。话语如机关枪般射出，不时夹杂着海豚般的叫声。偶尔会短路，变成胡言乱语。

Text to be synthesized: "Oh my gosh! That's the biggest cupcake ever wait is it alive? It winked at me! Nuh-uh! Okay maybe just a little bite Ahhhh it bit me back!"

MoE-TTS

ElevenLabs

MiniMax

Comments: The description uses several metaphors to imply that the generated results should be high-pitched, fast, loud, and sharp. MoE-TTS understood the metaphorical technique and generated the most expressive results.

Case 5: Anime Character Dubbing

Description: Anime character dubbing, young male voice, exaggerated emotional expression, fast pace, fully showcasing the character's personality and emotions, increasing the fun of the story.

Chinese translated description: 动漫人物配音，年轻男声，情感表达夸张，语速快，充分展现角色的个性与情绪，增加故事的趣味性。

Text to be synthesized: "NANI?! You expect me to believe that was your final form? HAH! I haven't even used 10% of my power yet! BANZAI!"

MoE-TTS

ElevenLabs

MiniMax

Comments: The description "exaggerated" implies that the generated results should be highly expressive. MoE-TTS generated the strongest expressiveness and was more consistent with a younger age.

MoE-TTS

Paper Information

Authors

Abstract

Framework Architecture

Out-of-domain Descriptions Showcases

Case 1: Middle-aged Puppet Marriage Counselor

MoE-TTS

ElevenLabs

MiniMax

Case 2: US Actor with New York Accent

MoE-TTS

ElevenLabs

MiniMax

Case 3: Iron-Willed Strategist

MoE-TTS

ElevenLabs

MiniMax

Case 4: Talking Taser

MoE-TTS

ElevenLabs

MiniMax

Case 5: Anime Character Dubbing

MoE-TTS

ElevenLabs

MiniMax