MoE-TTS

Enhancing Out-of-Domain Text Understanding for Description-based TTS via Mixture-of-Experts

Paper Information

Authors

Heyang Xue, Xuchen Song†, Yu Tang, Jianyu Chen, Yanru Chen, Yang Li, Yahui Zhou

Kunlun Inc.

Email: {heyang.xue, xuchen.song}@kunlun-inc.com

† Corresponding author

Abstract

Description-based text-to-speech (TTS) models exhibit strong performance on in-domain text descriptions, i.e., those encountered during training. However, in real-world applications, the diverse range of user-generated descriptions inevitably introduces numerous out-of-domain inputs that challenge the text understanding capabilities of these systems. To address this issue, we propose MoE-TTS, a description-based TTS model designed to enhance the understanding of out-of-domain text descriptions. MoE-TTS employs a modality-based mixture-of-experts (MoE) approach to augment a pre-trained textual large language model (LLM) with a set of specialized weights adapted to the speech modality, while keeping the original LLM frozen during training. This approach allows MoE-TTS to effectively leverage the pre-trained knowledge and text understanding abilities of textual LLMs. Our experimental results indicate that: first, even the most advanced closed-source commercial products can be challenged by carefully designed out-of-domain description test sets; second, MoE-TTS achieves superior performance in generating speech that more accurately reflects the descriptions.

Framework Architecture

MoE-TTS Framework

Building on the core idea of enhancing description-based TTS through the pre-trained knowledge and text understanding capabilities of large language models, we propose MoE-TTS. Our approach employs a mixture-of-experts framework that utilizes a modality-based routing strategy and modality-aware Transformer components to bridge large language models and TTS models.
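The modality-based routing idea can be illustrated with a small sketch. This is not the authors' implementation: the page does not specify which Transformer components are made modality-aware, so a toy linear layer stands in for them here, and every name (`LinearExpert`, `ModalityRoutedLayer`, `trainable_experts`) is hypothetical. The sketch only shows the routing principle: each token is sent to an expert according to its modality tag rather than a learned gate, and only the speech expert is marked trainable, while the text expert plays the role of the frozen pre-trained LLM weights.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence

Vector = List[float]
TEXT, SPEECH = "text", "speech"


@dataclass
class LinearExpert:
    """Toy stand-in for a modality-specific component: y = W x + b."""
    weight: List[List[float]]
    bias: Vector
    trainable: bool  # False mimics the frozen pre-trained LLM weights

    def __call__(self, x: Sequence[float]) -> Vector:
        return [
            sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(self.weight, self.bias)
        ]


class ModalityRoutedLayer:
    """Routes each token to an expert by its modality tag (no learned gate)."""

    def __init__(self, text_expert: LinearExpert, speech_expert: LinearExpert):
        self.experts: Dict[str, LinearExpert] = {
            TEXT: text_expert,
            SPEECH: speech_expert,
        }

    def forward(self, hidden: List[Vector], modalities: List[str]) -> List[Vector]:
        if len(hidden) != len(modalities):
            raise ValueError("need exactly one modality tag per token")
        return [self.experts[m](h) for h, m in zip(hidden, modalities)]

    def trainable_experts(self) -> List[str]:
        """Names of the experts that would receive gradient updates."""
        return [name for name, e in self.experts.items() if e.trainable]


# Assumed toy weights: the text expert is an identity map (frozen),
# the speech expert doubles its input and adds 1 (trainable).
text_expert = LinearExpert([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], trainable=False)
speech_expert = LinearExpert([[2.0, 0.0], [0.0, 2.0]], [1.0, 1.0], trainable=True)
layer = ModalityRoutedLayer(text_expert, speech_expert)

# One description (text) token followed by one speech token.
outputs = layer.forward([[1.0, 2.0], [3.0, 4.0]], [TEXT, SPEECH])
print(outputs)                    # [[1.0, 2.0], [7.0, 9.0]]
print(layer.trainable_experts())  # ['speech']
```

Because the routing depends only on the modality tag, text tokens pass through exactly the same weights as in the original LLM, which is one way to read the claim that the pre-trained text understanding is preserved.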

Out-of-Domain Description Showcases

Note: The ElevenLabs samples were generated with the multilingual_v2 API, because the alpha v3 was still inaccessible as of early August 2025.

Case 1: Middle-aged Puppet Marriage Counselor

Description: Middle-aged puppet marriage counselor, unusually calm, with a silky smooth yet hollow contralto voice, speaking at a suffocating pace.
Chinese translated description: 中年傀儡婚姻顾问,异常平静,丝绸般平滑却空洞的女低音,语速慢得令人窒息。
Text to be synthesized: "Relationships erode over centuries like mountains shaping valleys but sometimes you must weather the same storm together to grow stronger."

MoE-TTS

ElevenLabs

MiniMax

Comments: A contralto voice implies that the generated speaker should be female. MoE-TTS understood this implication, but ElevenLabs and MiniMax did not.

Case 2: US Actor with New York Accent

Description: US actor with a New York accent, versatile, articulate, with a dynamic pace, full of charm and charisma, attracting the attention of the audience.
Chinese translated description: 美国男演员,带有纽约口音,多才多艺,表达能力强,语速富有活力,充满魅力与感染力,吸引着听众的注意。
Text to be synthesized: "Ay! Macbeth's soliloquy isn't words, it's blood and thunder in the mouth!"

MoE-TTS

ElevenLabs

MiniMax

Comments: MoE-TTS produces the most expressive result and aligns more closely with the description.

Case 3: Iron-Willed Strategist

Description: Iron-Willed Strategist, Female, mature (35-50), deep and resonant voice, commanding tone with biting wit no-nonsense, formidable, and effortlessly dominant.
Chinese translated description: 钢铁意志战略家,女性,成熟(35-50岁),声音低沉洪亮,语气强势,略带尖刻的智慧——严肃认真、令人敬畏,且不费吹灰之力便能掌控全局。
Text to be synthesized: "Victory isn't debated, it's seized. Every second spent doubting is a gift to your enemies. Do I make myself clear?"

MoE-TTS

ElevenLabs

MiniMax

Comments: MoE-TTS emphasizes key words in the sentences (e.g., "gift"). Overall, its output is more in line with a deep, resonant voice, a commanding tone, and a serious, formidable demeanor.

Case 4: Talking Taser

Description: Talking Taser, Female, 20-35, chipmunk-on-jet-fuel energy. Words fire like a machine gun, punctuated by dolphin-like yips. Occasionally short-circuits into gibberish.
Chinese translated description: 会说话的泰瑟枪,女,20-35岁,精力充沛得像花栗鼠一样。话语如机关枪般射出,不时夹杂着海豚般的叫声。偶尔会短路,变成胡言乱语。
Text to be synthesized: "Oh my gosh! That's the biggest cupcake ever wait is it alive? It winked at me! Nuh-uh! Okay maybe just a little bite Ahhhh it bit me back!"

MoE-TTS

ElevenLabs

MiniMax

Comments: The description uses several metaphors to imply that the generated speech should be high-pitched, fast, loud, and sharp. MoE-TTS understood these metaphors and generated the most expressive result.

Case 5: Anime Character Dubbing

Description: Anime character dubbing, young male voice, exaggerated emotional expression, fast pace, fully showcasing the character's personality and emotions, increasing the fun of the story.
Chinese translated description: 动漫人物配音,年轻男声,情感表达夸张,语速快,充分展现角色的个性与情绪,增加故事的趣味性。
Text to be synthesized: "NANI?! You expect me to believe that was your final form? HAH! I haven't even used 10% of my power yet! BANZAI!"

MoE-TTS

ElevenLabs

MiniMax

Comments: The word "exaggerated" in the description implies that the generated speech should be highly expressive. MoE-TTS generated the most expressive result, and its voice was more consistent with a young speaker.