ClariTTS: Feature-ratio Normalization and Duration Stabilization for Code-mixed Multi-speaker Speech Synthesis
Submitted to INTERSPEECH 2024 (Paper ID: 608)
Abstract
Recent text-to-speech (TTS) models synthesize remarkably natural speech in both code-mixed and cross-lingual settings. However, words in code-mixed text are often synthesized with unnatural accents, because speaker-related features can carry linguistic features of the speaker’s source language. To solve this problem, we propose ClariTTS, which synthesizes speech with an accent appropriate to the language of each word in code-mixed text. Specifically, we propose a feature-ratio normalized affine coupling layer for the flow-based TTS model, which disentangles speaker and linguistic features so that the accent of the target speaker’s source language does not leak into the target language. Furthermore, we introduce duration stabilization training objectives to ensure stable duration prediction in code-mixed TTS. Experimental results demonstrate that ClariTTS reliably generates code-mixed speech with clear pronunciation while preserving speaker identity.
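For readers unfamiliar with the building block named in the abstract, the following is a minimal sketch of a generic speaker-conditioned affine coupling layer, as used in flow-based models. It is not the feature-ratio normalized layer proposed in the paper, whose details are not given on this page. The class name `AffineCoupling`, the single linear conditioner, and the toy dimensions are all assumptions made for illustration.

```python
import numpy as np


class AffineCoupling:
    """Generic affine coupling layer (illustrative; NOT the paper's FRN layer)."""

    def __init__(self, channels, cond_dim, seed=0):
        assert channels % 2 == 0, "channels must split into two equal halves"
        rng = np.random.default_rng(seed)
        self.half = channels // 2
        # Assumption: one linear map stands in for the usual deep conditioner.
        self.W = rng.normal(scale=0.1, size=(self.half + cond_dim, 2 * self.half))
        self.b = np.zeros(2 * self.half)

    def _params(self, xa, g):
        # Predict log-scale and shift from the untouched half and the
        # speaker conditioning vector g.
        h = np.concatenate([xa, g], axis=-1) @ self.W + self.b
        log_s, t = h[..., : self.half], h[..., self.half:]
        return np.tanh(log_s), t  # tanh keeps the scale bounded

    def forward(self, x, g):
        xa, xb = x[..., : self.half], x[..., self.half:]
        log_s, t = self._params(xa, g)
        yb = xb * np.exp(log_s) + t
        logdet = log_s.sum(axis=-1)  # log |det Jacobian| per frame
        return np.concatenate([xa, yb], axis=-1), logdet

    def inverse(self, y, g):
        ya, yb = y[..., : self.half], y[..., self.half:]
        log_s, t = self._params(ya, g)  # ya equals xa, so params match forward
        xb = (yb - t) * np.exp(-log_s)
        return np.concatenate([ya, xb], axis=-1)


# Toy usage: 5 frames, 4 feature channels, 3-dimensional speaker embedding.
layer = AffineCoupling(channels=4, cond_dim=3)
x = np.random.default_rng(1).normal(size=(5, 4))
g = np.tile(np.array([0.5, -1.0, 2.0]), (5, 1))
y, logdet = layer.forward(x, g)
x_rec = layer.inverse(y, g)
```

Because the first half of the channels passes through unchanged, the scale and shift can be recomputed exactly at inversion time. That property is what makes the layer invertible regardless of how the conditioner mixes in speaker information.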
Intra-lingual TTS samples
Text: 새로운 발견을 하고 있습니다. (Scientists are constantly making new discoveries in the field of genetics.)
Method (audio samples not reproduced here):
Ground truth
MS-iSTFT-VITS
YourTTS
SANE-TTS
ClariTTS (Proposed)
w/o cross-speaker duration loss
w/o FRN, cross-speaker duration loss
Cross-lingual TTS samples
Text: 벽난로 주변에 모였습니다. (On a cold winter night, families gathered around the fireplace.)
Method (audio samples not reproduced here):
Ground truth
MS-iSTFT-VITS
YourTTS
SANE-TTS
ClariTTS (Proposed)
w/o cross-speaker duration loss
w/o FRN, cross-speaker duration loss
Code-mixed TTS samples
Text (sample 1): Apple CarPlay가 있어요. (Some of the media that can play music include YouTube Music and Apple CarPlay.)
Text (sample 2): (One Americano and two Black teas, please.)
Text (sample 3): (One Americano and two Black teas, please.)
Text (sample 4): Apple CarPlay가 있어요. (Some of the media that can play music include YouTube Music and Apple CarPlay.)
Method (audio samples not reproduced here):
Ground truth
MS-iSTFT-VITS
YourTTS
SANE-TTS
ClariTTS (Proposed)
w/o cross-speaker duration loss
w/o FRN, cross-speaker duration loss