ClariTTS: Feature-ratio Normalization and Duration Stabilization for Code-mixed Multi-speaker Speech Synthesis

Submitted to INTERSPEECH 2024 (Paper ID: 608)

Abstract

Recent text-to-speech (TTS) models synthesize remarkably natural speech in both code-mixed and cross-lingual settings. However, individual words in code-mixed text are often synthesized with an unnatural accent, because speaker-related features can carry linguistic features of the speaker’s source language. To address this problem, we propose ClariTTS, which synthesizes each word in a code-mixed text with the accent appropriate to that word’s language. Specifically, we propose a feature-ratio normalized affine coupling layer for a flow-based TTS model, which disentangles speaker and linguistic features so that the accent of the target speaker’s source language does not carry over into the target language. Furthermore, we introduce duration stabilization training objectives to ensure stable duration prediction in code-mixed TTS. Experimental results demonstrate that ClariTTS reliably generates code-mixed speech with clear pronunciation while preserving speaker identity.

Intra-lingual TTS samples

Method
Korean speaker
English speaker
Ground truth
Text
과학자들은 유전학 분야에서 지속적으로 새로운 발견을 하고 있습니다. (Scientists are constantly making new discoveries in the field of genetics.)
I prefer tea over coffee.
MS-iSTFT-VITS
YourTTS
SANE-TTS
ClariTTS (Proposed)
w/o cross-speaker duration loss
w/o FRN, cross-speaker duration loss

Cross-lingual TTS samples

Method
Korean speaker
English speaker
Ground truth
Text
The traffic is backed up on the highway.
추운 겨울 밤, 가족들이 벽난로 주변에 모였습니다. (On a cold winter night, families gathered around the fireplace.)
MS-iSTFT-VITS
YourTTS
SANE-TTS
ClariTTS (Proposed)
w/o cross-speaker duration loss
w/o FRN, cross-speaker duration loss

Code-mixed TTS samples

Method
Korean speaker
English speaker
Ground truth
Text
음악을 재생할 수 있는 매체로는 Youtube Music, Apple CarPlay가 있어요. (Some of the media that can play music include YouTube Music and Apple CarPlay.)
Americano 한 잔, Black tea 두 잔 주세요. (One Americano and two Black teas, please.)
Americano 한 잔, Black tea 두 잔 주세요. (One Americano and two Black teas, please.)
음악을 재생할 수 있는 매체로는 Youtube Music, Apple CarPlay가 있어요. (Some of the media that can play music include YouTube Music and Apple CarPlay.)
MS-iSTFT-VITS
YourTTS
SANE-TTS
ClariTTS (Proposed)
w/o cross-speaker duration loss
w/o FRN, cross-speaker duration loss