We present emotional speech samples generated by three emotional text-to-speech (TTS) models.


Target: Target samples, provided for reference.

emospeech: Baseline emospeech model.

cosyvoice: Baseline cosyvoice model.

Our Emo-DPO: Our proposed Emo-DPO model.

Figure 1. Overview of the proposed Emo-DPO approach: (a) instruction tuning, (b) Emo-DPO training, and (c) the inference process.


Sample 1 (Emotion: Angry)
Text: Monster made a deep bow.
Audio samples: Target | emospeech | cosyvoice | our Emo-DPO


Sample 2 (Emotion: Surprise)
Text: I thought you meant how old are you?
Audio samples: Target | emospeech | cosyvoice | our Emo-DPO


Sample 3 (Emotion: Happy)
Text: She is now choosing skirt to wear.
Audio samples: Target | emospeech | cosyvoice | our Emo-DPO


Sample 4 (Emotion: Neutral)
Text: Take courage all isn't lost yet.
Audio samples: Target | emospeech | cosyvoice | our Emo-DPO


Sample 5 (Emotion: Angry)
Text: You are not a runaway, who are you?
Audio samples: Target | emospeech | cosyvoice | our Emo-DPO


Sample 6 (Emotion: Sad)
Text: I chose the right way.
Audio samples: Target | emospeech | cosyvoice | our Emo-DPO


Sample 7 (Emotion: Surprise)
Text: The football teams give a tea party.
Audio samples: Target | emospeech | cosyvoice | our Emo-DPO