Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization

We present generated emotional speech samples from three emotional text-to-speech (TTS) models.

Target: Target samples, provided for reference.

emospeech: Baseline EmoSpeech model.

cosyvoice: Baseline CosyVoice model.

our Emo-DPO: Our proposed Emo-DPO model.

Figure 1. Overview of the proposed Emo-DPO approach: (a) instruction tuning, (b) Emo-DPO training, and (c) the inference process.

Sample 1 (Emotion: Angry)

Text: Monster made a deep bow.

	Target	emospeech	cosyvoice	our Emo-DPO
Samples

Sample 2 (Emotion: Surprise)

Text: I thought you meant how old are you?

	Target	emospeech	cosyvoice	our Emo-DPO
Samples

Sample 3 (Emotion: Happy)

Text: She is now choosing skirt to wear.

	Target	emospeech	cosyvoice	our Emo-DPO
Samples

Sample 4 (Emotion: Neutral)

Text: Take courage all isn't lost yet.

	Target	emospeech	cosyvoice	our Emo-DPO
Samples

Sample 5 (Emotion: Angry)

Text: You are not a runaway, who are you?

	Target	emospeech	cosyvoice	our Emo-DPO
Samples

Sample 6 (Emotion: Sad)

Text: I chose the right way.

	Target	emospeech	cosyvoice	our Emo-DPO
Samples

Sample 7 (Emotion: Surprise)

Text: The football teams give a tea party.

	Target	emospeech	cosyvoice	our Emo-DPO
Samples