SP121-Multi-speaker Emotional Acoustic Modeling for CNN-based Speech Synthesis


In this paper, we investigate multi-speaker emotional acoustic modeling methods for convolutional neural network (CNN) based speech synthesis system. For emotion modeling, we extend to the speech synthesis system that learns a latent embedding space of emotion, derived from a desired emotional identity, and we use emotion code and mel-frequency spectrogram as an emotion identity. In order to model speaker variation in a text-to-speech (TTS) system, we use speaker representations such as trainable speaker embedding and speaker code. We have implemented speech synthesis systems combining speaker representation and emotion representation and compared them by experiments. Experimental results have demonstrated that the multi-speaker emotional speech synthesis approach using trainable speaker embedding and emotion representation from mel spectrogram achieves higher performance when compared with other approaches in terms of naturalness, speaker similarity, and emotion similarity.


There are no reviews yet.

Be the first to review “SP121-Multi-speaker Emotional Acoustic Modeling for CNN-based Speech Synthesis”
Contact UsHere's your new discount product tab.
Shopping Cart