Emotion, as an essential component of human communication, can be conveyed by various prosodic features, such as pitch, intensity, and speaking rate [1]. It plays an important role at the semantic and pragmatic levels of spoken language. An adequate rendering of emotion in speech is critically important in expressive text-to-speech [2, 3], personalized speech synthesis, and intelligent dialogue systems, such as social robots and conversational agents.
Emotional voice conversion is a voice conversion (VC) technique for converting the emotion from the source utterance to the target utterance, while preserving the linguistic information and the speaker identity, as illustrated in Figure 1. It shares many similarities with conventional voice conversion. Both of them aim to convert non-linguistic information through mapping features from source to target. They are also different because conventional voice conversion techniques consider prosody-related features as speaker-independent. As speaker identity is thought to be characterized by the physical attributes of the speaker, which are strongly affected by the spectrum and determined by the voice quality of the individual [4], conventional VC studies mainly focus on spectrum conversion.
Codes & Speech Samples: https://kunzhou9646.github.io/Odyssey2020_emotional_VC//
Figure 1: An emotional voice conversion system is trained on speech data of different emotional patterns from the same speaker. At run-time, the system takes the speech of one emotion as input, and converts to that of another [7–9].
On the other hand, emotion is inherently supra-segmental and hierarchical in nature [5,6], and it is manifested both in spectrum and prosody. Therefore, emotion cannot be handled simply at the frame level, as it is insufficient to convert only the spectral features frame-by-frame.
Early studies of VC marked a success by training the spectral mapping on parallel speech data between source and target speaker [10, 11]. Many statistical approaches have been proposed in the past decades, such as Gaussian Mixture Model (GMM) [12], and Partial Least Square Regression (PLSR) [13]. Other VC methods, such as Non-negative Matrix Factorization (NMF) [14] and exemplar-based sparse representation schemes [15–17] were designed to address the over-smoothing problem in VC.
With the advent of deep learning, the performance of VC systems has been markedly improved. Neural Network (NN)-based methods, such as Restricted Boltzmann Machine (RBM) [18], Feed Forward NN [19], Deep Neural Network (DNN) [20], and Recurrent Neural Network (RNN) [21] have helped VC systems to achieve a higher level in terms of modeling the relationship between source and target features. More recently, some approaches have been proposed to eliminate the need of parallel data for VC such as Deep Bidirectional Long-Short-Term Memory (DBLSTM) with i-vector [22], variational auto-encoder [23], DBLSTM with model using Phonetic Posteriorgrams (PPGs) [24], and GANs [25–28]. The successful practice of these deep learning methods became the source of inspiration for this study.
The early studies on emotional VC [29, 30] focused only on prosody conversion, using a classification and regression tree to decompose the pitch contour of the source speech into a hierarchical structure, followed by GMM and regression-based clustering methods. One attempt to handle both spectrum and prosody conversion [31–33] was the GMM-based technique [7]. Another approach, proposed in [34], combines a Hidden Markov Model (HMM), GMM and an F0 segment selection method to transform F0, duration and spectrum. More recently, an exemplar-based emotional VC approach based on NMF [8] and other NN-based models such as DNN [35], Deep Belief Network (DBN) [36] and DBLSTM [37] were also proposed to perform spectrum and prosody mapping. Inspired by the success of sequence-to-sequence models in text-to-speech synthesis, a sequence-to-sequence encoder-decoder model [38] was also investigated to transform the intonation of a human voice, and it can convert the emotion of neutral utterances effectively. Rule-based emotional VC approaches such as [39] are capable of controlling the degree of emotion in a dimensional space such as arousal and valence.
We note that the training of most emotional VC systems relies on parallel training data, which is not practical in real-life applications. Motivated by this, a style transfer auto-encoder [9] was recently proposed, which can learn from non-parallel training data. A source-target pair in a non-parallel dataset represents the source and target emotions but, unlike a pair in a parallel dataset, the two utterances can carry different linguistic content, which makes data collection much easier.
Prosody conveys linguistic, para-linguistic and various types of non-linguistic information, such as speaker identity, emotion, intention, attitude and mood. It is observed that prosody is influenced by short-term as well as long-term dependencies [40,41]. We note that F0 is an essential prosodic factor with respect to the intonation in speech, describing the variation of the vocal pitch over different time domains, from the syllable to the entire utterance. Therefore, it should be represented with hierarchical modeling [42–44], for example, in multiple time scales. The early studies on emotional voice conversion use a Logarithm Gaussian (LG)-based linear transformation method [7–9, 29, 30] to convert F0. Such a single-value representation of F0 does not characterize speech prosody well [5, 6, 41]. Continuous Wavelet Transform (CWT) decomposes a signal into frequency components and represents it at different temporal scales, which makes it an excellent instrument for this purpose. CWT has already been applied in voice conversion frameworks such as DKPLS [40] and exemplar-based conversion [44, 45]. It has also been shown to be effective for emotional voice conversion, as in the NMF-based approach [42, 43] and the DBLSTM-based approach [37], and it has been investigated for emotional speech synthesis in [46–48].
In this paper, we propose an emotional VC framework with CycleGAN that is trained on non-parallel data to map a speaker’s speech from one emotion to another. We use the Mel-spectrum to represent the acoustic features and CWT coefficients to represent the prosodic features. Our framework does not rely on parallel training data or on any extra modules such as speech recognition or time alignment procedures.
The main contributions of this paper include: 1) we propose a parallel-data-free emotional voice conversion framework; 2) we show the effect of prosody for emotional voice conversion; 3) we effectively convert spectral and prosodic features with CycleGAN; 4) we investigate different training strategies for spectrum and prosody conversion, namely separate training and joint training; and 5) we outperform the baseline approaches and achieve high-quality converted voice.
This paper is organized as follows: In Section 2, we describe the details of CycleGAN and CWT decomposition of F0. In Section 3, we explain our proposed spectrum and prosody conversion for emotional VC framework. Section 4 reports the experimental results. Conclusion is given in Section 5.
2.1. CycleGAN
Recently, generative adversarial learning has become very popular in machine learning applications, such as computer vision [49–52] and speech information processing [53, 54]. In this paper, we focus on a GAN-based network called CycleGAN, which is capable of learning a mapping between source and target from non-parallel training data. It is based on the concept of adversarial learning [55], which is to train a generative model to find a solution in a min-max game between two neural networks, called the generator (G) and the discriminator (D). CycleGAN was first proposed for computer vision [56,57], and then extended to various fields including speech synthesis and voice conversion [26,27,58].
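A CycleGAN is trained with three losses: adversarial loss, cycle-consistency loss, and identity-mapping loss, which together learn the forward and inverse mappings between source and target. The adversarial loss measures how distinguishable the distribution of the converted data is from that of the source or target data. For the forward mapping, it is defined as:
$$\mathcal{L}_{ADV}(G_{X \to Y}, D_Y, X, Y) = \mathbb{E}_{y \sim P(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim P(x)}[\log(1 - D_Y(G_{X \to Y}(x)))] \tag{1}$$
The closer the distribution of the converted data is to that of the target data, the smaller $\mathcal{L}_{ADV}$ becomes. The adversarial loss only tells us whether $G_{X \to Y}(x)$ follows the distribution of the target data, but it does not help to preserve the contextual information. In order to guarantee that the contextual information of $x$ and $G_{X \to Y}(x)$ will be consistent, the cycle-consistency loss is given as:
$$\mathcal{L}_{CYC}(G_{X \to Y}, G_{Y \to X}) = \mathbb{E}_{x \sim P(x)}[\lVert G_{Y \to X}(G_{X \to Y}(x)) - x \rVert_1] + \mathbb{E}_{y \sim P(y)}[\lVert G_{X \to Y}(G_{Y \to X}(y)) - y \rVert_1] \tag{2}$$
This loss encourages $G_{X \to Y}$ and $G_{Y \to X}$ to find an optimal pseudo pair of $(x, y)$ through circular conversion. To preserve the linguistic information without any external processes, an identity-mapping loss is introduced as below:
$$\mathcal{L}_{ID}(G_{X \to Y}, G_{Y \to X}) = \mathbb{E}_{x \sim P(x)}[\lVert G_{Y \to X}(x) - x \rVert_1] + \mathbb{E}_{y \sim P(y)}[\lVert G_{X \to Y}(y) - y \rVert_1] \tag{3}$$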
We note that CycleGAN is well-known for achieving remarkable results without parallel training data in many fields from computer vision to speech information processing. In this paper, we propose to use CycleGAN for spectrum and prosody conversion for emotional voice conversion with non-parallel training data.
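As a rough illustration of the adversarial, cycle-consistency and identity-mapping losses described above, the following sketch evaluates them on toy data. The linear generators, the logistic discriminator and the helper names are hypothetical stand-ins, not the networks used in this paper, and the L1 terms are averaged over elements for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy models: linear generators and a logistic discriminator.
def make_generator(weight):
    return lambda feats: feats @ weight

def make_discriminator(w):
    return lambda feats: sigmoid(feats @ w)

def adversarial_loss(d_y, g_xy, x, y, eps=1e-8):
    # E_y[log D_Y(y)] + E_x[log(1 - D_Y(G_{X->Y}(x)))]
    real = np.mean(np.log(d_y(y) + eps))
    fake = np.mean(np.log(1.0 - d_y(g_xy(x)) + eps))
    return real + fake

def cycle_loss(g_xy, g_yx, x, y):
    # Mean absolute error after the round trips X -> Y -> X and Y -> X -> Y
    forward = np.mean(np.abs(g_yx(g_xy(x)) - x))
    backward = np.mean(np.abs(g_xy(g_yx(y)) - y))
    return forward + backward

def identity_loss(g_xy, g_yx, x, y):
    # A generator fed data from its own output domain should leave it unchanged
    return np.mean(np.abs(g_yx(x) - x)) + np.mean(np.abs(g_xy(y) - y))

dim = 4
x = rng.normal(size=(8, dim))   # source-domain frames
y = rng.normal(size=(6, dim))   # target-domain frames; a different count is fine
eye = np.eye(dim)
g_xy, g_yx = make_generator(eye), make_generator(eye)
d_y = make_discriminator(rng.normal(size=dim))

adv = adversarial_loss(d_y, g_xy, x, y)
cyc = cycle_loss(g_xy, g_yx, x, y)      # zero for identity generators
idt = identity_loss(g_xy, g_yx, x, y)   # zero for identity generators
print(adv, cyc, idt)
```

Note that none of the three terms requires `x` and `y` to be paired or to have the same number of frames, which is what makes training on non-parallel data possible.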
2.2. Continuous Wavelet Transform (CWT)
It is well-known that emotion can be conveyed by various prosodic features, such as pitch, intensity and speaking rate. F0 is an essential part with respect to the intonation. We note that the modeling of F0 is a challenging task as F0 is discontinuous due to the unvoiced parts, and hierarchical in nature. As a multi-scale modeling method, CWT makes it possible to decompose F0 to different variations over multiple time scales.
Wavelet transform provides an easily interpretable visual representation of signals. Using CWT, a signal can be decomposed into different temporal scales. We note that CWT has been successfully used in speech synthesis [59, 60] and voice conversion [41,45].
Figure 2: 10-scale CWT analysis of F0 [44,48] of an utterance in neutral and angry tone with the same linguistic content. Panel (b): ’It is well never to know an author’ in an angry tone.
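Given a bounded, continuous signal $f_0$, its CWT representation $W(f_0)(\tau, t)$ can be written as:
$$W(f_0)(\tau, t) = \tau^{-1/2} \int_{-\infty}^{\infty} f_0(x)\, \psi\!\left(\frac{x - t}{\tau}\right) dx \tag{4}$$
where $\psi$ is the Mexican hat mother wavelet. The original signal $f_0$ can be recovered from the wavelet representation $W(f_0)$ by the inverse transform, given as:
$$f_0(t) = \int_{-\infty}^{\infty} \int_{0}^{\infty} W(f_0)(\tau, x)\, \tau^{-5/2}\, \psi\!\left(\frac{t - x}{\tau}\right) dx\, d\tau \tag{5}$$
However, if not all information on $W(f_0)$ is available, the reconstruction is incomplete. In this study, we fix the analysis at ten discrete scales, one octave apart. The decomposition is given as:
$$W_i(f_0)(t) = W(f_0)(2^{i+1}\tau_0, t)\,(i + 2.5)^{-5/2} \tag{6}$$
The reconstructed $f_0$ is approximated as:
$$f_0(t) = \sum_{i=1}^{10} W_i(f_0)(t)\,(i + 2.5)^{-5/2} \tag{7}$$
where $i = 1, \dots, 10$ and $\tau_0$ is the base scale. These timing scales were originally proposed in [61] and used in prosody models [62, 63]. We believe that the prosody of emotion is expressed differently at different time scales. With the multi-scale representation, lower scales capture the short-term variations and higher scales capture the long-term variations. In this way, we are able to model and transfer the F0 variations from the micro-prosody level to the whole-utterance level for emotion pairs. In Figure 2, we use an example to compare two utterances with the same content but different emotions across time scales.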
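The ten-scale analysis and the approximate reconstruction can be sketched numerically as follows. This is only an illustration under stated assumptions: it uses the dyadic form common in the CWT prosody literature (scale $i$ uses $\tau = 2^{i+1}\tau_0$ with a weight $(i + 2.5)^{-5/2}$), it measures `tau0` in frames, and the function names are hypothetical.

```python
import numpy as np

def mexican_hat(u):
    # Mexican hat (Ricker) mother wavelet with the usual normalisation constant
    return (2.0 / (np.sqrt(3.0) * np.pi ** 0.25)) * (1.0 - u ** 2) * np.exp(-0.5 * u ** 2)

def cwt_decompose(f0_norm, num_scales=10, tau0=1.0):
    """Return scaled wavelet coefficients of shape (num_scales, N).

    f0_norm: interpolated, log-scaled, normalised F0 contour, one value per frame.
    tau0 is expressed in frames here, which is an assumption of this sketch.
    """
    f0_norm = np.asarray(f0_norm, dtype=float)
    n = len(f0_norm)
    t = np.arange(n)
    coeffs = np.zeros((num_scales, n))
    for i in range(1, num_scales + 1):
        tau = tau0 * 2.0 ** (i + 1)                      # scales one octave apart
        # kernel[t, x] = psi((x - t) / tau); the integral becomes a sum over frames x
        kernel = mexican_hat((t[None, :] - t[:, None]) / tau)
        w = (kernel @ f0_norm) / np.sqrt(tau)
        coeffs[i - 1] = w * (i + 2.5) ** (-2.5)
    return coeffs

def cwt_reconstruct(coeffs):
    # Approximate inverse: weighted sum of the scale components
    weights = (np.arange(1, coeffs.shape[0] + 1) + 2.5) ** (-2.5)
    return weights @ coeffs

frames = np.arange(200)
f0_norm = np.sin(2 * np.pi * frames / 50.0) + 0.5 * np.sin(2 * np.pi * frames / 200.0)
coeffs = cwt_decompose(f0_norm)
recon = cwt_reconstruct(coeffs)
print(coeffs.shape, recon.shape)
```

Because only ten scales are kept, the reconstruction is an approximation; in practice the reconstructed contour is typically re-normalised before the utterance-level mean and variance are restored.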
In this section, we propose an emotional VC framework that performs both spectrum and prosody conversion using Cycle-Consistent Adversarial Networks. As F0 is an essential component of prosody, we propose to use CWT to decompose the one-dimensional F0 into 10 time scales. The proposed framework is trained on non-parallel speech data, which eliminates the need for parallel training data, and it effectively converts the emotion of the source speaker from one state to another.
The training phase of the proposed framework is given in Figure 3. We first extract spectral and F0 features from both source and target utterances using WORLD vocoder [64]. It is noted that F0 features extracted from WORLD vocoder are discontinuous, due to the voiced/unvoiced parts within an utterance. Since CWT is sensitive to the discontinuities in F0, we perform the following pre-processing steps for F0: 1) linear interpolation over unvoiced regions, 2) transformation of F0 from linear to logarithmic scale, and 3) normalization of the resulting F0 to zero mean and unit variance. We then perform the CWT decomposition of F0 as given in Eq. (6) and Algorithm 1.
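The three pre-processing steps listed above can be sketched as below. The function name is hypothetical, and the sketch assumes that unvoiced frames are marked with an F0 of zero and that leading or trailing unvoiced frames simply take the nearest voiced value.

```python
import numpy as np

def preprocess_f0(f0):
    """f0: per-frame F0 in Hz, with 0 marking unvoiced frames (assumed convention)."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    if not voiced.any():
        raise ValueError("utterance has no voiced frames")
    frames = np.arange(len(f0))
    # 1) linear interpolation over unvoiced regions
    f0_interp = np.interp(frames, frames[voiced], f0[voiced])
    # 2) linear to logarithmic scale
    log_f0 = np.log(f0_interp)
    # 3) zero mean and unit variance; the statistics are kept for de-normalisation
    mean, std = log_f0.mean(), log_f0.std()
    return (log_f0 - mean) / std, mean, std

norm_f0, mean, std = preprocess_f0([0.0, 100.0, 0.0, 200.0, 0.0])
print(norm_f0)
```

Keeping `mean` and `std` makes it possible to map a converted contour back to Hz with `np.exp(contour * std + mean)`.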
We train CycleGAN for spectrum conversion with 24-dimensional Mel-cepstral coefficients (MCEPs), and for prosody conversion with 10-dimensional F0 features for each speech frame. We note that the source and target training data are from the same speaker, but consist of different linguistic content and different emotions. By learning forward and inverse mappings simultaneously using adversarial and cycle-consistency losses, we encourage CycleGAN to find an optimal mapping between source and target spectrum and prosody features.
The run-time conversion phase is shown in Figure 4. We first use the WORLD vocoder to extract spectral features, F0, and aperiodicity (AP) from a given source utterance. Similar to the training phase, we encode the spectral features as 24-dimensional MCEPs, and obtain 10-scale F0 features through CWT decomposition of F0, which is also reported in Algorithm 1. The 24-dimensional MCEPs and 10-scale F0 features are fed into the corresponding trained CycleGAN models to perform spectrum and prosody conversion separately. We reconstruct the converted F0 with the CWT synthesis approximation method given in Eq. (7) and Algorithm 2. Finally, we use the WORLD vocoder to synthesize the converted emotional speech.
Figure 3: The training phase of the proposed CycleGAN-based emotional VC framework with WORLD vocoder. CWT is used to decompose F0 into 10 scales. Blue boxes are involved in the training, while grey boxes are not.
Figure 4: The run-time conversion phase of the proposed CycleGAN-based emotional VC framework. Pink boxes represent the networks which have been trained during the training phase.
We conduct both objective and subjective experiments to assess the performance of our proposed parallel-data-free emotional VC framework. In this paper, we use the emotional speech corpus [65], which is recorded by a professional American actress, speaking English utterances with the same content in seven different emotions. We randomly choose four emotions: 1) neutral, 2) angry, 3) sad, and 4) surprise.
We perform CWT to decompose F0 into 10 different scales and train CycleGAN using non-parallel training data to learn the relationships of spectral and prosody features between different emotions of the same speaker. CycleGAN-based spectrum conversion framework, denoted as baseline, is used as the reference framework. In this framework, F0 is transformed through LG-based linear transformation method.
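For reference, an LG-based linear transformation of F0 is commonly implemented by matching the mean and standard deviation of log-F0 between source and target. The sketch below follows that common form; the function and argument names are hypothetical, and the exact variant used by the baseline is not specified here.

```python
import numpy as np

def lg_convert_f0(f0_src, src_stats, tgt_stats):
    """Convert F0 by matching log-F0 mean and standard deviation.

    src_stats, tgt_stats: (mean, std) of log-F0 over voiced training frames.
    Unvoiced frames (F0 == 0, assumed convention) are left untouched.
    """
    f0_src = np.asarray(f0_src, dtype=float)
    mu_s, sd_s = src_stats
    mu_t, sd_t = tgt_stats
    out = np.zeros_like(f0_src)
    voiced = f0_src > 0
    out[voiced] = np.exp((np.log(f0_src[voiced]) - mu_s) * sd_t / sd_s + mu_t)
    return out

# Hypothetical statistics: source centred at 100 Hz, target at 200 Hz
converted = lg_convert_f0([0.0, 100.0], (np.log(100.0), 0.2), (np.log(200.0), 0.3))
print(converted)
```

Every voiced frame is shifted and scaled by the same global statistics, so the shape of the contour at different time scales is not modelled, which is the limitation the CWT-based representation is meant to address.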
We are also interested in the effect of joint and separate training for spectrum and prosody features. In joint training, we concatenate the 24 MCEPs and 10 CWT coefficients to form one vector per frame, and train a joint spectrum-prosody CycleGAN. In separate training, we train a spectrum CycleGAN with the MCEP features and a prosody CycleGAN with the CWT coefficients. Hereafter, we denote the separate training as CycleGAN-Separate, and the joint training as CycleGAN-Joint. The comparison of the frameworks can also be seen in Table 1.
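The difference between the two training strategies comes down to how the per-frame inputs are assembled, as the following sketch shows. The frame count and the zero-filled arrays are placeholders for illustration only.

```python
import numpy as np

num_frames = 128                        # hypothetical utterance length
mceps = np.zeros((num_frames, 24))      # 24 MCEPs per frame
cwt_f0 = np.zeros((num_frames, 10))     # 10 CWT scales of F0 per frame

# Joint training: a single model sees one 34-dimensional vector per frame
joint_input = np.concatenate([mceps, cwt_f0], axis=1)

# After conversion, the joint output is split back into its two streams
mcep_part, f0_part = np.split(joint_input, [24], axis=1)

# Separate training: two models, each fed its own feature stream
spectrum_input, prosody_input = mceps, cwt_f0
print(joint_input.shape, mcep_part.shape, f0_part.shape)
```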
4.1. Experimental Setup
The speech data in [65] is sampled at 16kHz with 16-bit per sample. The audio files for each emotion are manually segmented into 100 short parallel sentences (approximately 3 minutes). Among them, 90 and 10 sentences are provided as training and evaluation sets, respectively. In order to make sure that our proposed model is trained under non-parallel condition, the first 45 utterances are used for the source and the other 45 sentences are used for the target. 24 Mel-cepstral coefficients (MCEPs), fundamental frequency (F0), and APs are then extracted every 5 ms using WORLD vocoder [64]. As a pre-processing step, we normalize the source and target MCEPs per dimension.
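The non-parallel split and the per-dimension normalization can be sketched as follows. The sentence identifiers and helper names are hypothetical, and zero-mean, unit-variance scaling is assumed for the normalization.

```python
import numpy as np

def split_non_parallel(train_ids):
    """Source emotion uses the first half, target emotion the second half."""
    half = len(train_ids) // 2
    return train_ids[:half], train_ids[half:]

def normalize_per_dim(feats):
    """feats: (frames, dims). Returns normalised features and the statistics."""
    mean = feats.mean(axis=0, keepdims=True)
    std = feats.std(axis=0, keepdims=True)
    return (feats - mean) / std, mean, std

train_ids = list(range(1, 91))            # 90 training sentences, ids are hypothetical
source_ids, target_ids = split_non_parallel(train_ids)
shared = set(source_ids) & set(target_ids)   # empty: no sentence appears on both sides

rng = np.random.default_rng(0)
mceps = rng.normal(loc=3.0, scale=2.0, size=(500, 24))   # stand-in for real MCEPs
mceps_norm, mean, std = normalize_per_dim(mceps)
print(len(source_ids), len(target_ids), len(shared))
```

Since the source and target sides never share a sentence, the model cannot rely on frame-level alignment between the two emotions.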
We report the performance of three frameworks that use CycleGAN, namely 1) baseline 2) CycleGAN-Joint, and 3) CycleGAN-Separate. For the baseline, we extract 24-dimensional MCEPs and one-dimensional F0 features for each frame. For both CycleGAN-Separate and CycleGAN-Joint, each speech frame is represented with 24-dimensional MCEPs and 10-dimensional F0 features. We adopt the same network structure for all frameworks. We design the generators using a one-dimensional (1D) CNN to capture the relationship among the overall features while preserving the temporal structure.
Table 1: The comparison of the baseline, CycleGAN-Joint, and CycleGAN-Separate for spectrum and prosody conversion.
The 1D CNN is incorporated with down-sampling, residual, and up-sampling layers. As for the discriminator, a 2D CNN is employed. For all frameworks, we fix the weight of the cycle-consistency loss, and the identity-mapping loss is only used for the first iterations of training, with its own weight, to guide the learning process.
We train the networks using the Adam optimizer with a batch size of 1. We set the initial learning rates to 0.0002 for the generators and 0.0001 for the discriminators. We keep the learning rate constant for an initial number of iterations, after which it decays linearly over the following iterations. The momentum term $\beta_1$ is set to 0.5. As CycleGAN does not require the source-target pair to be of the same length, time alignment is not necessary.
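The learning-rate schedule described above (a constant phase followed by a linear decay) can be written as a small function. The iteration counts `hold_iters` and `decay_iters` are placeholders chosen for illustration, and decaying all the way to zero is an assumption of this sketch.

```python
def learning_rate(step, base_lr, hold_iters, decay_iters):
    """Constant for hold_iters steps, then linear decay to zero over decay_iters steps."""
    if step < hold_iters:
        return base_lr
    progress = min(step - hold_iters, decay_iters) / decay_iters
    return base_lr * (1.0 - progress)

# Generator and discriminator start from the initial rates given in the text
for step in (0, 100, 150, 200):
    gen_lr = learning_rate(step, 0.0002, hold_iters=100, decay_iters=100)
    dis_lr = learning_rate(step, 0.0001, hold_iters=100, decay_iters=100)
    print(step, gen_lr, dis_lr)
```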
4.2. Objective Evaluation
We perform objective evaluation to assess the performance of both spectrum and prosody conversion. In all experiments, we use 45-45 non-parallel utterances during training.
4.2.1. Spectrum Conversion
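We employ Mel-cepstral distortion (MCD) between the converted and target Mel-cepstra to measure the performance of spectrum conversion, which is given as follows:
$$\mathrm{MCD}\,[\mathrm{dB}] = \frac{10}{\ln 10} \sqrt{2 \sum_{i=1}^{24} \left(c_i^{c} - c_i^{t}\right)^2} \tag{8}$$
where $c_i^{c}$ and $c_i^{t}$ represent the $i$-th coefficients of the converted and target MCEP sequences, respectively. A lower MCD indicates better performance.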
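A minimal sketch of the MCD computation is given below, using the standard definition of the measure. It assumes that the converted and target sequences are already time-aligned frame by frame and averages the per-frame distortion over the utterance; the function name is hypothetical.

```python
import numpy as np

def mel_cepstral_distortion(mcep_conv, mcep_tgt):
    """Both arrays: (frames, 24), assumed to be time-aligned. Result in dB."""
    diff = np.asarray(mcep_conv, dtype=float) - np.asarray(mcep_tgt, dtype=float)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(per_frame.mean())

target = np.zeros((50, 24))
converted = target.copy()
converted[:, 0] += 1.0      # one coefficient off by 1 in every frame
print(mel_cepstral_distortion(converted, target))
```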
Table 2: A comparison of the MCD results between CycleGAN-Joint and CycleGAN-Separate for three different emotion combinations.
Table 2 reports the MCD values for a number of settings in a comparative study. The MCD values are calculated for both joint and separate training of spectrum and prosody features. We conducted the experiments for three emotion combinations: 1) neutral-to-angry, 2) neutral-to-sad, and 3) neutral-to-surprise. We observed that all separate training settings consistently outperform those of joint training settings by achieving lower MCD values. For example, the overall MCD of separate training is 8.71, while it is 10.23 for joint training.
We note that the baseline trains CycleGAN only with spectral features. Therefore, its spectral distortion is expected to be the same as that of CycleGAN-Separate, which is why the MCD results of the baseline are not reported separately.
4.2.2. Prosody Conversion
We use Pearson Correlation Coefficient (PCC) and Root Mean Squared Error (RMSE) to report the performance of prosody conversion [41]. The RMSE between the converted F0 and the corresponding target F0 is defined as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(F0_i^{c} - F0_i^{t}\right)^2} \tag{9}$$
where $F0_i^{c}$ and $F0_i^{t}$ denote the converted and target interpolated F0 features, respectively, and $N$ is the length of the F0 sequence. We note that a lower RMSE value represents better F0 conversion performance.
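The PCC between the converted and target F0 sequences is given as:
$$\rho(F0^{c}, F0^{t}) = \frac{\mathrm{cov}(F0^{c}, F0^{t})}{\sigma_{F0^{c}}\, \sigma_{F0^{t}}} \tag{10}$$
where $\sigma_{F0^{c}}$ and $\sigma_{F0^{t}}$ are the standard deviations of the converted F0 sequences ($F0^{c}$) and the target F0 sequences ($F0^{t}$), respectively. We note that a higher PCC value represents better F0 conversion performance.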
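Both prosody metrics are straightforward to compute once the converted and target F0 sequences have the same length. The sketch below assumes interpolated, time-aligned contours in Hz; the function names are hypothetical.

```python
import numpy as np

def f0_rmse(f0_conv, f0_tgt):
    """Root mean squared error between two aligned F0 contours."""
    diff = np.asarray(f0_conv, dtype=float) - np.asarray(f0_tgt, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def f0_pcc(f0_conv, f0_tgt):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    return float(np.corrcoef(f0_conv, f0_tgt)[0, 1])

conv = [100.0, 200.0, 150.0, 180.0]
tgt = [110.0, 190.0, 160.0, 170.0]
print(f0_rmse(conv, tgt), f0_pcc(conv, tgt))
```

RMSE is sensitive to a constant offset between the two contours, whereas PCC only reflects how similarly they move, which is why the two are reported together.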
Table 3 reports the RMSE and PCC values of F0 conversion for a number of settings in a comparative study. In this experiment, we conducted three emotional conversion settings: 1) neutral-to-angry, 2) neutral-to-sad, and 3) neutral-to-surprise. We also report the overall performance. As for the RMSE results, we first observe that the proposed prosody conversion, based on CycleGAN with CWT-based F0 decomposition, outperforms the baseline, where F0 is converted with the LG-based linear transformation method. Secondly, the proposed separate training with CycleGAN for spectrum and CWT-based prosody conversion achieves a better overall result (RMSE: 63.03) than joint training (RMSE: 65.05), which is consistent with the MCD results in Table 2. The PCC results suggest that joint and separate training of CWT-based F0 features achieve similar results.
We would like to highlight that the proposed CWT-based modeling for F0 always outperforms the baseline framework that uses LG-based linear transformation method.
Table 3: A comparison of the RMSE and PCC results of the baseline, CycleGAN-Joint and CycleGAN-Separate for three different emotion combinations (neutral-to-angry, neutral-to-sad and neutral-to-surprise).
Figure 5: The XAB preference results with 95% confidence interval between the baseline and CycleGAN-Separate in emotion similarity experiments.
4.3. Subjective Evaluation
We further conduct two listening experiments to assess the proposed frameworks in terms of emotion similarity. We perform XAB tests, in which listeners are asked to choose which of A and B sounds more similar to the original target in terms of emotional expression. The XAB test has been widely used in speech synthesis tasks such as voice conversion [41], singing voice conversion [54] and emotional voice conversion [48]. In both experiments, 45-45 non-parallel utterances are used during training. We selected two emotion combinations for the listening experiments: 1) neutral-to-angry (N2A), and 2) neutral-to-surprise (N2S). 13 subjects participated in all the listening tests, and each of them listened to 80 converted utterances in total.
We first conduct an XAB test between the baseline and our proposed method to show the effect of our proposed framework, which performs separate training of CycleGAN-based conversion for spectrum and CWT-based F0 modeling. Consistent with the previous experiments, our proposed framework is again denoted as CycleGAN-Separate. Listeners are asked to listen to the source utterance, the baseline output, the output of our proposed method and the reference utterance. Then, they are asked to choose the one which sounds more similar to the reference in terms of emotional expression. We note that both frameworks perform spectral conversion in the same way, while our proposed framework performs a more sophisticated F0 conversion, in which F0 is modeled with CWT and then converted with CycleGAN. The results are reported in Figure 5 for the two emotional conversion scenarios, N2A and N2S. We observe that the proposed CycleGAN-Separate outperforms the baseline framework in both experiments, which shows the effectiveness of prosody modeling and conversion for emotional voice conversion.
We then conduct an XAB test between joint and separate training to assess the different training strategies for spectrum and prosody conversion. The results are reported in Figure 6 for the two emotional conversion scenarios, N2A and N2S. We observe that separate training (denoted as CycleGAN-Separate) performs much better than joint training (denoted as CycleGAN-Joint): CycleGAN-Separate achieves a preference of 93.6% on N2A and 96.5% on N2S.
Figure 6: The XAB preference results with 95% confidence interval between the CycleGAN-Joint and CycleGAN-Separate in emotion similarity experiments.
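A preference score with a 95% confidence interval, as reported for the XAB tests, can be obtained from raw vote counts as sketched below. The normal approximation to the binomial proportion is an assumption of this sketch (the interval method used for the figures is not stated), and the vote counts are hypothetical.

```python
import math

def preference_with_ci(votes_for_system, total_votes, z=1.96):
    """Preference proportion with a normal-approximation confidence interval."""
    p = votes_for_system / total_votes
    half_width = z * math.sqrt(p * (1.0 - p) / total_votes)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# Hypothetical example: 50 of 100 votes prefer one system
p, low, high = preference_with_ci(50, 100)
print(p, low, high)
```

For proportions close to 0 or 1, an interval such as the Wilson score interval would usually be a safer choice than the normal approximation.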
4.4. Joint vs. Separate Training of Spectrum and Prosody
We observe that the listeners prefer the separate training much more than the joint training. We consider that prosody is manifested at different time scales, which also consists of content-dependent and content-independent elements.
The joint training ties the CWT coefficients of F0 to the spectral features at the frame level, which assumes that prosody is content-dependent. With the limited number of training samples (45 pairs and around 3 minutes of speech), the CycleGAN model resulting from joint training does not generalize the emotional mapping well to unseen content at run-time inference. With separate training, the CycleGAN models for spectrum and prosody are trained independently. In this way, the prosody CycleGAN learns sufficiently well from the limited number of training samples between the emotion pairs in a content-independent manner. Therefore, separate training outperforms joint training in terms of emotion similarity.
In this paper, we propose a high-quality parallel-data-free emotional voice conversion framework. We perform both spectrum and prosody conversion based on CycleGAN. We provide a non-linear method which uses CWT to decompose F0 into different timing-scales. Moreover, we also study the joint and separate training of CycleGAN for spectrum and prosody conversion. We observe that separate training of spectrum and prosody can achieve better performance than joint training, in terms of emotion similarity. Experimental results show that our proposed emotional voice conversion framework can achieve better performance than the baseline with non-parallel training data.
This work is supported by Human-Robot Interaction Phase 1 (Grant No. 192 25 00054), National Research Foundation Singapore under the National Robotics Programme. It is also supported by National Research Foundation Singapore under the AI Singapore Programme (Award Number: AISG-100E-2018-006), and Programmatic Grant No. A18A2b0046 (Human Robot Collaborative AI for AME) and A1687b0033 (Neuromorphic Computing) from the Singapore Government’s Research, Innovation and Enterprise 2020 plan in the Advanced Manufacturing and Engineering domain.
[1] Klaus R Scherer, Rainer Banse, Harald G Wallbott, and Thomas Goldbeck, “Vocal cues in emotion encoding and decoding,” Motivation and emotion, vol. 15, no. 2, pp. 123–148, 1991.
[2] Rui Liu, Berrak Sisman, Feilong Bao, Guanglai Gao, and Haizhou Li, “Wavetts: Tacotron-based tts with joint time-frequency domain loss,” arXiv preprint arXiv:2002.00417, 2020.
[3] Rui Liu, Berrak Sisman, Jingdong Li, Feilong Bao, Guanglai Gao, and Haizhou Li, “Teacher-student training for robust tacotron-based tts,” in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.
[4] S Ramakrishnan, Speech Enhancement, Modeling and Recognition-Algorithms and Applications, BoD–Books on Demand, 2012.
[5] Yi Xu, “Speech prosody: A methodological review,” Journal of Speech Sciences, vol. 1, no. 1, pp. 85–115, 2011.
[6] Javier Latorre and Masami Akamine, “Multilevel parametric-base f0 model for speech synthesis,” in Ninth Annual Conference of the International Speech Communication Association, 2008.
[7] Ryo Aihara, Ryoichi Takashima, Tetsuya Takiguchi, and Yasuo Ariki, “Gmm-based emotional voice conversion using spectrum and prosody features,” American Journal of Signal Processing, vol. 2, no. 5, pp. 134–138, 2012.
[8] Ryo Aihara, Reina Ueda, Tetsuya Takiguchi, and Yasuo Ariki, “Exemplar-based emotional voice conversion using non-negative matrix factorization,” in Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific. IEEE, 2014, pp. 1–7.
[9] Jian Gao, Deep Chakraborty, Hamidou Tembine, and Olaitan Olaleye, “Nonparallel emotional speech conversion,” arXiv preprint arXiv:1811.01174, 2018.
[10] Masanobu Abe, Satoshi Nakamura, Kiyohiro Shikano, and Hisao Kuwabara, “Voice conversion through vector quantization,” Journal of the Acoustical Society of Japan (E), vol. 11, no. 2, pp. 71–76, 1990.
[11] Kiyohiro Shikano, Satoshi Nakamura, and Masanobu Abe, “Speaker adaptation and voice conversion by codebook mapping,” in 1991 IEEE International Symposium on Circuits and Systems. IEEE, 1991, pp. 594–597.
[12] Tomoki Toda, Alan W Black, and Keiichi Tokuda, “Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2222–2235, 2007.
[13] Elina Helander, Tuomas Virtanen, Jani Nurminen, and Moncef Gabbouj, “Voice conversion using partial least squares regression,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5, pp. 912–921, 2010.
[14] Daniel D Lee and H Sebastian Seung, “Algorithms for non-negative matrix factorization,” in Advances in neural information processing systems, 2001, pp. 556–562.
[15] Zhizheng Wu, Tuomas Virtanen, Eng Siong Chng, and Haizhou Li, “Exemplar-based sparse representation with residual compensation for voice conversion,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 10, pp. 1506–1521, 2014.
[16] Berrak Çişman, Haizhou Li, and Kay Chen Tan, “Sparse representation of phonetic features for voice conversion with and without parallel data,” in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2017, pp. 677–684.
[17] Berrak Sisman, Mingyang Zhang, and Haizhou Li, “A voice conversion framework with tandem feature sparse representation and speaker-adapted wavenet vocoder,” in Interspeech, 2018, pp. 1978–1982.
[18] Ling-Hui Chen, Zhen-Hua Ling, Li-Juan Liu, and Li-Rong Dai, “Voice conversion using deep neural networks with layer-wise generative training,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 22, no. 12, pp. 1859–1872, 2014.
[19] Srinivas Desai, Alan W Black, B Yegnanarayana, and Kishore Prahallad, “Spectral mapping using artificial neural networks for voice conversion,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5, pp. 954–964, 2010.
[20] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh, “A fast learning algorithm for deep belief nets,” Neural computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[21] Toru Nakashika, Tetsuya Takiguchi, and Yasuo Ariki, “High-order sequence modeling using speaker-dependent recurrent temporal restricted boltzmann machines for voice conversion,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[22] Jie Wu, Zhizheng Wu, and Lei Xie, “On the use of i-vectors and average voice model for voice conversion without parallel data,” in 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA). IEEE, 2016, pp. 1–6.
[23] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and Hsin-Min Wang, “Voice conversion from non-parallel corpora using variational auto-encoder,” in 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA). IEEE, 2016, pp. 1–6.
[24] Lifa Sun, Kun Li, Hao Wang, Shiyin Kang, and Helen Meng, “Phonetic posteriorgrams for many-to-one voice conversion without parallel data training,” in 2016 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2016, pp. 1–6.
[25] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and Hsin-Min Wang, “Voice conversion from unaligned corpora using variational autoencoding wasserstein generative adversarial networks,” arXiv preprint arXiv:1704.00849, 2017.
[26] Takuhiro Kaneko and Hirokazu Kameoka, “Parallel-data-free voice conversion using cycle-consistent adversarial networks,” arXiv preprint arXiv:1711.11293, 2017.
[27] Takuhiro Kaneko and Hirokazu Kameoka, “Cyclegan-vc: Non-parallel voice conversion using cycle-consistent adversarial networks,” in 2018 26th European Signal Processing Conference (EUSIPCO). IEEE, 2018, pp. 2100–2104.
[28] Berrak Sisman, Mingyang Zhang, Minghui Dong, and Haizhou Li, “On the study of generative adversarial networks for crosslingual voice conversion,” IEEE ASRU, 2019.
[29] Jianhua Tao, Yongguo Kang, and Aijun Li, “Prosody conversion from neutral speech to emotional speech,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1145–1154, 2006.
[30] Chung-Hsien Wu, Chi-Chun Hsia, Chung-Han Lee, and Mai-Chun Lin, “Hierarchical prosody conversion using regression-based clustering for emotional speech synthesis,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 6, pp. 1394–1405, 2009.
[31] Marc Schröder, “Emotional speech synthesis: A review,” in Seventh European Conference on Speech Communication and Technology, 2001.
[32] Akemi Iida, Nick Campbell, Fumito Higuchi, and Michiaki Yasumura, “A corpus-based speech synthesis system with emotion,” Speech communication, vol. 40, no. 1-2, pp. 161–187, 2003.
[33] Shumin An, Zhenhua Ling, and Lirong Dai, “Emotional statistical parametric speech synthesis using lstm-rnns,” in 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2017, pp. 1613–1616.
[34] Zeynep Inanoglu and Steve Young, “Data-driven emotion conversion in spoken english,” Speech Communication, vol. 51, no. 3, pp. 268–283, 2009.
[35] Jaime Lorenzo-Trueba, Gustav Eje Henter, Shinji Takaki, Junichi Yamagishi, Yosuke Morino, and Yuta Ochiai, “Investigating different representations for modeling and controlling multiple emotions in dnn-based speech synthesis,” Speech Communication, vol. 99, pp. 135–143, 2018.
[36] Zhaojie Luo, Tetsuya Takiguchi, and Yasuo Ariki, “Emotional voice conversion using deep neural networks with mcc and f0 features,” in 2016 IEEE/ACIS 15th International Conference on Computer and Information Science (ICIS). IEEE, 2016, pp. 1–5.
[37] Huaiping Ming, Dongyan Huang, Lei Xie, Jie Wu, Minghui Dong, and Haizhou Li, “Deep bidirectional lstm modeling of timbre and prosody for emotional voice conversion,” 2016.
[38] Carl Robinson, Nicolas Obin, and Axel Roebel, “Sequence-to-sequence modelling of f0 for speech emotion conversion,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6830–6834.
[39] Yawen Xue, Yasuhiro Hamada, and Masato Akagi, “Voice conversion for emotional speech: Rule-based synthesis with degree of emotion controllable in dimensional space,” Speech Communication, vol. 102, pp. 54–67, 2018.
[40] Gerard Sanchez, Hanna Silen, Jani Nurminen, and Moncef Gabbouj, “Hierarchical modeling of f0 contours for voice conversion,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[41] Berrak Sisman, Mingyang Zhang, and Haizhou Li, “Group sparse representation with wavenet vocoder adaptation for spectrum and prosody conversion,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 6, pp. 1085–1097, 2019.
[42] Huaiping Ming, Dongyan Huang, Minghui Dong, Haizhou Li, Lei Xie, and Shaofei Zhang, “Fundamental frequency modeling using wavelets for emotional voice conversion,” in 2015 International Conference on Affective Computing and Intelligent Interaction (ACII). IEEE, 2015, pp. 804–809.
[43] Huaiping Ming, Dongyan Huang, Lei Xie, Shaofei Zhang, Minghui Dong, and Haizhou Li, “Exemplar-based sparse representation of timbre and prosody for voice conversion,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 5175–5179.
[44] Berrak Şişman, Haizhou Li, and Kay Chen Tan, “Transformation of prosody in voice conversion,” in 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2017, pp. 1537–1546.
[45] Berrak Sisman and Haizhou Li, “Wavelet analysis of speaker dependent and independent prosody for voice conversion,” in Interspeech, 2018, pp. 52–56.
[46] Zhaojie Luo, Jinhui Chen, Tetsuya Takiguchi, and Yasuo Ariki, “Emotional voice conversion with adaptive scales f0 based on wavelet transform using limited amount of emotional data,” in INTERSPEECH, 2017, pp. 3399–3403.
[47] Zhaojie Luo, Jinhui Chen, Tetsuya Takiguchi, and Yasuo Ariki, “Emotional voice conversion using neural networks with arbitrary scales f0 based on wavelet transform,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2017, no. 1, pp. 18, 2017.
[48] Zhaojie Luo, Jinhui Chen, Tetsuya Takiguchi, and Yasuo Ariki, “Emotional voice conversion using dual supervised adversarial networks with continuous wavelet transform f0 features,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 10, pp. 1535–1548, 2019.
[49] Kenan Emir Ak, Joo Hwee Lim, Jo Yew Tham, and Ashraf Kassim, “Semantically consistent hierarchical text to fashion image synthesis with an enhanced-attentional generative adversarial network,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019.
[50] Kenan E Ak, Joo Hwee Lim, Jo Yew Tham, and Ashraf A Kassim, “Semantically consistent text to fashion image synthesis with an enhanced attentional generative adversarial network,” Pattern Recognition Letters, 2020.
[51] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas, “Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5907–5915.
[52] Kenan E Ak, Joo Hwee Lim, Jo Yew Tham, and Ashraf A Kassim, “Attribute manipulation generative adversarial networks for fashion images,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 10541–10550.
[53] Berrak Sisman, Mingyang Zhang, Sakti Sakriani, Haizhou Li, and Satoshi Nakamura, “Adaptive wavenet vocoder for residual compensation in gan-based voice conversion,” IEEE SLT, 2018.
[54] Berrak Sisman, Karthika Vijayan, Minghui Dong, and Haizhou Li, “SINGAN: Singing voice conversion with generative adversarial networks,” in 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2019.
[55] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
[56] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2223–2232.
[57] Yongyi Lu, Yu-Wing Tai, and Chi-Keung Tang, “Conditional cyclegan for attribute guided face image generation,” arXiv preprint arXiv:1705.09966, 2017.
[58] Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, and Nobukatsu Hojo, “Cyclegan-vc2: Improved cyclegan-based non-parallel voice conversion,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6820–6824.
[59] Hans Kruschke and Michael Lenz, “Estimation of the parameters of the quantitative intonation model with continuous wavelet analysis,” in Eighth European Conference on Speech Communication and Technology, 2003.
[60] Taniya Mishra, Jan Van Santen, and Esther Klabbers, “Decomposition of pitch curves in the general superpositional intonation model,” Speech Prosody, Dresden, Germany, 2006.
[61] Antti Santeri Suni, Daniel Aalto, Tuomo Raitio, Paavo Alku, Martti Vainio, et al., “Wavelets for intonation modeling in hmm speech synthesis,” in 8th ISCA Workshop on Speech Synthesis, Proceedings, Barcelona, August 31-September 2, 2013. ISCA, 2013.
[62] Martti Vainio, Antti Suni, Daniel Aalto, et al., “Continuous wavelet transform for analysis of speech prosody,” TRASP 2013-Tools and Resources for the Analysis of Speech Prosody, An Interspeech 2013 satellite event, August 30, 2013, Laboratoire Parole et Langage, Aix-en-Provence, France, Proceedings, 2013.
[63] Berrak Sisman, Grandee Lee, and Haizhou Li, “Phonetically aware exemplar-based prosody transformation,” in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 267–274.
[64] Masanori Morise, Fumiya Yokomori, and Kenji Ozawa, “World: a vocoder-based high-quality speech synthesis system for real-time applications,” IEICE TRANSACTIONS on Information and Systems, vol. 99, no. 7, pp. 1877–1884, 2016.
[65] Shuojun Liu, Dong-Yan Huang, Weisi Lin, Minghui Dong, Haizhou Li, and Ee Ping Ong, “Emotional facial expression transfer based on temporal restricted boltzmann machines,” in Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific. IEEE, 2014, pp. 1–7.