Voice-controlled IoT services have become quite popular. Speech recognition enables seamless interaction between users and services. Many IoT devices have built-in microphones that listen for user commands, such as instructing a TV to turn on or a coffee machine to prepare coffee. The intelligence of these services is constantly evolving so as to better understand their users’ states. For instance, several online recommendation systems rely on speaker identification or emotion recognition to provide recommendations for purchases or restaurants. These suggestions may be presented based on the age of the user or the user’s current mood.
As the capabilities of voice-control systems increase, new privacy and security attacks become feasible. Voice is one of the most important sources of affective data. It includes various embedded metadata, such as the who, when, where and what, that may be extracted from the voice signal. Sending a raw signal to a service provider’s cloud for further analysis can reveal deeply sensitive personal information. Alepis and Patsakis [2] presented and analysed the potential risks of voice assistants in mobile devices, showing how urgent it is to develop privacy-preserving architectures for speech analysis that extract the distinguishing features from the speech without compromising individual privacy.
Recently, computational paralinguistics has attracted the attention of researchers due to its potential for practical IoT applications. It helps to understand diverse speaker states, traits, and vocal behaviours [23]. One of the most popular objectives of computational paralinguistics is emotion recognition to enable naturalistic human-computer interaction. It has enhanced the quality of several cloud-based services, such as call centres, but it raises concerns for privacy and data security. The main question here is how to analyse speech without compromising the user’s privacy, that is, how to remove sensitive information from the voice signal before releasing it to a third party. In this work, a privacy-preserving framework based on voice conversion is proposed to sanitize speech data. It aims to normalise a sensitive part of the speech (such as emotional state) while preserving the signal utility (speech content) before sending it to the cloud for further analysis. First, the speech is pre-processed to compute the features to be hidden, which are then used as targets to train the feature extraction model. The CycleGAN architecture [29] is used as the feature extractor. Then, the output features are used to re-generate the voice files using a state-of-the-art vocoder, WORLD [19].
To evaluate the trade-off between data utility and privacy, the proposed method is tested on an emotion recognition task using the RAVDESS dataset [12]. The results show that the proposed solution can decrease the accuracy of state-of-the-art paralinguistic models such as emotion recognition, while only minimally affecting the accuracy of speech recognition and speaker identification techniques. The code and results are available from the project page.
Due to resource limitations on edge devices, speech analysis is outsourced to the cloud for best performance. However, service providers aim to expand their ability to understand additional information about speakers by developing models that process voice input and detect the speakers’ current conditions. They can collect sensitive behaviour patterns from voice input that may violate user privacy in numerous ways. They may infer a user’s mental state, stress level, smoking habits, overall health condition, indications of Parkinson’s disease, sleep patterns, and levels of exercise [21]. For instance, Amazon has patented technology that can analyse users’ voices to determine emotions and/or mental health conditions. It allows understanding speaker commands and responding according to the speaker’s feelings in order to provide highly targeted content [1]. Therefore, speech analysis seems to play an especially important role when it comes to advertising content related to physical or emotional states.
Emotions are a universal aspect of human speech that conveys behaviour. The physical characteristics of emotion expression in several psychiatric conditions have been investigated in [8]. When services listen to users’ voices and monitor their emotions, the resulting critical decision-making may affect users’ lives, in areas ranging from fitness trackers for well-being to suitability for recruitment. Adding these feelings or health conditions to user profiles on the service provider side will open many new privacy issues. Therefore, a privacy-preserving layer is proposed and evaluated to protect emotional privacy as a sensitive part of speech while maintaining user experience. This layer bridges the communication between users and a service provider’s cloud, and serves as a wrapper around the emotional part of the voice input to prevent service providers from monitoring the emotions that are associated with users’ speech.
Generative Adversarial Networks (GANs) consist of two networks: a generator and a discriminator [7]. The generator aims to generate new data similar to the expected data, while the discriminator decides whether an input is real or fake, that is, produced by the generator. CycleGAN [29] is a GAN variant that uses two generators and two discriminators. Let X and Y be two different domains between which the generators translate: generator G maps from domain X to Y, and generator F maps from Y to X. In addition, there are two adversarial discriminators, D_X and D_Y, where D_X aims to distinguish between objects in domain X and the outputs F(Y), and D_Y aims to discriminate between objects in domain Y and the outputs G(X).
Figure 1: Block diagram of the proposed privacy-preserving framework for speech analysis
The power of CycleGAN comes from two objective functions: an adversarial loss and a cycle-consistency loss. The adversarial loss expresses the objective of each generator, which attempts to fool its corresponding discriminator into being less able to distinguish the generated output from the real one. The cycle-consistency loss measures the error of translating a sample from Y to X and then back again to Y (and likewise from X to Y and back to X). The full objective, as defined in [29], is:
$$\mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, Y, X) + \lambda \, \mathcal{L}_{cyc}(G, F)$$
where the hyperparameter $\lambda$ controls the relative significance of the two objectives. CycleGAN is the core architecture behind the privacy-preserving framework for speech analysis, which aims to learn sensitive representations in speech. Similar to the CycleGAN-VC2 model in [10], the features extracted by CycleGAN can be used to replace the sensitive information in the speech with other, non-sensitive data, without loss of utility for specific tasks. The proposed framework consists of the following modules, as shown in Figure 1.
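For reference, the adversarial and cycle-consistency terms of the CycleGAN objective can be written out as in the original CycleGAN formulation [29]. The notation below ($p_{data}$ for the data distributions, lower-case $x$ and $y$ for samples from X and Y, the L1 norm in the cycle term) is taken from that paper rather than from this work, and voice-conversion variants such as CycleGAN-VC2 [10] add further terms that are not shown here.

```latex
\begin{align}
\mathcal{L}_{GAN}(G, D_Y, X, Y) &=
  \mathbb{E}_{y \sim p_{data}(y)}\big[\log D_Y(y)\big]
  + \mathbb{E}_{x \sim p_{data}(x)}\big[\log\big(1 - D_Y(G(x))\big)\big] \\
\mathcal{L}_{cyc}(G, F) &=
  \mathbb{E}_{x \sim p_{data}(x)}\big[\lVert F(G(x)) - x \rVert_1\big]
  + \mathbb{E}_{y \sim p_{data}(y)}\big[\lVert G(F(y)) - y \rVert_1\big]
\end{align}
```

The term $\mathcal{L}_{GAN}(F, D_X, Y, X)$ is obtained by swapping the roles of the two domains in the first equation.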
3.1 Pre-processing
Unsupervised learning is used to extract the representation from speech. The distinguishing signal features are extracted by applying transformations or sampling strategies to the input data and using the resulting outcomes as labels [6]. The most effective features in speech processing are the F0 contour, the spectral envelope, and aperiodicity information. To obtain them, WORLD is used to estimate these features from the raw speech signal using three algorithms. First, a fast and reliable F0 extractor finds the intervals between zero crossings and peaks of the waveform to estimate the fundamental frequency (F0) of the speech [18]. Second, CheapTrick estimates the spectral envelope using the F0 information [16]. Finally, Definitive Decomposition Derived Dirt-Cheap (D4C) estimates the aperiodicity of the speech signal, which is the power ratio between the speech signal and the aperiodic component of the signal, i.e. noise [17].
Figure 2: Acoustic features comparison between three emotional states: neutral, angry, and happy respectively.
Figure 3: A basic architecture of CycleGAN to transform acoustic parameters of emotional utterances such that the modified speech conveys a neutral utterance.
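The zero-crossing idea behind the F0 extraction step in Section 3.1 can be illustrated with a deliberately simplified sketch. This is not the WORLD algorithm from [18], which combines several interval types and candidate scoring; it only shows how the spacing of rising zero crossings yields a period estimate. The function name, the synthetic 220 Hz test tone, and the 16 kHz sample rate are assumptions made for illustration.

```python
import math


def estimate_f0_zero_crossings(samples, sample_rate):
    """Estimate F0 (Hz) from the mean interval between rising zero crossings.

    Returns 0.0 when fewer than two crossings are found (treated as unvoiced).
    """
    crossing_times = []
    for i in range(1, len(samples)):
        prev, cur = samples[i - 1], samples[i]
        if prev < 0.0 <= cur:
            # Linearly interpolate the sub-sample position of the crossing.
            frac = -prev / (cur - prev)
            crossing_times.append((i - 1 + frac) / sample_rate)
    if len(crossing_times) < 2:
        return 0.0
    span = crossing_times[-1] - crossing_times[0]
    mean_period = span / (len(crossing_times) - 1)
    return 1.0 / mean_period


# Synthetic 50 ms tone standing in for a voiced speech segment (assumption).
sample_rate = 16000
true_f0 = 220.0
signal = [
    math.sin(2 * math.pi * true_f0 * n / sample_rate + 0.3)
    for n in range(800)
]
estimated_f0 = estimate_f0_zero_crossings(signal, sample_rate)
silence_f0 = estimate_f0_zero_crossings([0.0] * 100, sample_rate)
```

Real speech is not a pure sinusoid, so a practical extractor needs low-pass filtering and voicing decisions on top of this; the sketch only covers the interval-to-frequency step.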
3.2 Conversion Process
Leveraging non-parallel voice conversion (VC), the feature extraction model is trained to identify the sensitive features in the speech and convert them to non-sensitive features. For example, prosodic features are the most relevant to emotion recognition tasks and can be computed directly from the signal by applying transform functions. Note that these features differ markedly among emotion categories; see Figure 2. Then, the computed features are used to train a feature extraction model, which minimizes the computational overhead of extracting these specific features. Finally, feature conversion is applied to hide the sensitive data, as shown in Figure 3, and the speech signal is re-generated using WORLD, whose synthesis algorithm generates high-quality synthesized speech.
The sensitive features in speech are determined according to the specific speech processing task. Assuming that the sensitive information to be hidden is the speaker’s emotion (decreasing emotion recognition accuracy) and that the desired speech processing is to analyse the spoken content and identify the current speaker (maintaining speech and speaker recognition accuracy), the settings for the experiment were as follows:
4.1 Emotional Speech Dataset
Speech audio-only files (16-bit, 48 kHz .wav) from the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [12] were used as the dataset. It contains 1440 files: 60 recordings per actor × 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech emotions include calm, happy, sad, angry, fearful, surprise, and disgust. While two expressive styles (happy and angry) are used as case studies, the proposed method can be applied to other emotions.
4.2 Experimental Settings
Pre-processing and Feature Extraction. The dataset is reorganised by grouping the different emotional speech recordings together as the source and using the neutral speech as the target. The waveforms are then downsampled to 16 kHz, and the acoustic parameters, namely the logarithmic fundamental frequency (log F0), spectral envelopes (SEs), and aperiodicities (APs), are extracted using 5 ms frames.
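The source/target grouping described above can be sketched from the RAVDESS file names alone. RAVDESS encodes its metadata as seven two-digit fields (modality, vocal channel, emotion, intensity, statement, repetition, actor), with emotion codes 01 to 08. The helper names, the example file list, and the choice of happy and angry as source emotions with neutral as target are assumptions for illustration; the text does not specify how the split was implemented.

```python
# RAVDESS emotion codes as used in the third field of each file name.
EMOTION_CODES = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}


def parse_ravdess_name(filename):
    """Return the emotion, statement, and actor encoded in a RAVDESS file name."""
    stem = filename.rsplit(".", 1)[0]
    fields = stem.split("-")
    if len(fields) != 7:
        raise ValueError(f"unexpected RAVDESS file name: {filename}")
    _modality, _channel, emotion, _intensity, statement, _repetition, actor = fields
    return {
        "emotion": EMOTION_CODES[emotion],
        "statement": int(statement),
        "actor": int(actor),
    }


def split_source_target(filenames, source_emotions=("happy", "angry"),
                        target_emotion="neutral"):
    """Group files into emotional source utterances and neutral targets."""
    source, target = [], []
    for name in filenames:
        emotion = parse_ravdess_name(name)["emotion"]
        if emotion in source_emotions:
            source.append(name)
        elif emotion == target_emotion:
            target.append(name)
    return source, target


# Hypothetical file list for illustration.
example_files = [
    "03-01-01-01-01-01-01.wav",  # neutral, actor 1
    "03-01-03-01-01-01-01.wav",  # happy, actor 1
    "03-01-05-02-02-01-12.wav",  # angry, actor 12
    "03-01-04-01-01-01-02.wav",  # sad, actor 2 (ignored by this split)
]
source_files, target_files = split_source_target(example_files)
total_files = 60 * 24  # recordings per actor x actors, as stated in Section 4.1
```

Because CycleGAN training is non-parallel, the two groups do not need to be aligned utterance by utterance.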
Emotion Conversion is another form of voice conversion that focuses on transforming the prosody parameters of the speech. In [9], a data-driven emotion conversion system was proposed to output expressive speech by transforming a neutral utterance into three target emotions: anger, surprise, and sadness. Moreover, López et al. [13] presented a speaking-style conversion method to convert normal utterances to Lombard speech. First, F0 is estimated, and then duration and spectral conversion (vocal tract characteristics) are performed. The key step in emotion conversion is transforming the spectral envelope of each F0 segment using CycleGAN-VC2 [10], which has been trained on the RAVDESS dataset. Spectral features are mapped using CycleGAN from utterances spoken in emotional ways to the corresponding features of normal speech. Finally, the mapped features are converted to normal speech waveforms using WORLD.
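The text does not state how the F0 contour itself is mapped from emotional to normal speech. CycleGAN-VC-style pipelines commonly use a logarithm Gaussian normalised transformation for this, which shifts and scales log F0 so that its mean and standard deviation match the target domain. The sketch below illustrates that technique under the assumption that it applies here; the function names and the toy F0 values are hypothetical.

```python
import math


def log_f0_stats(f0_values):
    """Mean and standard deviation of log F0 over voiced frames (F0 > 0)."""
    logs = [math.log(f) for f in f0_values if f > 0]
    mean = sum(logs) / len(logs)
    variance = sum((v - mean) ** 2 for v in logs) / len(logs)
    return mean, math.sqrt(variance)


def convert_f0(f0_values, source_stats, target_stats):
    """Map voiced F0 values so their log statistics match the target domain.

    Unvoiced frames (F0 == 0) are passed through unchanged.
    """
    src_mean, src_std = source_stats
    tgt_mean, tgt_std = target_stats
    converted = []
    for f in f0_values:
        if f <= 0:
            converted.append(0.0)
        else:
            z = (math.log(f) - src_mean) / src_std
            converted.append(math.exp(z * tgt_std + tgt_mean))
    return converted


# Toy contours in Hz (hypothetical): an angry source and a neutral target.
angry_f0 = [0.0, 280.0, 320.0, 300.0, 0.0, 310.0]
neutral_f0 = [0.0, 190.0, 210.0, 200.0, 195.0]

source_stats = log_f0_stats(angry_f0)
target_stats = log_f0_stats(neutral_f0)
converted_f0 = convert_f0(angry_f0, source_stats, target_stats)
converted_stats = log_f0_stats(converted_f0)
```

In practice the statistics would be computed once over each training domain rather than per utterance.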
Speech Recognition. The Google Cloud Speech-to-Text API is used to evaluate speech recognition on the generated audio files. It can recognize real-time streaming audio using Google’s machine learning technology [1].
Speaker Recognition is an extremely challenging task that often requires high performance under real-time conditions. To ensure high confidence that a person is speaking and has been correctly identified, a model trained on VoxCeleb2 [4] is used [28]. All audio files are converted to 16-bit streams at 16 kHz for consistency. Spectrograms of size 512 x 300 are then generated for 3 seconds of speech. Mean and variance normalisation is performed on every frequency bin of the spectrum. These spectrograms are then used as input to the CNN.
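The per-bin mean and variance normalisation mentioned above can be sketched as follows. The spectrogram is represented as a plain list of rows, one per frequency bin, so that the example has no dependencies; the function name, the small `eps` guard against constant bins, and the toy 3 x 4 input are assumptions, and the real input would be 512 bins by 300 frames.

```python
import math


def normalise_per_frequency_bin(spectrogram, eps=1e-8):
    """Normalise each frequency bin (row) to zero mean and unit variance.

    `spectrogram` is a list of rows; each row holds one bin's values over time.
    """
    normalised = []
    for bin_values in spectrogram:
        mean = sum(bin_values) / len(bin_values)
        variance = sum((v - mean) ** 2 for v in bin_values) / len(bin_values)
        std = math.sqrt(variance)
        # eps avoids division by zero for bins that are constant over time.
        normalised.append([(v - mean) / (std + eps) for v in bin_values])
    return normalised


# Toy spectrogram: 3 frequency bins x 4 frames (illustrative values only).
toy_spectrogram = [
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 10.0, 10.0, 10.0],
    [0.0, 5.0, 0.0, 5.0],
]
normalised = normalise_per_frequency_bin(toy_spectrogram)
```

Normalising along time within each bin removes channel and loudness offsets while keeping the temporal pattern that the CNN consumes.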
Emotion Recognition aims to automatically identify the affective state of users. An emotion classification model trained on the RAVDESS dataset is used to predict 8 emotion classes: 0 = neutral, 1 = calm, 2 = happy, 3 = sad, 4 = angry, 5 = fearful, 6 = disgust, 7 = surprised [15].
Existing studies range from analysing breaches of the voice-enabled systems themselves to revealing users’ private information by analysing their communications.
Voice-controlled Privacy. Voice is considered a unique form of biometric information and has been widely used in various IoT applications. Google Home, Amazon Alexa, and Apple Siri are well-known voice-based smart personal assistants. Many privacy and security breaches of voice-based systems have been reported in the literature. For example, an adversary can build an acoustic model of a victim and use that model to generate speech for any provided text. Spoofing a voice-based authentication system allows attackers to gain illegal access to the speaker’s private information [27].
Privacy-preserving Voice Conversion. The voice serves as a good index of several traits and physical characteristics. Several kinds of sensitive information have been extracted from voice input, such as emotions [26] and health state [20, 24]. For example, the age, height, and weight of a speaker can be predicted based solely on hearing his or her voice [11]. Further, the physical strength of individuals, especially men, can be assessed based only on the sound of their voice [25]. Mairesse et al. [14] proposed classification, regression and ranking models to learn the Big Five personality traits of a speaker. To tackle this issue, voice conversion has been used to preserve user privacy by hiding the sensitive representation. The key concept of voice conversion is converting the speech signal to that of a target speaker while preserving the linguistic content [3]. In [22], VoiceMask is proposed to mitigate security and privacy risks by concealing voiceprints and adding differential privacy. It sanitizes the audio signal received from the microphone by hiding the speaker’s voiceprint and then sends the perturbed speech to the voice input apps or the cloud [5].
Figure 4: Accuracy comparison between three speech analysis tasks: speaker recognition, speech recognition and emotion recognition
In general, the proposed framework was tested on 40 emotional recordings from the RAVDESS dataset that are different from the training set. These wave files in a particular style were converted to another (emotion-to-normal) via emotion conversion. Then, the linguistic information of the converted and the original speech signals was compared to evaluate the effect of the conversion process on it. The proposed method successfully hid the real emotion, reducing emotion recognition accuracy by 96%, albeit with a decrease in speech recognition accuracy, where the average word error rate (WER) is 35%. In addition, speaker recognition performance is measured by the equal error rate (EER), which is the rate at which acceptance and rejection errors are equal. The speaker recognition accuracy decreases slightly, by 0.12%. Increasing the number of training epochs is expected to improve these results. The results therefore indicate that the proposed method can preserve privacy in speech analysis. Figure 4 presents the comparative results between the original and generated voice signals across the three speech analysis tasks: speech recognition, speaker identification, and emotion classification.
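The word error rate reported above is conventionally computed as the word-level edit distance between the reference transcript and the recognised transcript, divided by the number of reference words. The sketch below shows that standard definition; the function name and the recognised transcript are hypothetical, and the exact scoring script used for the reported 35% is not specified in the text. The reference sentence is one of the two RAVDESS statements.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference transcript must not be empty")

    # Row-by-row Levenshtein distance over words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)


reference_text = "Kids are talking by the door"
# Hypothetical recogniser output for a converted utterance.
recognised_text = "kids are walking by door"
wer = word_error_rate(reference_text, recognised_text)
perfect = word_error_rate(reference_text, reference_text)
```

Here one substitution ("talking" to "walking") and one deletion ("the") over six reference words give a WER of 2/6.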
Voice-based systems continue to enhance user experience. Therefore, a voice anonymisation method for voice input is proposed to balance the trade-off between signal utility and speaker privacy. The challenge is how to sanitize the speech without degrading speech recognition accuracy. The evaluation results show the effectiveness of the proposed method in terms of projecting away sensitive representations (emotion) while preserving the speech quality. Further experiments will be conducted on speech analysis and on achieving adaptive speech recognition while preserving speaker privacy.
Additionally, to further strengthen the user privacy we will include filtering speech content to prevent similar outcomes using other techniques, such as sentiment analysis.
[1] [n. d.]. Cloud Speech-to-Text - Speech Recognition | Cloud Speech-to-Text | Google Cloud. https://cloud.google.com/speech-to-text/
[2] Efthimios Alepis and Constantinos Patsakis. 2017. Monkey says, monkey does: security and privacy on voice assistants. IEEE Access 5 (2017), 17841–17851.
[3] D Childers, B Yegnanarayana, and Ke Wu. 1985. Voice conversion: Factors responsible for quality. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 10. IEEE, 748–751.
[4] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman. 2018. VoxCeleb2: Deep speaker recognition. arXiv preprint arXiv:1806.05622 (2018).
[5] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman. 2018. VoxCeleb2: Deep speaker recognition. arXiv preprint arXiv:1806.05622 (2018).
[6] Carl Doersch and Andrew Zisserman. 2017. Multi-task self-supervised visual learning. In Proceedings of the IEEE International Conference on Computer Vision. 2051–2060.
[7] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in neural information processing systems. 2672–2680.
[8] Karol Grabowski, Agnieszka Rynkiewicz, Amandine Lassalle, Simon Baron-Cohen, Björn Schuller, Nicholas Cummins, Alice Baird, Justyna Podgórska-Bednarz, Agata Pieniążek, and Izabela Łucka. 2019. Emotional expression in psychiatric conditions: New technology for clinicians. Psychiatry and clinical neurosciences 73 (2019), 50–62.
[9] Zeynep Inanoglu and Steve Young. 2007. A system for transforming the emotion in speech: Combining data-driven conversion techniques for prosody and voice quality. In Eighth Annual Conference of the International Speech Communication Association.
[10] Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, and Nobukatsu Hojo. 2019. CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing.
[11] Robert M Krauss, Robin Freyberg, and Ezequiel Morsella. 2002. Inferring speakers’ physical attributes from their voices. Journal of Experimental Social Psychology 38, 6 (2002), 618–625.
[12] Steven R Livingstone and Frank A Russo. 2018. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PloS one (2018), e0196391.
[13] Ana Ramírez López, Shreyas Seshadri, Lauri Juvela, Okko Räsänen, and Paavo Alku. 2017. Speaking Style Conversion from Normal to Lombard Speech Using a Glottal Vocoder and Bayesian GMMs. 1363–1367.
[14] François Mairesse, Marilyn A Walker, Matthias R Mehl, and Roger K Moore. 2007. Using linguistic cues for the automatic recognition of personality in conversation and text. Journal of artificial intelligence research 30 (2007), 457–500.
[15] Marcogdepinto. 2019. marcogdepinto/Emotion-Classification-Ravdess. https://github.com/marcogdepinto/Emotion-Classification-Ravdess
[16] Masanori Morise. 2015. CheapTrick, a spectral envelope estimator for high-quality speech synthesis. Speech Communication (2015), 1–7.
[17] Masanori Morise. 2016. D4C, a band-aperiodicity estimator for high-quality speech synthesis. Speech Communication (2016), 57–65.
[18] Masanori Morise, Hideki Kawahara, and Haruhiro Katayose. 2009. Fast and reliable F0 estimation method based on the period extraction of vocal fold vibration of singing voice and speech. In Audio Engineering Society Conference: 35th International Conference: Audio for Games.
[19] Masanori Morise, Fumiya Yokomori, and Kenji Ozawa. 2016. WORLD: a vocoder-based high-quality speech synthesis system for real-time applications. IEICE TRANSACTIONS on Information and Systems (2016), 1877–1884.
[20] Iosif Mporas and Todor Ganchev. 2009. Estimation of unknown speaker’s height from speech. International Journal of Speech Technology 12, 4 (2009), 149–160.
[21] Scott R Peppet. 2014. Regulating the internet of things: first steps toward managing discrimination, privacy, security and consent. Tex. L. Rev. 93 (2014), 85.
[22] Jianwei Qian, Haohua Du, Jiahui Hou, Linlin Chen, Taeho Jung, and Xiang-Yang Li. 2018. Hidebehind: Enjoy Voice Input with Voiceprint Unclonability and Anonymity. In Proceedings of the 16th ACM Conference on Embedded Networked Sensor Systems. ACM, 82–94.
[23] Björn Schuller and Anton Batliner. 2013. Computational paralinguistics: emotion, affect and personality in speech and language processing. John Wiley & Sons.
[24] Björn Schuller, Stefan Steidl, Anton Batliner, Alessandro Vinciarelli, Klaus Scherer, Fabien Ringeval, Mohamed Chetouani, Felix Weninger, Florian Eyben, Erik Marchi, et al. 2013. The INTERSPEECH 2013 computational paralinguistics challenge: Social signals, conflict, emotion, autism.
[25] Aaron Sell, Gregory A Bryant, Leda Cosmides, John Tooby, Daniel Sznycer, Christopher Von Rueden, Andre Krauss, and Michael Gurven. 2010. Adaptations in humans for assessing physical strength from the voice. Proceedings of the Royal Society B: Biological Sciences 277, 1699 (2010), 3509–3518.
[26] George Trigeorgis, Fabien Ringeval, Raymond Brueckner, Erik Marchi, Mihalis A Nicolaou, Björn Schuller, and Stefanos Zafeiriou. 2016. Adieu features? end-to-end speech emotion recognition using a deep convolutional recurrent network. In 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP). 5200–5204.
[27] Zhizheng Wu, Nicholas Evans, Tomi Kinnunen, Junichi Yamagishi, Federico Alegre, and Haizhou Li. 2015. Spoofing and countermeasures for speaker verification: A survey. Speech Communication (2015), 130–153.
[28] Weidi Xie, Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. 2019. Utterance-level Aggregation For Speaker Recognition In The Wild. arXiv preprint arXiv:1902.10107 (2019).
[29] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision. 2223–2232.