In recent years, speech source separation has become an active research area. Speech source separation separates a mixture speech signal in signal space. Traditionally, it has been viewed as a signal processing problem, and different approaches have been proposed, such as computational auditory scene analysis (CASA) [1]. Matrix factorization methods, such as non-negative matrix factorization (NMF) [2, 3], are also widely used, as is independent component analysis (ICA) [4, 5, 6, 7, 8]. With the rapid growth of deep learning, several deep learning approaches have been used to separate speech signals, such as supervised separation [9, 10, 11, 12], deep clustering and the deep attractor network [13, 14, 15].
However, separating speech signals from a two-speaker signal is still a challenging task. Speech signals are high dimensional, and the properties of the different speakers in a two-speaker signal are highly correlated with each other, which can degrade the quality of the output.
Instead of separating speech in signal space, de-mixing the different speaker properties of a two-speaker signal in embedding space might be more efficient. A speaker embedding is low dimensional, and it projects a variable-length acoustic signal into a fixed-length embedding space [16]. This property makes speaker embeddings more convenient for further use than representations in signal space. The obtained speaker embeddings might be beneficial for downstream tasks such as speaker identification [17, 18, 19, 20] and speech recognition [21, 22].
In this work, we propose a speaker embedding de-mixing approach for separating the speaker embeddings in a two-speaker signal. The proposed approach contains two steps. In step one, we use a residual TDNN network to learn high-quality speaker embeddings from clean speech data. After training, the embedding of each speaker is extracted and collected. In step two, a speaker embedding de-mixing network is trained with a reconstruction loss. The de-mixing network takes as input a two-speaker signal together with the embedding of one of its speakers, and it outputs the embedding of the other speaker. In other words, suppose the input contains a target speaker and an interfering speaker. Given the two-speaker mixture signal and the embedding of the interfering speaker, the proposed approach outputs the embedding of the target speaker; conversely, given the mixture signal and the embedding of the target speaker, it outputs the embedding of the interfering speaker.
To the best of our knowledge, the proposed approach is the first to directly de-mix speaker embeddings from a two-speaker signal. This is the main contribution of this work. The benefits of the proposed approach are manifold. For example, on a home device, the embeddings of some speakers might already be available; the proposed approach could then be used to obtain the embedding of the other speaker in a two-speaker signal. The de-mixed speaker embedding might further be used for downstream tasks such as speaker verification [23, 24] and speech recognition [25].
The rest of this report is organized as follows. Section 2 introduces the model architectures of step one and step two. Section 3 describes the experimental design, including the data used and the experiment setup. Section 4 presents the results, followed by discussion and analysis. Section 5 gives the conclusion and the plan for future work, and Section 6 contains the acknowledgements.
In this section, the model structure used in this work is introduced. It consists of two steps. Step one learns clean speaker representations; step two uses the learned speaker embeddings to train a speaker embedding de-mixing network. The goal of step one is to learn a high-quality embedding for each speaker in the dataset. In step two, the two-speaker mixture data is first projected into embedding space, resulting in a mixture embedding. The mixture embedding and the embedding of one of the speakers are both input to a de-mixing function. The output is an estimate of the embedding of the other speaker.
2.1. Step One: Learning High Quality Speaker Representations
Figure 1: The diagram of step one. A denotes the residual TDNN-based speaker embedding extractor. B denotes the speaker embedding classifier.
Figure 1 shows the diagram of step one. The clean speech signal is input to a speaker identification network A. After A is trained, the embedding of each speaker is extracted from the bottleneck layer of A. Classifier B is used to evaluate the quality of the learned speaker embeddings.
Figure 2: The architecture of speaker embedding extractor A.
Figure 2 and Table 1 show the architecture of A. In order to learn high-quality and robust speaker embeddings, A is designed based on the TDNN architecture, as the TDNN architecture shows high robustness and can better capture time-relevant information [26]. The architecture of A has three parts: a frame-level feature extractor, statistics pooling, and a segment-level feature extractor.
Figure 3: The diagram of step two. C is the speaker embedding de-mixing network. B is the fixed speaker embedding classifier that was trained in step one.
Table 1: Architecture of the speaker embedding network A
In the frame-level feature extractor, the network consists of TDNN layers and residual TDNN blocks. The input data is first passed through two TDNN layers. Then, three residual TDNN blocks are used. The last TDNN layer transforms the feature dimension to 1500. Using residual TDNN blocks instead of the plain TDNN layers used in x-vectors might increase the robustness of the learned embeddings [27].
A statistics pooling operation is then applied, and its output is fed into the segment-level feature extractor. The segment-level feature extractor has two fully-connected layers. The speaker embedding is extracted from the last fully-connected layer.
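The pooling step is what turns a variable-length utterance into a fixed-length vector. The following is a minimal sketch, assuming the mean-and-standard-deviation form of statistics pooling used by x-vectors [26]; the function name, the population form of the standard deviation, and the toy two-dimensional frames are illustrative assumptions and are not taken from this work.

```python
import math


def statistics_pooling(frames):
    """Pool T frame-level vectors of dimension D into one vector of length 2*D.

    The result is the per-dimension mean followed by the per-dimension
    standard deviation (population form, an assumption of this sketch).
    """
    if not frames:
        raise ValueError("need at least one frame")
    num_frames = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / num_frames)
        for d in range(dim)
    ]
    return means + stds


# Two toy "utterances" of different lengths (2 and 4 frames, D = 2).
short_utt = [[1.0, 2.0], [3.0, 6.0]]
long_utt = [[1.0, 2.0], [3.0, 6.0], [1.0, 2.0], [3.0, 6.0]]

pooled_short = statistics_pooling(short_utt)
pooled_long = statistics_pooling(long_utt)
print(len(pooled_short), len(pooled_long))  # 4 4
```

Both inputs yield a vector of the same length regardless of the number of frames, which is the property the segment-level layers rely on.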
For classifier B, a simple architecture is chosen: a fully-connected network with one hidden layer of 512 nodes.
2.2. Step Two: De-mixing of Speaker Representation in Embedding Space
After the high-quality embeddings of each speaker have been collected in step one, step two learns the de-mixing function for the mixture embeddings.
Figure 3 shows the diagram of step two. Suppose the input data contains two speakers. In step one, high-quality embeddings of both speakers are learned and obtained. Given the input mixture data, C first transforms it into embedding space, resulting in a mixture embedding. Then, a de-mixing function is learned to remove the information of one of the speakers and retain that of the other.
Figure 4: Model architecture of the speaker embedding de-mixing network C. C consists of the pre-trained speaker embedding extractor and the de-mixing function.
Classifier B is used to evaluate the quality of the de-mixed embedding. Note that the parameters of B are trained once in step one; in step two they are kept fixed.
More specifically, Figure 4 illustrates the architecture of the de-mixing network C. C contains two parts. The first part is the speaker embedding extractor pre-trained in step one, whose goal is to project the input mixture data into embedding space. Its output is the mixture embedding of the two speakers. Then, the mixture embedding and the clean embedding e2 of one of the speakers (trained and collected in step one) are input to the second part, a de-mixing function f (Eq. 1). The output is the estimated embedding of the other speaker.
A reconstruction loss L (Eq. 2) is applied between the estimated embedding and the corresponding clean embedding collected in step one. In this work, the mean absolute error [28] is used as the reconstruction loss.
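As a worked illustration of the reconstruction loss, the sketch below computes the mean absolute error between an estimated embedding and a clean one. The function name and the three-dimensional toy vectors are assumptions made for the example; in this work the embeddings come from networks A and C.

```python
def mean_absolute_error(estimated, reference):
    """Average of the element-wise absolute differences of two vectors."""
    if len(estimated) != len(reference):
        raise ValueError("embeddings must have the same dimension")
    if not reference:
        raise ValueError("embeddings must not be empty")
    total = sum(abs(e - r) for e, r in zip(estimated, reference))
    return total / len(reference)


# Toy values: absolute differences are 0.5, 0.0 and 2.0, so the loss is 2.5 / 3.
estimated_emb = [0.5, -1.0, 2.0]
clean_emb = [1.0, -1.0, 0.0]
loss = mean_absolute_error(estimated_emb, clean_emb)
print(round(loss, 4))  # 0.8333
```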
2.3. The architecture of the de-mixing function f
The de-mixing function f can take different forms. In this work, six possible methods are investigated. Figure 5 illustrates them: (a) Subtraction; (b) Multiplication; (c) Concatenation with one fully-connected layer; (d) Concatenation with two fully-connected layers; (e) Shared fully-connected layer with concatenation; and (f) Separated fully-connected layer with concatenation.
2.3.1. Subtraction
The first method is a subtraction operation between the mixture embedding and the known clean embedding (Equation 3 and Figure 5 (a)). After subtraction, the resulting embedding vector is passed through a fully-connected layer without an activation function, which can be viewed as a linear transformation. This method is referred to as “Sub” hereafter. The parameters of the fully-connected layer are learned during training.
2.3.2. Multiplication
The multiplication approach (referred to as “Mul” hereafter) is similar to the “Sub” method. The only difference is that the mixture embedding is multiplied element-wise with the known clean embedding instead of having it subtracted. Figure 5 (b) and Equation 4 show the architecture of the “Mul” method.
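The two element-wise variants can be sketched as follows. This is an illustrative sketch only: the order of the subtraction (mixture minus known embedding), the presence of a bias term, and the identity weights used in the demonstration are assumptions, and all function names are hypothetical.

```python
def linear(vector, weights, bias):
    """Fully-connected layer without activation: weights @ vector + bias."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


def demix_sub(mixture_emb, known_emb, weights, bias):
    """'Sub' variant: subtract the known embedding, then apply a linear layer."""
    diff = [m - k for m, k in zip(mixture_emb, known_emb)]
    return linear(diff, weights, bias)


def demix_mul(mixture_emb, known_emb, weights, bias):
    """'Mul' variant: multiply element-wise, then apply a linear layer."""
    prod = [m * k for m, k in zip(mixture_emb, known_emb)]
    return linear(prod, weights, bias)


# Toy 3-dimensional embeddings; identity weights make the linear layer a no-op
# so that the effect of each element-wise operation is visible.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
zero_bias = [0.0, 0.0, 0.0]
mixture = [1.0, 2.0, 3.0]
known = [0.5, 0.5, 0.5]

sub_out = demix_sub(mixture, known, identity, zero_bias)
mul_out = demix_mul(mixture, known, identity, zero_bias)
print(sub_out)  # [0.5, 1.5, 2.5]
print(mul_out)  # [0.5, 1.0, 1.5]
```

In training, the weights and bias would be learned by minimising the reconstruction loss rather than fixed to the identity.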
2.3.3. Concatenate with one fully-connected layer
In the third method, the mixture embedding and the known clean embedding are first concatenated and then fed through a fully-connected layer (Equation 5 and Figure 5 (c)). This method is referred to as “Concat1” hereafter. The fully-connected layer applies a matrix multiplication with its learned parameters to the concatenated vector.
2.3.4. Concatenate with two fully-connected layers
The next method is concatenation with two fully-connected layers. As in the previous method, the two embeddings are first concatenated, but they are then fed into two fully-connected layers instead of one (Equation 6 and Figure 5 (d)). The first fully-connected layer uses a ReLU activation function, while there is no activation function after the second layer.
This method is referred to as “Concat2” hereafter. Each of the two fully-connected layers has its own parameters.
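The two concatenation-based variants can be sketched as below. The sketch is illustrative: batch normalisation is omitted, the weights are hand-picked toy values rather than learned parameters, and the function names are hypothetical.

```python
def linear(vector, weights, bias):
    """Fully-connected layer without activation: weights @ vector + bias."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


def relu(vector):
    return [max(0.0, x) for x in vector]


def demix_concat1(mixture_emb, known_emb, layer):
    """'Concat1' variant: concatenate, then one linear layer (2D -> D)."""
    weights, bias = layer
    return linear(mixture_emb + known_emb, weights, bias)


def demix_concat2(mixture_emb, known_emb, hidden_layer, output_layer):
    """'Concat2' variant: concatenate, linear + ReLU, then a second linear layer."""
    hidden = relu(linear(mixture_emb + known_emb, *hidden_layer))
    return linear(hidden, *output_layer)


# Toy setting with embedding dimension D = 2, so the concatenation has 4 entries.
mixture = [1.0, -2.0]
known = [3.0, 4.0]
project = ([[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]], [0.0, 0.0])
identity4 = (
    [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]],
    [0.0, 0.0, 0.0, 0.0],
)

concat1_out = demix_concat1(mixture, known, project)
concat2_out = demix_concat2(mixture, known, identity4, project)
print(concat1_out)  # [-2.0, -6.0]
print(concat2_out)  # [-2.0, -4.0]
```

The two outputs differ only because the ReLU in the hidden layer clips the negative entry of the concatenated vector before the second layer is applied.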
2.3.5. Shared Fully-Connected Layer with Concatenation
The last two methods differ from the methods above. In the fifth method, the mixture embedding and the known clean embedding are first input to two fully-connected layers, one each, and the two layers share parameters. The outputs are then concatenated and fed into another fully-connected layer (Equation 7 and Figure 5 (e)). This method is referred to as “Share-Concat” hereafter. The parameters of the fully-connected layers are learned during training.
Figure 5: Different architectures of the de-mixing function f: (a) Subtraction; (b) Multiplication; (c) Concatenation with one fully-connected layer; (d) Concatenation with two fully-connected layers; (e) Shared fully-connected layer with concatenation; and (f) Separated fully-connected layer with concatenation.
2.3.6. Separated Fully-Connected Layer with Concatenation
The last method is similar to the “Share-Concat” method. The mixture embedding and the known clean embedding are first input to two fully-connected layers, one each, but the two layers are separate, which means they do not share parameters. The outputs are concatenated and input to another fully-connected layer (Equation 8 and Figure 5 (f)). This method is referred to as “Separate-Concat” hereafter. The parameters of the fully-connected layers are learned during training.
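The difference between the shared and the separated variant comes down to whether the two branches use the same parameters. The sketch below makes this explicit with a single two-branch function; the ReLU after each branch, the omission of batch normalisation, the toy weights, and all names are assumptions of this illustration.

```python
def linear(vector, weights, bias):
    """Fully-connected layer without activation: weights @ vector + bias."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


def relu(vector):
    return [max(0.0, x) for x in vector]


def demix_two_branch(mixture_emb, known_emb, mix_branch, known_branch, output_layer):
    """Pass each embedding through its own branch, concatenate, then project.

    Passing the same (weights, bias) pair for both branches gives the shared
    variant; passing two different pairs gives the separated variant.
    """
    h_mix = relu(linear(mixture_emb, *mix_branch))
    h_known = relu(linear(known_emb, *known_branch))
    return linear(h_mix + h_known, *output_layer)


# Toy setting with embedding dimension D = 2.
mixture = [1.0, -1.0]
known = [2.0, 3.0]
branch_a = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
branch_b = ([[-1.0, 0.0], [0.0, -1.0]], [0.0, 0.0])
output = ([[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]], [0.0, 0.0])

shared_out = demix_two_branch(mixture, known, branch_a, branch_a, output)
separate_out = demix_two_branch(mixture, known, branch_a, branch_b, output)
print(shared_out)    # [-1.0, -3.0]
print(separate_out)  # [1.0, 0.0]
```

Separate branches roughly double the number of branch parameters, which lets the network treat the mixture embedding and the known embedding differently before they are combined.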
3.1. Data
In this work, the TIMIT corpus [29] is used. The TIMIT corpus of read speech is designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. It includes a 16-bit, 16 kHz speech waveform file for each utterance. There are a total of 6300 utterances: 10 sentences spoken by each of 630 speakers from 8 major dialect regions of the United States. 70% of the speakers are male and 30% are female. As two utterances of each speaker have the same word transcriptions across speakers, they are excluded in our work to reduce possible bias, leaving 8 utterances per speaker. In this paper, the training and test sets are re-split: six utterances from each speaker are randomly selected for training and the remaining two utterances are used for testing. Hence there are 3780 utterances in the training set and 1260 utterances in the test set.
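The split sizes follow directly from the per-speaker counts. The short sketch below reproduces the arithmetic and shows one way a per-speaker random split could be drawn; the utterance identifiers, the seed, and the helper name are hypothetical.

```python
import random

NUM_SPEAKERS = 630
UTTERANCES_PER_SPEAKER = 10
EXCLUDED_PER_SPEAKER = 2  # the two sentences with shared transcriptions
TRAIN_PER_SPEAKER = 6

kept_per_speaker = UTTERANCES_PER_SPEAKER - EXCLUDED_PER_SPEAKER
test_per_speaker = kept_per_speaker - TRAIN_PER_SPEAKER
num_train = NUM_SPEAKERS * TRAIN_PER_SPEAKER
num_test = NUM_SPEAKERS * test_per_speaker


def split_utterances(utterance_ids, num_train_utts, rng):
    """Randomly split one speaker's utterances into train and test lists."""
    shuffled = list(utterance_ids)
    rng.shuffle(shuffled)
    return shuffled[:num_train_utts], shuffled[num_train_utts:]


rng = random.Random(0)  # fixed seed so the illustration is repeatable
speaker_utts = [f"utt{i}" for i in range(kept_per_speaker)]
train_ids, test_ids = split_utterances(speaker_utts, TRAIN_PER_SPEAKER, rng)

print(num_train, num_test)  # 3780 1260
```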
In order to evaluate performance in real-world conditions, the multi-channel Wall Street Journal audio-visual corpus (MC-WSJ) [30] is also used in this work. MC-WSJ contains a total of 40 speakers reading WSJ sentences in three scenarios. Single speaker stationary: a single speaker reads sentences from six positions in a meeting room. Single speaker moving: a single speaker moves between six positions while reading sentences. Overlapping speakers: two speakers read sentences from different positions. No speaker appears in more than one of these three conditions.
In this work, the overlapping-speakers recordings are used. In this scenario, there are 9 speaker pairs made up of 10 unique speakers. Each speaker pair has about 700 utterances on average. There are three different recording techniques: two microphone arrays, and lapel and headset microphones worn by all of the speakers. For all of the experiments in this work, 20-dimensional MFCC features are used [26].
Figure 6: The training process for the six different speaker de-mixing methods. (a) Sub; (b) Mul; (c) Concat1; (d) Concat2; (e) Share-Concat; (f) Separate-Concat. The x-axis represents the number of epochs and the y-axis the mean absolute error.
Table 2: Speaker identification accuracy using the estimated embedding of the target speaker. “Before” denotes speaker identification directly using the mixture embedding. “Clean” denotes speaker identification using the embedding extracted from clean speech.
3.2. Experiment Setup
For the TIMIT experiments, in step one, the speaker embeddings are learned using the clean TIMIT training set. After model A is trained, 200 segments are randomly sampled for each speaker and fed into A. The clean embedding of a speaker is the average of the embeddings of all segments belonging to that speaker. B is trained using the same training data as A.
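The collection of a clean speaker embedding can be sketched as random cropping followed by averaging. In the sketch below, `toy_extractor` is a hypothetical stand-in for the trained network A, and the segment length of 10 frames is an assumption, since the segment length is not specified here.

```python
import random


def sample_segments(frames, num_segments, segment_len, rng):
    """Randomly crop fixed-length windows of frames (crops may overlap)."""
    max_start = len(frames) - segment_len
    if max_start < 0:
        raise ValueError("not enough frames for one segment")
    starts = [rng.randint(0, max_start) for _ in range(num_segments)]
    return [frames[s:s + segment_len] for s in starts]


def average_embedding(embeddings):
    """Element-wise mean of a list of equally sized embedding vectors."""
    count = len(embeddings)
    dim = len(embeddings[0])
    return [sum(e[d] for e in embeddings) / count for d in range(dim)]


def toy_extractor(segment):
    """Hypothetical stand-in for network A: the mean frame of the segment."""
    return average_embedding(segment)


rng = random.Random(0)
# 50 toy frames of dimension 2 standing in for one speaker's features.
speaker_frames = [[float(t), 1.0] for t in range(50)]
segments = sample_segments(speaker_frames, num_segments=200, segment_len=10, rng=rng)
speaker_embedding = average_embedding([toy_extractor(s) for s in segments])
print(len(segments), len(speaker_embedding))  # 200 2
```

Averaging over many segments reduces the influence of any single segment's phonetic content on the resulting speaker embedding.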
In step two, as the TIMIT data contains clean speech only, mixture speech signals are generated by adding to each TIMIT utterance a randomly chosen utterance from a different speaker. The speaker of the original utterance is viewed as the target speaker, and the speaker of the added utterance is the interfering speaker. The target and interfering speech are mixed at a given signal-to-noise ratio (SNR). Training utterances are only mixed with training utterances, and test utterances only with test utterances. This avoids bias, since the separation model C never has access to any utterance from the test set during training.
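Mixing at a given SNR amounts to rescaling the interfering signal before adding it to the target. The following is a minimal sketch under stated assumptions: the signals are plain lists of samples, the longer signal is truncated to the length of the shorter one, and the sinusoids are toy stand-ins for speech. None of these details are specified in this work.

```python
import math


def power(signal):
    """Mean squared amplitude of a signal."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(target, interferer, snr_db):
    """Scale the interferer so the target-to-interferer power ratio is snr_db."""
    n = min(len(target), len(interferer))  # truncation is an assumption
    target, interferer = target[:n], interferer[:n]
    p_target, p_interf = power(target), power(interferer)
    if p_target == 0.0 or p_interf == 0.0:
        raise ValueError("signals must have non-zero power")
    gain = math.sqrt(p_target / (p_interf * 10 ** (snr_db / 10)))
    scaled = [gain * x for x in interferer]
    mixture = [t + s for t, s in zip(target, scaled)]
    return mixture, target, scaled


def measure_snr_db(target, interferer):
    return 10 * math.log10(power(target) / power(interferer))


target_sig = [math.sin(0.1 * i) for i in range(1600)]
interf_sig = [math.sin(0.37 * i + 1.0) for i in range(2000)]

mixture, used_target, scaled_interf = mix_at_snr(target_sig, interf_sig, snr_db=-5.0)
measured = measure_snr_db(used_target, scaled_interf)
power_ratio = power(scaled_interf) / power(used_target)
print(round(measured, 6), round(power_ratio, 3))  # -5.0 3.162
```

At an SNR of -5 dB the interfering speaker therefore carries about 3.16 times the power of the target speaker, while at 0 dB the two have equal power.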
The TIMIT experiment has two parts. The first uses the embedding of the interfering speaker to obtain that of the target speaker. The second uses the embedding of the target speaker to obtain the embedding of the interfering speaker.
For the MC-WSJ experiments, in step one, the speaker embeddings are learned using the headset recordings from the overlapping-speakers scenario. The headset microphones are close to the corresponding speakers; as a result, these recordings have a quality close to that of a clean signal [30]. The same technique as for TIMIT is used to generate and collect the embedding of each speaker and to train classifier B.
In step two, the model C is trained and tested on speech recorded by two microphones (microphone1 and microphone2). For each speaker pair, 70 utterances are randomly selected as test utterances. Speaker identification accuracies are computed on this test set.
3.3. Implementation
In this work, the dimension of every fully-connected layer is set to 512. Each layer is followed by a batch normalisation layer [31], except for the embedding layer. ReLU activation [32] is used for each layer except for the embedding layer. The Adam optimiser [33] is used in training, with β1 set to 0.95, β2 to 0.999, and . The initial learning rate is
Table 3: Speaker identification accuracy using the estimated embedding of the interfering speaker. “Before” denotes speaker identification directly using the mixture embedding. “Clean” denotes speaker identification using the embedding extracted from clean speech.
Table 4: The speaker identification results on the MC-WSJ dataset.
Table 2 shows the results of using the embedding of the interfering speaker to estimate that of the target speaker. It lists the speaker identification results of all six speaker de-mixing functions f at different SNR levels. In this scenario, the embedding of the interfering speaker is input to C, and the speaker identification accuracy is computed by classifying the estimated embedding of the target speaker.
Compared with not using f (directly classifying the mixture embeddings), most of the architectures of f obtain better speaker identification performance. This shows that the speaker de-mixing process removes some of the influence of the interfering speaker's information. The “Separate-Concat” method obtains the best performance at SNRs of 0 dB and 5 dB, close to the results on clean speech. Even when the SNR is -5 dB (the interfering speaker has higher power than the target speaker), the “Separate-Concat” method still reaches 82.5% test accuracy. Figure 6 shows the training process of C for all six methods at an SNR of 0 dB. As Figure 6 (f) shows, the “Separate-Concat” method obtains the lowest reconstruction loss and converges faster.
Among all six de-mixing methods, “Sub” obtains the best performance when the SNR is -5 dB. The “Sub” method uses only a simple subtraction operation, yet it reaches 86.2% when the SNR is -5 dB (the power of the interfering speaker is higher than that of the target speaker). This shows that a simple mathematical operation followed by a linear transformation can be applied to speaker embeddings to filter out some of the information of the interfering speaker.
The “Mul” method uses another mathematical operation (multiplication), and its performance is still close to that of clean speech. This shows that multiplication is an alternative mathematical operation that can be used in the speaker embedding space.
The “Concat1”, “Concat2” and “Share-Concat” methods obtain lower results. The lower performance of “Concat1” and “Concat2” might be because directly concatenating the embeddings makes it harder for the model C to distinguish the different speaker properties. The low performance of “Share-Concat” might have the same cause.
Table 3 shows the results of using the embedding of the target speaker to reconstruct the embedding of the interfering speaker. All six methods show lower but close performance compared with using the embedding of the interfering speaker to reconstruct that of the target speaker. This shows that the “Share-Concat” and “Sub” methods are also able to obtain a high-quality embedding of the interfering speaker in a two-speaker environment.
Table 4 shows the experimental results for microphone1 (M1) and microphone2 (M2) in the MC-WSJ dataset. The “Share-Concat” method obtains the best results, reaching 93.9% and 90.9% test accuracy on microphone1 and microphone2, respectively. The results on microphone2 might be lower than those on microphone1 because of the distance between the speakers and the microphones: microphone1 is closer to the speakers, while microphone2 is farther from them [30].
Compared with the results on the headset recordings, which reach 99.1% test accuracy, the results obtained by the “Separate-Concat” method still show a gap. The reason might be that, in real-world conditions, the two speakers are moving, so the SNR between the target speaker and the interfering speaker might differ over time. This might make it more difficult for the model to de-mix the embeddings of the two speakers.
In conclusion, a speaker embedding de-mixing approach is proposed in this work. The proposed approach reconstructs the embedding of the target speaker from the embedding of the interfering speaker and the mixture embedding or, conversely, obtains the embedding of the interfering speaker from that of the target speaker and the mixture embedding. The quality of the embeddings is evaluated by speaker identification on the reconstructed embeddings. Results on TIMIT (artificially augmented two-speaker signals) and MC-WSJ (real-world two-speaker signals) show that, among the six de-mixing architectures, the “Share-Concat” method obtains better results, which are close to the results on clean speech.
In future work, more speaker mixture scenarios will be investigated, such as three-speaker mixtures. Different model architectures might be investigated, and larger datasets such as VoxCeleb1 and VoxCeleb2 might be used.
Funding for this research is provided by the Huawei Innovation Research Program (HIRP): X/159898-11.
[1] AS Bregman, “Auditory scene analysis: The perceptual organization of sound,” Cambridge, MA, US, 1990.
[2] Kevin W Wilson, Bhiksha Raj, Paris Smaragdis, and Ajay Divakaran, “Speech denoising using nonnegative matrix factorization with priors,” in 2008 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2008, pp. 4029–4032.
[3] Bhiksha Raj, Tuomas Virtanen, Sourish Chaudhuri, and Rita Singh, “Non-negative matrix factorization based compensation of music for automatic speech recognition,” in Eleventh Annual Conference of the International Speech Communication Association, 2010.
[4] Hiroshi Saruwatari, Satoshi Kurita, Kazuya Takeda, Fumitada Itakura, Tsuyoki Nishikawa, and Kiyohiro Shikano, “Blind source separation combining independent component analysis and beamforming,” EURASIP Journal on Advances in Signal Processing, vol. 2003, no. 11, pp. 569270, 2003.
[5] Michael A Casey and Alex Westner, “Separation of mixed audio sources by independent subspace analysis.,” in ICMC, 2000, pp. 154–161.
[6] Jen-Tzung Chien and Bo-Cheng Chen, “A new independent component analysis for speech recognition and separation,” IEEE transactions on audio, speech, and language processing, vol. 14, no. 4, pp. 1245–1254, 2006.
[7] Soo-Young Lee, “Blind source separation and independent component analysis: A review,” Neural Information Processing-Letters and Reviews, vol. 6, no. 1, pp. 1–57, 2005.
[8] Shoji Makino, Hiroshi Sawada, and Shoko Araki, “Frequency-domain blind source separation,” in Blind Speech Separation, pp. 47–78. Springer, 2007.
[9] Po-Sen Huang, Minje Kim, Mark Hasegawa-Johnson, and Paris Smaragdis, “Deep learning for monaural speech separation,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 1562–1566.
[10] Jun Du, Yanhui Tu, Yong Xu, Lirong Dai, and Chin-Hui Lee, “Speech separation of a target speaker based on deep neural networks,” in 2014 12th International Conference on Signal Processing (ICSP). IEEE, 2014, pp. 473–477.
[11] Yanhui Tu, Jun Du, Yong Xu, Lirong Dai, and Chin-Hui Lee, “Speech separation based on improved deep neural networks with dual outputs of speech features for both target and interfering speakers,” in The 9th International Symposium on Chinese Spoken Language Processing. IEEE, 2014, pp. 250–254.
[12] Jun Du, Yanhui Tu, Li-Rong Dai, and Chin-Hui Lee, “A regression approach to single-channel speech separation via high-resolution deep neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 8, pp. 1424–1437, 2016.
[13] John R Hershey, Zhuo Chen, Jonathan Le Roux, and Shinji Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 31–35.
[14] Zhuo Chen, Yi Luo, and Nima Mesgarani, “Deep attractor network for single-microphone speaker separation,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 246–250.
[15] Yi Luo, Zhuo Chen, and Nima Mesgarani, “Speaker-independent speech separation with deep attractor network,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 4, pp. 787–796, 2018.
[16] Yu-An Chung and James Glass, “Speech2vec: A sequence-to-sequence framework for learning word embeddings from speech,” arXiv preprint arXiv:1803.08976, 2018.
[17] David Snyder, Daniel Garcia-Romero, Gregory Sell, Alan McCree, Daniel Povey, and Sanjeev Khudanpur, “Speaker recognition for multi-speaker conversations using x-vectors,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5796–5800.
[18] Herman Kamper, Weiran Wang, and Karen Livescu, “Deep convolutional acoustic word embeddings using word-pair side information,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 4950–4954.
[19] David Snyder, Pegah Ghahremani, Daniel Povey, Daniel Garcia-Romero, Yishay Carmiel, and Sanjeev Khudanpur, “Deep neural network-based speaker embeddings for end-to-end speaker verification,” in 2016 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2016, pp. 165–170.
[20] Koji Okabe, Takafumi Koshinaka, and Koichi Shinoda, “Attentive statistics pooling for deep speaker embedding,” arXiv:1803.10963, 2018.
[21] Junzo Watada et al., “Speech recognition in a multi-speaker environment by using hidden markov model and mel-frequency approach,” in 2016 Third International Conference on Computing Measurement Control and Sensor Network (CMCSN). IEEE, 2016, pp. 80–83.
[22] Shruti Palaskar, Vikas Raunak, and Florian Metze, “Learned in speech recognition: Contextual acoustic word embeddings,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6530–6534.
[23] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman, “Voxceleb2: Deep speaker recognition,” arXiv preprint arXiv:1806.05622, 2018.
[24] Weidi Xie, Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, “Utterance-level aggregation for speaker recognition in the wild,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5791–5795.
[25] Pavel Denisov and Ngoc Thang Vu, “End-to-end multi-speaker speech recognition using speaker embeddings and transfer learning,” arXiv preprint arXiv:1908.04737, 2019.
[26] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, “X-vectors: Robust dnn embeddings for speaker recognition,” in ICASSP. IEEE, 2018.
[27] Hossein Zeinali, Shuai Wang, Anna Silnova, Pavel Matějka, and Oldřich Plchot, “BUT system description to VoxCeleb speaker recognition challenge 2019,” arXiv preprint arXiv:1910.12592, 2019.
[28] Cort J Willmott and Kenji Matsuura, “Advantages of the mean absolute error (mae) over the root mean square error (rmse) in assessing average model performance,” Climate research, vol. 30, no. 1, pp. 79–82, 2005.
[29] John S Garofolo, Lori F Lamel, William M Fisher, Jonathan G Fiscus, and David S Pallett, “Darpa timit acoustic-phonetic continous speech corpus cd-rom. nist speech disc 1-1.1,” NASA STI/Recon technical report n, 1993.
[30] Mike Lincoln, Iain McCowan, Jithendra Vepa, and Hari Krishna Maganti, “The multi-channel wall street journal audio visual corpus (mc-wsj-av): Specification and initial experiments,” in IEEE Workshop on Automatic Speech Recognition and Understanding, 2005. IEEE, 2005, pp. 357–362.
[31] Sergey Ioffe and Christian Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv:1502.03167, 2015.
[32] Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus, “Regularization of neural networks using dropconnect,” in International conference on machine learning, 2013, pp. 1058–1066.
[33] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980, 2014.