The earliest successful approach to speaker recognition modeled the training data with Gaussian mixture models (GMMs), followed by adaptation using the maximum a posteriori (MAP) rule [1]. The i-vectors were subsequently introduced as fixed-dimensional front-end features for speaker recognition tasks [2, 3]. Recently, neural network embeddings trained on a speaker discrimination task have been derived as features to replace the i-vectors. These features, called x-vectors [4], were shown to perform better than the i-vectors for speaker recognition [5].
Following the extraction of x-vectors/i-vectors, different pre-processing steps are employed to transform the embeddings. The common steps include linear discriminant analysis (LDA) [3], unit length normalization [6] and within-class covariance normalization (WCCN) [7]. The transformed vectors are modeled with probabilistic linear discriminant analysis (PLDA) [8]. The PLDA model is used to compute a log likelihood ratio from a pair of enrollment and test embeddings which is used to verify whether the given trial is a target or non-target.
In this paper, we propose a neural back-end model which jointly performs pre-processing and scoring. It operates on pairs of x-vector embeddings (a pair of enrollment and test x-vectors), and outputs a score that allows the decision of target versus non-target hypotheses. The implementation using neural layers allows the entire model to be learnt using a speaker verification cost. Conventional cost functions like binary cross entropy tend to overfit the model to the training speakers, leading to poor performance on evaluation sets. In an attempt to avoid this, we use the NIST SRE normalized detection cost [9] to optimize the neural back-end model. With several experiments on the NIST SRE 2018 development and evaluation datasets, we show that the proposed approach improves significantly over the state-of-the-art x-vector based PLDA system.
The rest of the paper is organized as follows. In Section 2, we highlight relevant prior work on discriminative back-ends for speaker verification. Section 3 describes the front-end configurations used for feature processing and x-vector extraction. Section 4 describes the proposed neural network architecture and its connection with the generative PLDA model. In Section 5, we present a smooth approximation to the NIST SRE detection cost function, and discuss regularization methods. This is followed by a discussion of results in Section 6 and a brief set of concluding remarks in Section 7.
The common approaches for scoring in speaker verification systems include support vector machines (SVMs) [10], the Gaussian back-end model [11, 12] and probabilistic linear discriminant analysis (PLDA) [8]. Some efforts on pairwise generative and discriminative modeling are discussed in [13–15]. A discriminative version of PLDA with logistic regression and SVM kernels has also been explored in [16]. In that work, the authors use the functional form of the generative model and pool all the parameters to be trained into a single long vector. These parameters are then discriminatively trained using the SVM loss function with pairs of input vectors. The discriminative PLDA (DPLDA) is, however, prone to over-fitting on the training speakers and leads to degradation on unseen speakers in SRE evaluations [17]. The regularization of the embedding extractor network using a Gaussian back-end scoring has been investigated in [18].
Recently, end-to-end approaches to speaker verification have also been examined. For example, in [19], i-vector extraction and PLDA scoring are jointly implemented in a deep neural network architecture, and the entire model is trained using a binary cross entropy criterion. The use of triplet loss in end-to-end speaker recognition has shown promise for short utterances [20]. Wan et al. [21] proposed a generalized end-to-end loss inspired by minimizing the centroid mean of within speaker distances while maximizing across speaker distances. However, in spite of these efforts, most of the successful systems for SRE evaluations continue to use the generative PLDA back-end model.
In this paper, we argue that the major issue of over-fitting in discriminative back-end systems arises from the choice of the model and loss function. In the detection cost metrics for SRE, false-alarm errors carry more weight than miss errors. Thus, incorporating the SRE evaluation metric directly in the optimization avoids the over-fitting problem. Further, by training multiple pre-processing steps along with the scoring module, the model learns to generate representations that are better optimized for the speaker verification task.
In this section, we provide the description of the front-end feature extraction and x-vector model configuration.
3.1. Training
The x-vector extractor is trained entirely using speech data extracted from the combined VoxCeleb 1 [22] and VoxCeleb 2 [23] corpora. These datasets contain speech extracted from celebrity interview videos available on YouTube, spanning a wide range of different ethnicities, accents, professions, and ages. For training the x-vector extractor, we use 1,276,888 segments from 7323 speakers selected from VoxCeleb 1 (dev and test) and VoxCeleb 2 (dev).
This x-vector extractor was trained using 23-dimensional Mel-Frequency Cepstral Coefficients (MFCCs) from 25 ms frames shifted every 10 ms using a 23-channel mel-scale filterbank spanning the frequency range 20 Hz - 3700 Hz. A 5-fold augmentation strategy is used that adds four corrupted copies of the original recordings to the training list [4, 5]. The augmentation step generates 6,384,440 training segments for the combined VoxCeleb set.
3.2. The x-vector extractor
For x-vector extraction, an extended TDNN with 12 hidden layers and rectified linear unit (ReLU) non-linearities is trained to discriminate among the nearly 7000 speakers in the training set [5]. The first 10 hidden layers operate at frame-level, while the last 2 layers operate at segment-level. There is a 1500-dimensional statistics pooling layer between the frame-level and segment-level layers that accumulates all frame-level outputs using mean and standard deviation. After training, embeddings are extracted from the 512-dimensional affine component of the 11th layer (i.e., the first segment-level layer). More details regarding the DNN architecture and the training process can be found in [5].
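The statistics pooling step described above can be illustrated with a minimal sketch. It assumes the population (not sample) standard deviation and uses toy two-dimensional frame outputs; the function name `statistics_pooling` is illustrative and is not taken from [5].

```python
import math

def statistics_pooling(frames):
    """Pool a variable-length list of frame vectors into one fixed-size vector.

    The output is the per-dimension mean followed by the per-dimension
    standard deviation, so it has twice the frame dimension.
    """
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    # Population variance (divide by n); this choice is an assumption.
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dim)]
    std = [math.sqrt(v) for v in var]
    return mean + std

# Three toy frames with two features each (hypothetical values).
frames = [[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]]
pooled = statistics_pooling(frames)
```

Whatever the number of frames, the pooled vector has a fixed size, which is what lets the segment-level layers operate on recordings of arbitrary duration.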
Following the x-vector extraction, the embeddings are centered (mean removed), transformed using LDA and unit length normalized. The PLDA model on the processed x-vector for a given recording is,

$$\eta_r = \Phi \omega + \epsilon_r \tag{1}$$

where $\eta_r$ is the x-vector for the given recording, $\omega$ is the latent speaker factor with a prior of $\mathcal{N}(0, I)$, $\Phi$ characterizes the speaker sub-space matrix and $\epsilon_r$ is the residual assumed to have distribution $\mathcal{N}(0, \Sigma)$.

For scoring, a pair of x-vectors, one from the enrollment recording $\eta_e$ and one from the test recording $\eta_t$, are used with the pretrained PLDA model to compute the log-likelihood ratio score as,

$$s(\eta_e, \eta_t) = \eta_e^{\top} Q\, \eta_e + \eta_t^{\top} Q\, \eta_t + 2\, \eta_e^{\top} P\, \eta_t + \text{const} \tag{2}$$

where,

$$Q = \Sigma_{tot}^{-1} - \left(\Sigma_{tot} - \Sigma_{ac} \Sigma_{tot}^{-1} \Sigma_{ac}\right)^{-1} \tag{3}$$

$$P = \Sigma_{tot}^{-1} \Sigma_{ac} \left(\Sigma_{tot} - \Sigma_{ac} \Sigma_{tot}^{-1} \Sigma_{ac}\right)^{-1} \tag{4}$$

with $\Sigma_{tot} = \Phi\Phi^{\top} + \Sigma$ and $\Sigma_{ac} = \Phi\Phi^{\top}$. In the proposed pairwise discriminative network (Neural PLDA) (Fig. 1), we implement the LDA pre-processing step as the first affine layer, unit-length normalization as a non-linear activation, and PLDA centering and diagonalization as another affine transformation. The final PLDA pairwise scoring given in Eq. 2 is implemented as a quadratic layer in Fig. 1. Thus, the Neural PLDA implements the pre-processing of the x-vectors and the PLDA scoring as a neural back-end. The model parameters of the Neural PLDA can be initialized with the baseline system and then learned through backpropagation.

Fig. 1. Neural PLDA Net Architecture: The two inputs are the enrollment and test x-vectors which constitute a trial.
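The data flow of the Neural PLDA (affine layer, unit-length normalization, second affine layer, quadratic scoring layer) can be sketched in plain Python as below. This is only an illustration under stated assumptions: the parameter names (`lda_W`, `plda_W`, `P`, `Q`, and so on) and the toy values are hypothetical, the additive constant of the score is omitted, and a trainable version would be written in a framework with automatic differentiation.

```python
import math

def affine(W, b, x):
    """Return W x + b for a matrix W (list of rows), bias b and vector x."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]

def unit_length(x):
    """Scale x to unit Euclidean norm (the non-linear activation)."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]

def bilinear(a, M, b):
    """Return the scalar a^T M b."""
    return sum(ai * sum(m * bj for m, bj in zip(row, b)) for ai, row in zip(a, M))

def preprocess(x, p):
    h = affine(p["lda_W"], p["lda_b"], x)        # first affine layer (LDA)
    h = unit_length(h)                           # unit-length normalization
    return affine(p["plda_W"], p["plda_b"], h)   # centering and diagonalization

def neural_plda_score(x_enroll, x_test, p):
    """Quadratic scoring layer; the additive constant is omitted."""
    e, t = preprocess(x_enroll, p), preprocess(x_test, p)
    return (bilinear(e, p["Q"], e) + bilinear(t, p["Q"], t)
            + 2.0 * bilinear(e, p["P"], t))

# Toy parameters (hypothetical values): 3-dim input, 2-dim LDA output.
params = {
    "lda_W": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], "lda_b": [0.0, 0.0],
    "plda_W": [[1.0, 0.0], [0.0, 1.0]], "plda_b": [0.0, 0.0],
    "P": [[0.5, 0.0], [0.0, 0.5]], "Q": [[-0.1, 0.0], [0.0, -0.1]],
}
same_direction = neural_plda_score([1.0, 0.0, 3.0], [2.0, 0.0, -1.0], params)
orthogonal = neural_plda_score([1.0, 0.0, 3.0], [0.0, 1.0, 0.0], params)
```

With these toy parameters, a pair of inputs that map to the same unit vector scores higher than a pair that maps to orthogonal unit vectors, which is the behavior expected of a verification score.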
To train the Neural PLDA for the task of speaker verification, we need to sample pairs of x-vectors representing the target (same speaker) and non-target (different speakers) hypotheses. We train the model using the trials from previous NIST SRE evaluation sets along with randomly sampled target and non-target pairs which are matched by source and gender. The following error functions can be used in the Neural PLDA.
5.1. Binary Cross Entropy
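This is the standard objective for a two-class classification task,

$$\mathcal{L}_{BCE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ t_i \log \sigma(s_i) + (1 - t_i) \log\left(1 - \sigma(s_i)\right) \right] \tag{5}$$

where $s_i$ is the score for the $i$-th trial, $t_i$ is the binary target for the trial, $\sigma$ is the sigmoid function and $N$ is the number of trials.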
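Using this loss alone for training may result in over-fitting. Hence, a regularization term can be used by regressing to raw PLDA scores generated from Kaldi. The regularized cross-entropy loss is given as:

$$\mathcal{L}_{Reg} = \mathcal{L}_{BCE} + \frac{\lambda}{N} \sum_{i=1}^{N} \left( s_i - s_i^{PLDA} \right)^2 \tag{6}$$

where $\mathcal{L}_{BCE}$ is the binary cross entropy loss, $s_i$ is the Neural PLDA score and $s_i^{PLDA}$ the raw Kaldi PLDA score for the $i$-th trial, $N$ is the number of trials and $\lambda$ is the regularization weight. The second term discourages the Neural PLDA scores from deviating drastically from the generative PLDA scores.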
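A minimal sketch of the two losses of this subsection follows. It assumes that the regularizer is a mean squared error between the Neural PLDA scores and the Kaldi PLDA scores, weighted by a hypothetical factor `lam`; the exact form and weight used in the experiments are not restated here.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def bce_loss(scores, targets, eps=1e-12):
    """Mean binary cross entropy of sigmoid(score) against 0/1 targets."""
    total = 0.0
    for s, t in zip(scores, targets):
        p = sigmoid(s)
        total -= t * math.log(p + eps) + (1 - t) * math.log(1.0 - p + eps)
    return total / len(scores)

def regularized_bce_loss(scores, targets, plda_scores, lam):
    """BCE plus a squared-error penalty that ties scores to the PLDA scores.

    The squared-error form and the weight `lam` are assumptions of this sketch.
    """
    penalty = sum((s - r) ** 2 for s, r in zip(scores, plda_scores)) / len(scores)
    return bce_loss(scores, targets) + lam * penalty

# Toy trials (hypothetical values): one target and two non-targets.
scores = [2.0, -1.0, 0.5]
targets = [1, 0, 0]
plda_scores = [1.5, -0.5, 0.0]
plain = bce_loss(scores, targets)
regularized = regularized_bce_loss(scores, targets, plda_scores, lam=0.5)
```

The penalty vanishes when the network reproduces the PLDA scores exactly, so a network initialized from the generative model starts with no regularization cost.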
5.2. Soft Detection Cost
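The NIST SRE 2018 normalized detection cost metric [9] is defined as:

$$C_{Norm}(\beta, \theta) = P_{Miss}(\theta) + \beta\, P_{FA}(\theta) \tag{7}$$

where $P_{Miss}(\theta)$ and $P_{FA}(\theta)$ are the probabilities of miss and false alarm computed by applying a detection threshold of $\theta$ to the scores, and $\beta$ is an application-dependent weight:

$$P_{Miss}(\theta) = \frac{\sum_{i=1}^{N} t_i\, \mathbb{1}(s_i < \theta)}{\sum_{i=1}^{N} t_i} \tag{8}$$

$$P_{FA}(\theta) = \frac{\sum_{i=1}^{N} (1 - t_i)\, \mathbb{1}(s_i \geq \theta)}{\sum_{i=1}^{N} (1 - t_i)} \tag{9}$$

Here, $s_i$ and $t_i$ are the score and the binary target of the $i$-th trial, and $\mathbb{1}$ is the indicator function. The normalized detection cost function (Eq. 7) is not a smooth function of the parameters due to the step discontinuity induced by the indicator function $\mathbb{1}$, and hence, it cannot be used as an objective function in a neural network. We propose a differentiable approximation of the normalized detection cost by approximating the indicator function with a sigmoid function:

$$P_{Miss}^{soft}(\theta) = \frac{\sum_{i=1}^{N} t_i \left[ 1 - \sigma\left(\alpha (s_i - \theta)\right) \right]}{\sum_{i=1}^{N} t_i} \tag{10}$$

$$P_{FA}^{soft}(\theta) = \frac{\sum_{i=1}^{N} (1 - t_i)\, \sigma\left(\alpha (s_i - \theta)\right)}{\sum_{i=1}^{N} (1 - t_i)} \tag{11}$$

By choosing a large enough value for the warping factor $\alpha$, the approximation can be made arbitrarily close to the actual detection cost function for a wide range of thresholds.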
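The effect of the sigmoid approximation can be checked numerically with the sketch below. It compares the hard miss and false-alarm rates with their soft counterparts, in which the indicator is replaced by a sigmoid of the scaled margin between score and threshold. The scale factor is called `alpha` here, and the scores and the two `alpha` values are hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def hard_rates(scores, targets, theta):
    """Miss and false-alarm rates with a hard decision at threshold theta."""
    n_tar = sum(targets)
    n_non = len(targets) - n_tar
    misses = sum(1 for s, t in zip(scores, targets) if t == 1 and s < theta)
    false_alarms = sum(1 for s, t in zip(scores, targets) if t == 0 and s >= theta)
    return misses / n_tar, false_alarms / n_non

def soft_rates(scores, targets, theta, alpha):
    """Differentiable rates: the indicator becomes sigmoid(alpha * (s - theta))."""
    n_tar = sum(targets)
    n_non = len(targets) - n_tar
    p_miss = sum(1.0 - sigmoid(alpha * (s - theta))
                 for s, t in zip(scores, targets) if t == 1) / n_tar
    p_fa = sum(sigmoid(alpha * (s - theta))
               for s, t in zip(scores, targets) if t == 0) / n_non
    return p_miss, p_fa

# Toy trials (hypothetical values): three targets followed by three non-targets.
scores = [2.0, 0.5, -1.0, 1.5, -2.0, -0.5]
targets = [1, 1, 1, 0, 0, 0]
hard = hard_rates(scores, targets, theta=0.0)
soft_sharp = soft_rates(scores, targets, theta=0.0, alpha=50.0)
soft_smooth = soft_rates(scores, targets, theta=0.0, alpha=1.0)
```

With a large `alpha` the soft rates are numerically indistinguishable from the hard rates on this toy set, while a small `alpha` gives a smoother but visibly biased estimate. The sharper setting also produces smaller gradients for trials far from the threshold, which is the usual trade-off of this kind of approximation.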
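The primary cost metric of the NIST SRE 2018 for the Conversational Telephone Speech (CTS) is given by

$$C_{Primary} = \frac{1}{2} \left[ C_{Norm}(\beta_1, \theta_1) + C_{Norm}(\beta_2, \theta_2) \right] \tag{12}$$

where $\beta_1 = 99$ and $\beta_2 = 199$. We compute the Neural PLDA loss function as

$$\mathcal{L}_{Primary} = \frac{1}{2} \left[ C_{Norm}^{soft}(\beta_1, \theta_1) + C_{Norm}^{soft}(\beta_2, \theta_2) \right] \tag{13}$$

where $C_{Norm}^{soft}$ denotes the normalized detection cost computed with the sigmoid approximation of the indicator function, and $\theta_1$ and $\theta_2$ are the thresholds which minimize $\mathcal{L}_{Primary}$. The minimum detection cost is achieved at the thresholds where $C_{Primary}$ is minimized. In other words, it is the best cost that can be achieved through calibration of the scores. We include these thresholds in the set of parameters that the neural network learns to minimize $\mathcal{L}_{Primary}$ through backpropagation. Finally, we compute an affine calibration transform using the SRE 2018 development set.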
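The sketch below illustrates the two quantities discussed above: the soft primary loss evaluated at a pair of thresholds, and the minimum hard cost found by sweeping the threshold over the observed scores. The weights 99 and 199 are the values implied by the SRE 2018 CTS target priors of 0.01 and 0.005 with equal miss and false-alarm costs. The function names, the toy scores and the value of `alpha` are assumptions of this example, and the brute-force sweep stands in for the gradient-based threshold learning described in the text.

```python
import math

# Weights of the two CTS operating points (target priors 0.01 and 0.005,
# equal miss and false-alarm costs).
BETAS = (99.0, 199.0)

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def hard_norm_cost(scores, targets, theta, beta):
    """Normalized detection cost with hard decisions at threshold theta."""
    n_tar = sum(targets)
    n_non = len(targets) - n_tar
    misses = sum(1 for s, t in zip(scores, targets) if t == 1 and s < theta)
    false_alarms = sum(1 for s, t in zip(scores, targets) if t == 0 and s >= theta)
    return misses / n_tar + beta * false_alarms / n_non

def soft_norm_cost(scores, targets, theta, beta, alpha):
    """Sigmoid-smoothed version of the normalized detection cost."""
    n_tar = sum(targets)
    n_non = len(targets) - n_tar
    p_miss = sum(1.0 - sigmoid(alpha * (s - theta))
                 for s, t in zip(scores, targets) if t == 1) / n_tar
    p_fa = sum(sigmoid(alpha * (s - theta))
               for s, t in zip(scores, targets) if t == 0) / n_non
    return p_miss + beta * p_fa

def primary_soft_loss(scores, targets, thetas, alpha):
    """Average of the soft costs at the two operating points."""
    return 0.5 * sum(soft_norm_cost(scores, targets, th, b, alpha)
                     for th, b in zip(thetas, BETAS))

def min_primary_cost(scores, targets):
    """Best hard primary cost over all thresholds (one sweep per weight)."""
    # The hard cost only changes at score values, so these candidates suffice.
    candidates = sorted(scores) + [max(scores) + 1.0]
    return 0.5 * sum(min(hard_norm_cost(scores, targets, th, b) for th in candidates)
                     for b in BETAS)

# Toy trials (hypothetical values): three targets followed by three non-targets.
scores = [3.0, 2.0, 0.5, 1.0, -1.0, -2.0]
targets = [1, 1, 1, 0, 0, 0]
soft_loss = primary_soft_loss(scores, targets, thetas=(1.5, 1.5), alpha=50.0)
best_cost = min_primary_cost(scores, targets)
```

On this toy set a single false alarm costs far more than a miss because of the large weights, so the best thresholds sit above every non-target score and accept one missed target.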
We perform several experiments with the proposed neural net architecture and compare them with various discriminative back-ends previously proposed in the literature such as the discriminative PLDA [16] and pairwise Gaussian back-end [13]. We also compare the performance with the baseline system using the Kaldi recipe that implements the generative PLDA model based scoring.
For all the pairwise generative/discriminative models, we train the back-end using the trials sampled from previous NIST SRE evaluation sets along with randomly sampled target and non-target pairs which are matched by source and gender. We use about 5.3 million trials for this training, sampled from NIST SRE 04-10 as well as the NIST SRE16 trials. We also sample training data from the Mixer-6 and Switchboard 1&2 corpora. The evaluation of the models is performed on the telephone conditions (CMN2) and the video conditions (VAST) of the NIST SRE 2018 challenge.
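One way to draw target and non-target pairs that are matched by source and gender is sketched below. The record fields (`utt`, `speaker`, `gender`, `source`), the function name and the 50/50 target fraction are hypothetical choices made for illustration; the proportions used for the experiments are not restated here.

```python
import random

def sample_matched_trials(records, num_trials, target_fraction=0.5, seed=0):
    """Sample (enroll, test, label) trials whose two sides share source and gender.

    records: list of dicts with keys 'utt', 'speaker', 'gender', 'source'.
    label is 1 for a target trial (same speaker) and 0 for a non-target trial.
    """
    rng = random.Random(seed)
    pools = {}  # (source, gender) -> speaker -> list of utterance ids
    for r in records:
        by_spk = pools.setdefault((r["source"], r["gender"]), {})
        by_spk.setdefault(r["speaker"], []).append(r["utt"])
    # Keep only pools that can supply both a target and a non-target pair.
    usable = [p for p in pools.values()
              if len(p) >= 2 and any(len(u) >= 2 for u in p.values())]
    if not usable:
        raise ValueError("no (source, gender) pool can supply both trial types")
    trials = []
    for _ in range(num_trials):
        by_spk = rng.choice(usable)
        if rng.random() < target_fraction:
            # Target trial: two different utterances of one speaker.
            spk = rng.choice(sorted(s for s, u in by_spk.items() if len(u) >= 2))
            enroll, test = rng.sample(by_spk[spk], 2)
            trials.append((enroll, test, 1))
        else:
            # Non-target trial: one utterance each from two different speakers.
            spk_e, spk_t = rng.sample(sorted(by_spk), 2)
            trials.append((rng.choice(by_spk[spk_e]), rng.choice(by_spk[spk_t]), 0))
    return trials

# Toy metadata (hypothetical): 2 sources x 2 genders x 3 speakers x 2 utterances.
records = [
    {"utt": f"{src}-{g}-spk{k}-utt{u}", "speaker": f"{src}-{g}-spk{k}",
     "gender": g, "source": src}
    for src in ("sre", "swbd") for g in ("f", "m") for k in range(3) for u in range(2)
]
trials = sample_matched_trials(records, num_trials=200)
```

Restricting both sides of a trial to the same source and gender keeps the non-target pairs from being trivially separable by channel or gender cues.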
6.1. Kaldi PLDA Baseline
The primary baseline for benchmarking our systems is the PLDA back-end implementation in the Kaldi toolkit. The Kaldi implementation models the average embedding x-vector of each training speaker. The x-vectors are centered, reduced in dimensionality using LDA, and then unit length normalized. Among the LDA dimensions we tried, the best performance on the SRE 2018 development set was achieved with an LDA dimension of 170. The linear transformations and the Kaldi PLDA matrices are used to initialize the proposed pairwise PLDA network.
6.2. Discriminative PLDA (DPLDA)
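In [16], an expanded vector representing a trial of enrollment and test x-vectors $(\eta_e, \eta_t)$ was computed using a quadratic kernel as follows:

$$\varphi(\eta_e, \eta_t) = \begin{bmatrix} \mathrm{vec}(\eta_e \eta_t^{\top} + \eta_t \eta_e^{\top}) \\ \mathrm{vec}(\eta_e \eta_e^{\top} + \eta_t \eta_t^{\top}) \\ \eta_e + \eta_t \\ 1 \end{bmatrix}$$

The PLDA log likelihood ratio score can be written as the dot product of a weight vector $w$ and the expanded vector $\varphi(\eta_e, \eta_t)$:

$$s = w^{\top} \varphi(\eta_e, \eta_t)$$

We implemented the DPLDA in PyTorch by expanding the centered, LDA transformed and length normalized x-vectors from the Kaldi baseline. Once the weight vector $w$ is trained, the score on the test trials is computed as the inner product of the weight vector with the quadratic kernel.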
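The equivalence between the quadratic PLDA score and a dot product with an expanded trial vector can be verified with a small sketch. The expansion stacks the flattened cross term, the flattened within term, the sum of the two embeddings and a constant 1. The parameter names `Lam`, `Gam`, `c` and `k` for the blocks of the weight vector are illustrative, and the matrices are toy values rather than trained parameters.

```python
def outer(a, b):
    """Outer product a b^T as a list of rows."""
    return [[ai * bj for bj in b] for ai in a]

def mat_add(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

def vec(M):
    """Row-major flattening; it only has to match between phi and w."""
    return [v for row in M for v in row]

def expand_trial(e, t):
    """Quadratic expansion of a trial (enrollment e, test t)."""
    cross = mat_add(outer(e, t), outer(t, e))
    within = mat_add(outer(e, e), outer(t, t))
    return vec(cross) + vec(within) + [a + b for a, b in zip(e, t)] + [1.0]

def pack_weights(Lam, Gam, c, k):
    """Stack the score parameters in the same order as expand_trial."""
    return vec(Lam) + vec(Gam) + list(c) + [k]

def dplda_score(w, e, t):
    return sum(wi * pi for wi, pi in zip(w, expand_trial(e, t)))

def quadratic_score(P, Q, e, t):
    """e^T Q e + t^T Q t + 2 e^T P t, computed directly for comparison."""
    bil = lambda a, M, b: sum(ai * sum(m * bj for m, bj in zip(row, b))
                              for ai, row in zip(a, M))
    return bil(e, Q, e) + bil(t, Q, t) + 2.0 * bil(e, P, t)

# Toy symmetric matrices and embeddings (hypothetical values).
P = [[0.5, 0.1], [0.1, 0.3]]
Q = [[-0.2, 0.0], [0.0, -0.1]]
e, t = [1.0, 2.0], [3.0, -1.0]
w = pack_weights(P, Q, c=[0.0, 0.0], k=0.0)
expanded = dplda_score(w, e, t)
direct = quadratic_score(P, Q, e, t)
```

Because the cross-term matrix is symmetric, weighting the symmetrized outer product by it reproduces twice the bilinear term, so the two scores agree. Training the weight vector directly, as in DPLDA, frees every entry of these blocks, which is the source of its larger number of degrees of freedom.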
6.3. Pairwise Gaussian Back-end (GB)
The Pairwise Gaussian Back-end [15, 24] models the pairs of enrollment and test x-vectors, $\eta_p = [\eta_e^{\top}\ \eta_t^{\top}]^{\top}$. The x-vector pairs are modeled using a Gaussian distribution with parameters $(\mu_{tar}, \Sigma_{tar})$ for target trials, while the non-target pairs are modeled by a Gaussian distribution with parameters $(\mu_{non}, \Sigma_{non})$. These parameters are estimated by computing the sample mean and covariance matrices of the target and non-target trials in the training data. The log-likelihood ratio (LLR) for a new trial is then obtained as:

$$s(\eta_p) = \log \mathcal{N}(\eta_p;\, \mu_{tar}, \Sigma_{tar}) - \log \mathcal{N}(\eta_p;\, \mu_{non}, \Sigma_{non})$$
The Gaussian Back-end is also trained on the same pairs of target and non-target x-vector trials, after centering, LDA and length normalization.
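A minimal sketch of the pairwise Gaussian back-end follows: the enrollment and test embeddings are stacked, one Gaussian is fitted to the target pairs and one to the non-target pairs, and the score is the difference of the two log densities. The Cholesky-based density evaluation, the unbiased covariance estimate and the toy one-dimensional embeddings are assumptions of this example and are not taken from [15, 24].

```python
import math

def cholesky(A):
    """Lower-triangular L with L L^T = A, for a symmetric positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def gaussian_logpdf(x, mean, cov):
    """Log density of a multivariate Gaussian, evaluated via Cholesky."""
    L = cholesky(cov)
    d = [xi - mi for xi, mi in zip(x, mean)]
    z = []  # solves L z = d by forward substitution
    for i in range(len(d)):
        z.append((d[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    logdet = 2.0 * sum(math.log(L[i][i]) for i in range(len(d)))
    return -0.5 * (sum(v * v for v in z) + logdet + len(d) * math.log(2.0 * math.pi))

def fit_gaussian(pairs):
    """Sample mean and unbiased sample covariance of stacked trial vectors."""
    n, dim = len(pairs), len(pairs[0])
    mean = [sum(p[a] for p in pairs) / n for a in range(dim)]
    cov = [[sum((p[a] - mean[a]) * (p[b] - mean[b]) for p in pairs) / (n - 1)
            for b in range(dim)] for a in range(dim)]
    return mean, cov

def gb_llr(e, t, target_model, nontarget_model):
    """Log-likelihood ratio of the stacked pair under the two Gaussians."""
    x = list(e) + list(t)
    return gaussian_logpdf(x, *target_model) - gaussian_logpdf(x, *nontarget_model)

# Toy models for 1-dim embeddings (hypothetical values): target pairs are
# strongly correlated, non-target pairs are uncorrelated.
target_model = ([0.0, 0.0], [[1.0, 0.9], [0.9, 1.0]])
nontarget_model = ([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
llr_similar = gb_llr([1.0], [1.0], target_model, nontarget_model)
llr_opposite = gb_llr([1.0], [-1.0], target_model, nontarget_model)
```

In this toy setting a pair of similar embeddings receives a positive LLR and a pair of opposite embeddings a strongly negative one, reflecting the correlation that the target model captures between the two sides of a trial.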
Table 1. Summary of results of various back-end models on CMN2 and VAST datasets reported on the SRE 2018 development and evaluation datasets.
6.4. Neural PLDA
We perform various experiments using the Neural PLDA architecture with different initialization methods and loss functions. We also study the effect of the batch size, the learning rate and the choice of loss function on the optimization. The optimal parameter choices were based on the SRE 2018 development set.
For the binary cross entropy (BCE) loss function and the soft detection cost functions, we need to apply the sigmoid function on the scores at different thresholds. In this work, we also parameterize the threshold value and let the network learn the threshold value to minimize the loss.
The soft detection cost function is highly sensitive to small changes in false alarm probability. Hence, all experiments were conducted with large batch sizes of 4096/8192. The learning rate was halved each time the validation loss increased twice in a row.
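The learning rate rule described above can be sketched as a small scheduler. This example assumes that "increased twice in a row" means two consecutive validation checks that are each worse than the one before, and that the counter resets after a halving. The class name and the initial value of 1e-3 are placeholders, not the settings used in the experiments.

```python
class HalveOnRepeatedIncrease:
    """Halve the learning rate after the validation loss rises twice in a row."""

    def __init__(self, initial_lr):
        self.lr = initial_lr
        self.prev_loss = None
        self.consecutive_increases = 0

    def step(self, val_loss):
        """Record one validation loss and return the learning rate to use next."""
        if self.prev_loss is not None and val_loss > self.prev_loss:
            self.consecutive_increases += 1
        else:
            self.consecutive_increases = 0
        self.prev_loss = val_loss
        if self.consecutive_increases == 2:
            self.lr *= 0.5
            self.consecutive_increases = 0  # assumption: start counting afresh
        return self.lr

# Placeholder initial learning rate and a toy sequence of validation losses.
scheduler = HalveOnRepeatedIncrease(initial_lr=1e-3)
val_losses = [1.0, 0.9, 0.95, 1.0, 0.8, 0.85, 0.9]
learning_rates = [scheduler.step(loss) for loss in val_losses]
```

In this toy run the rate is halved after the fourth and the seventh validation checks, each of which is the second consecutive increase.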
6.5. Discussion of Results
The performance of the various back-end systems is reported in Table 1. The PLDA baseline generalizes well across the development and evaluation sets for both the CMN2 and VAST sources. The Discriminative PLDA (DPLDA) performs well on the VAST set, but it fails to generalize on the CMN2 conditions. The Pairwise GB model also performs better than Kaldi's PLDA baseline on the VAST dataset, which is in line with what was observed previously in [24]. The Neural PLDA model with random initialization of all parameters performed significantly better than the DPLDA model on the development set, and marginally better on the evaluation set. We hypothesize that this results from the network architecture, which has fewer parameters and hence fewer degrees of freedom than the DPLDA model, leading to better generalization. When the parameters are initialized with the Kaldi PLDA back-end parameters, the discriminative training further improves the performance on the development set. The soft detection cost function helps further reduce the detection cost and generalizes much better than the cross-entropy loss alone. We observe significant relative improvements over the PLDA baseline of 10% and 38% in terms of the detection cost on the SRE 2018 development set for the CMN2 and VAST conditions, respectively. On the SRE 2018 evaluation set, the proposed approach yields relative improvements of 7% and 23% for the CMN2 and VAST conditions.
This paper presents a step in the direction of exploring discriminative models for the task of speaker verification. Discriminative models allow the construction of end-to-end systems. However, discriminative models tend to overfit to the training data. In our proposed model, we constrain the parameter set to have fewer degrees of freedom in order to achieve better generalization. We also propose a task-specific differentiable loss function which approximates the NIST SRE 2018 detection cost.
It is important to note that unlike cross entropy loss, the NIST SRE detection cost gives significantly more importance to the false alarms. We also find that initializing the proposed neural PLDA model using generative model parameters allows the model to improve over the baseline system performance.
We observe considerable improvements and better generalization with our proposed approach. We could attribute this to the choice of architecture as well as the choice of loss functions.
The authors would like to thank the Ministry of Human Resources Development (MHRD) of India and the Department of Science and Technology (DST) for their support. We would also like to thank Bhargavram Mysore and Anand Mohan for the valuable discussions and their help during the SRE 2018 and 2019 Evaluation.
[1] Douglas A Reynolds, Thomas F Quatieri, and Robert B Dunn, “Speaker verification using adapted gaussian mixture models,” Digital signal processing, vol. 10, no. 1-3, pp. 19–41, 2000.
[2] Patrick Kenny, Gilles Boulianne, Pierre Ouellet, and Pierre Dumouchel, “Joint factor analysis versus eigenchannels in speaker recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1435–1447, 2007.
[3] Najim Dehak, Patrick J Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
[4] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, “X-vectors: Robust dnn embeddings for speaker recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5329–5333.
[5] Mitchell McLaren, Diego Castán, Mahesh Kumar Nandwana, Luciana Ferrer, and Emre Yilmaz, “How to train your speaker embeddings extractor,” in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 327–334.
[6] Daniel Garcia-Romero and Carol Y Espy-Wilson, “Analysis of i-vector length normalization in speaker recognition systems,” in Twelfth Annual Conference of the International Speech Communication Association, 2011.
[7] Andrew O Hatch, Sachin Kajarekar, and Andreas Stolcke, “Within-class covariance normalization for svm-based speaker recognition,” in Ninth international conference on spoken language processing, 2006.
[8] Patrick Kenny, “Bayesian speaker verification with heavy-tailed priors,” in Odyssey, 2010, pp. 14–21.
[9] Omid Sadjadi, “NIST 2018 Speaker Recognition Evaluation Plan,” https://www.nist.gov/sites/default/files/documents/2018/08/17/sre18_eval_plan_2018-05-31_v6.pdf.
[10] William M Campbell, Douglas E Sturim, and Douglas A Reynolds, “Support vector machines using GMM supervectors for speaker verification,” IEEE Signal Process. Lett., vol. 13, no. 5, pp. 308–311, 2006.
[11] Mitchell McLaren, Aaron Lawson, Yun Lei, and Nicolas Scheffer, “Adaptive Gaussian backend for robust language identification,” in INTERSPEECH, 2013, pp. 84–88.
[12] Mohamed Faouzi BenZeghiba, Jean-Luc Gauvain, and Lori Lamel, “Language score calibration using adapted gaussian back-end,” in Tenth Annual Conference of the International Speech Communication Association, 2009.
[13] Sandro Cumani, Niko Brümmer, Lukáš Burget, Pietro Laface, Oldřich Plchot, and Vasileios Vasilakakis, “Pairwise discriminative speaker verification in the i-vector space,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 6, pp. 1217–1227, 2013.
[14] Sandro Cumani and Pietro Laface, “Large-scale training of pairwise support vector machines for speaker recognition,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 22, no. 11, pp. 1590–1600, 2014.
[15] Sandro Cumani and Pietro Laface, “Generative pairwise models for speaker recognition,” in Odyssey, 2014, pp. 273–279.
[16] Lukáš Burget, Oldřich Plchot, Sandro Cumani, Ondřej Glembek, Pavel Matějka, and Niko Brümmer, “Discriminatively trained probabilistic linear discriminant analysis for speaker verification,” in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2011, pp. 4832–4835.
[17] Jesús Villalba, Nanxin Chen, David Snyder, Daniel Garcia-Romero, Alan McCree, Gregory Sell, Jonas Borgstrom, Leibny Paola García-Perera, Fred Richardson, Réda Dehak, et al., “State-of-the-art speaker recognition with neural network embeddings in NIST SRE18 and speakers in the wild evaluations,” Computer Speech & Language, p. 101026, 2019.
[18] Luciana Ferrer and Mitchell McLaren, “Optimizing a speaker embedding extractor through backend-driven regularization,” Proc. Interspeech 2019, pp. 4350–4354, 2019.
[19] Johan Rohdin, Anna Silnova, Mireia Diez, Oldřich Plchot, Pavel Matějka, and Lukáš Burget, “End-to-end DNN based speaker recognition inspired by i-vector and PLDA,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4874–4878.
[20] Chunlei Zhang and Kazuhito Koishida, “End-to-end text-independent speaker verification with triplet loss on short utterances,” in Interspeech, 2017, pp. 1487–1491.
[21] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno, “Generalized end-to-end loss for speaker verification,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4879–4883.
[22] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, “Voxceleb: a large-scale speaker identification dataset,” arXiv preprint arXiv:1706.08612, 2017.
[23] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman, “Voxceleb2: Deep speaker recognition,” in Proc. Interspeech 2018, 2018, pp. 1086–1090.
[24] Shreyas Ramoji, Anand Mohan, Bhargavram Mysore, Anmol Bhatia, Prachi Singh, Harsha Vardhan, and Sriram Ganapathy, “The LEAP speaker recognition system for NIST SRE 2018 challenge,” in Proc. of ICASSP, 2019, pp. 5771–5775.