Vector Quantization (VQ) is a classical quantization technique that models a probability density function through the distribution of a set of prototype vectors. VQ-based systems first cluster the original data and then use the center of each cluster to represent it. Over the past decades, VQ has been popular in speaker recognition because of its fast processing speed and good recognition performance. Although the Gaussian Mixture Model (GMM) [1] and i-vector [2] approaches have shown better performance on speaker-related tasks in recent years, VQ-based systems remain more suitable for applications where only a small amount of training data is available. Matsui et al. [3] compared a VQ-based system with Hidden Markov Model (HMM) based speaker identification systems and showed that the VQ-based system is more robust than the HMM-based systems when only a small amount of training data is available.
Unfortunately, in real-world applications, background noise, channel mismatch and many other factors can cause feature mismatch, which heavily degrades the performance of VQ-based speaker verification. To address this problem, speech enhancement [4, 5] and model adaptation approaches [6, 7, 8] have been widely used. Since the same noise may distort different frames differently, Song et al. [9] adopted several noise constraints to design a simple framework, noise invariant frame selection (NIFS), that selects noise-invariant frames from the original audio signal. Their experimental results show that this framework improves speaker verification performance for different speaker models under different noisy conditions. For the feature mismatch problem, Random Sample Consensus (RANSAC) [10] has proved to be a very efficient method for eliminating mismatched feature points and has been applied to many image-based tasks, for example image retrieval [11] and object recognition [12]. However, RANSAC relies on the geometric information of matched image feature points, which audio features do not have. Procrustes-based methods can also eliminate mismatched feature points in image matching, and they do not require this geometric information, which makes them suitable for audio-based feature matching tasks. This paper proposes a two-stage iterative Procrustes match (TIPM) approach that aims to remove mismatched feature vectors for VQ-based speaker verification. In the first stage, TIPM removes mismatched feature vector pairs; in the second stage, it recycles pairs that were removed incorrectly.
To evaluate the proposed method, we conducted two experiments on the TIMIT database [13], comparing VQ-based speaker verification performance with and without TIPM (illustrated in Figure 1).
Figure 1. System flow chart
The rest of this paper is structured as follows. The TIPM method is explained in detail in Section 2. Experimental results are discussed in Section 3. The last section is devoted to the conclusion.
For VQ-based speaker verification, the vector set F obtained from a speaker can be grouped into several subsets, each containing feature vectors that are similar and may represent the same characteristic of the speaker. Moreover, matching becomes much slower when a large number of feature vectors must be processed. Motivated by this, K-means is used to cluster the feature vectors of each speaker. As a result, the final codebooks of all speakers can be generated, which are defined as
$$C = \{C_1, C_2, \ldots, C_N\}, \qquad C_n = \{c^n_1, c^n_2, \ldots, c^n_{K_n}\}$$
where N is the number of codebooks, $K_n$ is the number of clusters of codebook n, and $c^n_1, \ldots, c^n_{K_n}$ are the vectors of each cluster center of codebook n.
The matching score of the test trial determines whether an utterance is accepted by a speaker verification system. Conventionally, the score is computed with the Euclidean metric or the log-likelihood ratio. Although the NIFS scheme attempts to select noise-robust frames, additive noise may still impose relative distortions on the selected frames, which leads to many incorrect matches. The Procrustes-based feature match used in [14] is one of the typical approaches to solve this problem. It takes the geometric distribution of feature vectors into consideration, which is expected to help differentiate speakers. In this section, a two-stage iterative Procrustes matching algorithm (TIPM) is proposed, which aims to minimize the number of mismatched feature vector pairs.
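The codebook construction described above can be sketched as follows. This is a minimal illustration with plain K-means in NumPy, not the authors' implementation; the function name `build_codebook`, the random initialization and the iteration limit are assumptions made for the example.

```python
import numpy as np

def build_codebook(features, n_clusters, n_iter=50, seed=0):
    """Cluster one speaker's feature vectors with plain K-means.

    features: float array of shape (num_frames, dim).
    Returns the cluster centers, shape (n_clusters, dim).
    """
    rng = np.random.default_rng(seed)
    # Assumed initialization: pick distinct frames at random as centers.
    centers = features[rng.choice(len(features), n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        # Assign every frame to its nearest center (Euclidean distance).
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = centers.copy()
        for k in range(n_clusters):
            members = features[labels == k]
            if len(members) > 0:  # keep the old center if a cluster empties
                new_centers[k] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # assignments no longer change the centers
        centers = new_centers
    return centers
```

Calling `build_codebook` once per enrolled speaker would yield one codebook per speaker; the number of clusters per codebook is a free parameter that the text does not fix.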
Assume that is the codebook of a speaker A and
is the NIFV of a test utterance, which is selected by the first-stage NIFS; a set of initial matched feature vector pairs can then be obtained according to
where is the nearest-neighbor feature vector of
; is a threshold of Euclidean distance and
. Assume that the matrix
is established by the matched feature vectors from the codebook
, while their corresponding feature vectors from the test utterance
established the matrix
, based on the equation (14) and (15) the orthogonal matrix could be constructed.
where and
; S is the number of original matched pairs. The F-norm is minimized by the nearest orthogonal matrix, and
where ΨΥ is the singular value decomposition (SVD) of
. We define
) as the similarity measurement between
and
, which is the least square error of
and
in Procrustes match and the best case is
) would above zero. If a pair of matched vectors
in
and
in
are discarded, the process is called
and
. The vector pair
and
would be discarded from (7) and (8) , if
where is an iteration threshold and s = 1, 2, ..., S. As a result, a new pair of matrices
and
could be obtained. Consequently, a new least square error
) is generated. Afterwards, by repeating the procedure mentioned above, another pair of feature vectors
and
may be removed from
and
. This leave-one-out procedure proceeds iteratively until equations (7) and (8) are no longer satisfied. Finally, n pairs of mismatched vectors are discarded from
and
, which can be denoted as
Then, a new pair of matrices and
are generated. However, some vector pairs may be removed incorrectly in the first stage. To recycle them, the second stage of the Procrustes match is introduced. Suppose that the least square error of
and
is
) . We define that if a pair of points
and
is added from dis(p) to
and
, a new pair of matrices
and
would be generated. Then, by adding each pair of feature vectors in the dis(p) to
and
respectively, a set of
could be obtained. A pair of vectors
and
would be recycled from dis(p) and added to
and
if
where is an iteration threshold and i = 1, 2, ..., n. Afterwards, by repeating the procedure mentioned above, a new pair of feature vectors
and
may be recycled from dis(p) and added to
and
. This add-one-in procedure proceeds iteratively until equations (9) and (10) are no longer satisfied. Consequently, the final matched matrices
and
are obtained, which contain the final matched vector pairs. Figure 2(a) illustrates the relationship among the different kinds of matched pairs.
It shows that the initial matched pairs are made up of the mismatched pairs and the final matched pairs. Figure 2 also shows how the matched vector pairs change during TIPM.
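The two stages described above can be sketched in NumPy as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and because the exact threshold conditions are not reproduced here, the sketch assumes that a pair is discarded when leaving it out lowers the Procrustes least-squares error by more than a threshold, and that a discarded pair is recycled when adding it back raises the error by no more than a threshold.

```python
import numpy as np

def initial_matches(codebook, test_vectors, max_dist):
    """Pair each test vector with its nearest codebook vector if close enough."""
    dists = np.linalg.norm(test_vectors[:, None, :] - codebook[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    ok = dists[np.arange(len(test_vectors)), nearest] < max_dist
    # Row s of the first matrix is matched to row s of the second.
    return codebook[nearest[ok]], test_vectors[ok]

def procrustes_error(A, B):
    """Least-squares error min_Q ||A Q - B||_F over orthogonal matrices Q."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    Q = U @ Vt  # nearest orthogonal matrix, from the SVD of A^T B
    return np.linalg.norm(A @ Q - B)

def stage_one(A, B, threshold):
    """Leave-one-out removal; returns (kept indices, discarded indices)."""
    keep, discarded = list(range(len(A))), []
    while len(keep) > 2:
        err = procrustes_error(A[keep], B[keep])
        # Error obtained when each remaining pair is left out in turn.
        loo = []
        for s in keep:
            rest = [k for k in keep if k != s]
            loo.append(procrustes_error(A[rest], B[rest]))
        best = int(np.argmin(loo))
        # ASSUMED criterion: stop once the best removal no longer
        # reduces the error by more than the threshold.
        if err - loo[best] <= threshold:
            break
        discarded.append(keep.pop(best))
    return keep, discarded

def stage_two(A, B, keep, discarded, threshold):
    """Add-one-in recycling of pairs discarded in the first stage."""
    keep, discarded = list(keep), list(discarded)
    while discarded:
        err = procrustes_error(A[keep], B[keep])
        # Error obtained when each discarded pair is added back in turn.
        added = [procrustes_error(A[keep + [d]], B[keep + [d]]) for d in discarded]
        best = int(np.argmin(added))
        # ASSUMED criterion: stop once even the best candidate
        # increases the error by more than the threshold.
        if added[best] - err > threshold:
            break
        keep.append(discarded.pop(best))
    return keep, discarded
```

The number of indices left in `keep` after `stage_two` corresponds to the number of final matched vector pairs. Recomputing the SVD for every candidate keeps the sketch short; it is not meant to be efficient.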
Figure 2. The principle of the TIPM
Since the number of frames selected from each test utterance may differ, a relative metric is adopted for the VQ-based systems (VQ baseline, NIFS-VQ, NIFS-TIPM-VQ) to measure the similarity between a codebook and a test utterance, given by
$$L(V) = \frac{N_{\mathrm{match}}(V, C_n)}{N_{\mathrm{frame}}(V)} \tag{11}$$
where $N_{\mathrm{match}}(V, C_n)$ is the number of final matched vector pairs between the test utterance V and the codebook $C_n$, and $N_{\mathrm{frame}}(V)$ is the number of selected frames in V. According to equation (11), the more matched vectors the test utterance has with a codebook, the higher the similarity ratio between them. In order to minimize the inconsistency of feature vectors from the same speaker and maximize the inconsistency of feature vectors from different speakers, the similarity score is further normalized by zero normalization according to
$$L_{\mathrm{norm}}(V) = \frac{L(V) - \mu}{\sigma}$$
where L(V) is the original score of the sample V, $\mu$ and $\sigma$ are the estimated impostor parameters (mean and standard deviation of the impostor scores) for the speaker model, and $L_{\mathrm{norm}}(V)$ is the score after zero normalization.
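As a small illustration of the scoring step, the sketch below computes the relative similarity ratio and applies zero normalization. The function names are hypothetical, and estimating the impostor parameters as the mean and standard deviation of a set of impostor scores is the usual reading of zero normalization rather than a detail stated in the text.

```python
import numpy as np

def similarity_ratio(num_matched_pairs, num_selected_frames):
    """Fraction of the selected frames that ended up in a final matched pair."""
    if num_selected_frames <= 0:
        raise ValueError("the test utterance has no selected frames")
    return num_matched_pairs / num_selected_frames

def zero_normalize(raw_score, impostor_scores):
    """Z-norm: shift and scale a score by the impostor score statistics."""
    impostor_scores = np.asarray(impostor_scores, dtype=float)
    mu = impostor_scores.mean()    # assumed impostor mean for this speaker model
    sigma = impostor_scores.std()  # assumed impostor standard deviation
    return (np.asarray(raw_score, dtype=float) - mu) / sigma
```

For example, an utterance with 120 selected frames and 30 final matched pairs has a similarity ratio of 0.25 before normalization.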
3.1 Experiment setup
To evaluate the usefulness of the proposed TIPM, we conducted two experiments on the TIMIT database. The first applies TIPM directly to the feature vectors of each voice sample, while the second applies TIPM to feature vectors that have already been processed by NIFS. In addition to the clean condition, 12 noisy conditions, consisting of four different noises at three different SNRs, were used to test the performance of TIPM under noise. Please see [9] for details of the method for creating these noisy conditions. Performance is measured by the Equal Error Rate (EER) and by Detection Error Tradeoff (DET) curves.
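For readers unfamiliar with the metric, the sketch below shows one common way to estimate the EER from a set of genuine and impostor trial scores. It is an illustrative assumption about the evaluation procedure, not the evaluation code used in the experiments; the function name and the threshold sweep over observed scores are choices made for the example.

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Estimate the EER by sweeping the decision threshold over all scores."""
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    # False rejection: a genuine trial scores below the threshold.
    frr = np.array([(genuine < t).mean() for t in thresholds])
    # False acceptance: an impostor trial scores at or above the threshold.
    far = np.array([(impostor >= t).mean() for t in thresholds])
    # The EER is read off where the two error rates are closest.
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2.0
```

Plotting `far` against `frr` over the same thresholds, on normal-deviate axes, gives the corresponding DET curve.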
3.2 Experiment result
We first apply TIPM to the traditional VQ system. Table 1 compares the speaker verification results of the TIPM-VQ system with those of the traditional VQ system. TIPM improved the performance in the clean environment and in 10 noisy environments; performance dropped in only two noisy conditions, both with an SNR of 15 dB. In particular, when TIPM was applied under the clean condition and the noisy conditions with a high SNR (25 dB), it achieved a promising relative improvement in EER.
Figure 3 and Table 1 present the DET curves and EER results of the NIFS-VQ and NIFS-TIPM-VQ systems. The DET curves show that applying the TIPM algorithm to the NIFS-VQ system further improves the performance under all conditions. In terms of EER, compared with the NIFS-VQ system, the NIFS-TIPM-VQ system achieved lower values in all 12 noisy environments except the Volvo noise environment with an SNR of 25 dB, where the EER remained at 11.88%. Moreover, the NIFS-TIPM-VQ system obtained the best EER, 7.12%, among all three VQ-based systems in the clean environment.
Figure 3. Experiment results
3.3 Result analysis
The feature matching algorithm (TIPM) proposed in this paper changes how many of the original matched feature vector pairs are retained. After the first-stage Procrustes match, the average proportion of matched pairs decreased from 65.2% to 46.3% for the traditional VQ system and from 14.7% to 9.9% for the NIFS-VQ system, because the mismatched feature vector pairs were discarded. After the second stage, the figures increased to 51.2% and 11.5% respectively, which indicates that some feature vector pairs that had been incorrectly discarded in the first stage were recycled. In terms of the speaker verification results, the proposed TIPM algorithm is effective both when combined with a regular speaker model (VQ) and when combined with another preprocessing method (NIFS) and a speaker model.
Table 1. THE EER OF VQ AND TIPM-VQ
In feature matching tasks, incorrect matches in the feature matching step have a negative impact on performance. To address this problem, this paper proposed a two-stage iterative Procrustes match algorithm that discards mismatched feature vector pairs between test data and codebooks in VQ-based systems. Speaker verification was used as a case study to evaluate the algorithm. Two experiments, VQ baseline versus TIPM-VQ and NIFS-VQ versus NIFS-TIPM-VQ, were conducted on a subset of the TIMIT database. In addition to the clean condition, 12 noisy conditions were introduced, made up of four different noises at three different SNRs. The experimental results show that TIPM slightly improves the EER under almost all conditions. In addition, it obtains better results in 10 out of 12 environments when combined with a previous pre-processing method, which indicates that it has the potential to work with other pre-processing or post-processing methods.
[1] Reynolds, D. A. and Rose, R. C., “Robust text-independent speaker identification using Gaussian mixture speaker models,” IEEE Transactions on Speech and Audio Processing 3(1), 72–83 (1995).
[2] Dehak, N., Kenny, P. J., Dehak, R., Dumouchel, P., and Ouellet, P., “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing 19(4), 788–798 (2011).
[3] Matsui, T. and Furui, S., “Comparison of text-independent speaker recognition methods using VQ-distortion and discrete/continuous HMM’s,” IEEE Transactions on Speech and Audio Processing 2(3), 456–459 (1994).
[4] Wong, L. P. and Russell, M., “Text-dependent speaker verification under noisy conditions using parallel model combination,” in [IEEE International Conference on], 1, 457–460, IEEE (2001).
[5] Zhao, X., Wang, Y., and Wang, D., “Robust speaker identification in noisy and reverberant conditions,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 22(4), 836–845 (2014).
[6] Sagayama, S., Yamaguchi, Y., Takahashi, S., and Takahashi, J.-i., “Jacobian approach to fast acoustic model adaptation,” in [Acoustics, Speech, and Signal Processing, 1997. ICASSP-97., 1997 IEEE International Conference on], 2, 835–838, IEEE (1997).
[7] Xue, S., Jiang, H., Dai, L., and Liu, Q., “Speaker adaptation of hybrid NN/HMM model for speech recognition based on singular value decomposition,” Journal of Signal Processing Systems 82(2), 175–185 (2016).
[8] Xue, S., Abdel-Hamid, O., Jiang, H., and Dai, L., “Direct adaptation of hybrid DNN/HMM model for fast speaker adaptation in LVCSR based on speaker code,” in [Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on], 6339–6343, IEEE (2014).
[9] Song, S., Zhang, S., Schuller, B., Shen, L., and Valstar, M., “Noise invariant frame selection: a simple method to address the background noise problem for text-independent speaker verification,” arXiv preprint arXiv:1805.01259 (2018).
[10] Fischler, M. A. and Bolles, R. C., “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” in [Readings in computer vision], 726–740, Elsevier (1987).
[11] Song, S., Xia, S., Teng, Z., and Zhang, S., “A precise and real-time loop-closure detection for SLAM using the RSOM tree,” International Journal of Advanced Robotic Systems 12(6), 73 (2015).
[12] Li, X., Zhao, L., Ji, W., Wu, Y., Wu, F., Yang, M.-H., Tao, D., and Reid, I., “Multi-task structure-aware context modeling for robust keypoint-based object tracking,” IEEE Transactions on Pattern Analysis and Machine Intelligence (2018).
[13] Zue, V., Seneff, S., and Glass, J., “Speech database development at MIT: TIMIT and beyond,” Speech Communication 9(4), 351–356 (1990).
[14] Sheng-ping, L. J.-j. X. and Wen-xian, Y., “A two stage iterative Procrustes matching algorithm based on SIFT feature,” Signal Processing 6, 011 (2010).