As businesses become more international, accent verification and classification have gained more attention recently, probably because of the increasing demand for better recognition of non-native speakers and their accented speech. However, the problem remains very challenging, since there are many types of accents and the response time allowed for accent detection is usually very short. Choueiter et al. achieved an accuracy of 32% in classifying 23 types of accented English [1], using methods from language identification (LID), such as Maximum Mutual Information (MMI) training and Gaussian tokenization. Omar et al. recently integrated a Universal Background Model (UBM) into a Support Vector Machine (SVM) classifier and claimed that it outperformed the results in [1] by 75.3% relative [2]. Another work on German vs. Spanish classification in [3] reported classification rates of 73% and 58.9%, using GMMs and naive Bayes classification respectively. In addition, classification rates of 36.2%, 17.7% and 13.2% were reported in [3] for 4-, 13- and 23-way classification using naive Bayes. To the best of our knowledge, these are the only three works that used the same dataset as this work.
In this work, we first created a baseline accent classifier for 7 selected types of English accents, using the Gaussian Mixture Model-Universal Background Model (GMM-UBM) framework with normalized Perceptual Linear Predictive (PLP) features. The features were then dimension-reduced and discriminatively optimized using Principal Component Analysis (PCA) and Heteroscedastic Linear Discriminant Analysis (HLDA). Since most identifiable accents are reflected in the pronunciation of vowels rather than consonants [4], multiple vowel-specific GMMs were computed with features of the vowel components, extracted either by phoneme alignment (in system development) or by phoneme recognition (in system test). Compared with the baseline using purely acoustic information, the improved 7-way classification system increases accuracy from 42% to 54%, using only up to 20 seconds of speech data.
This work was initiated during the author’s internship at Interactive Intelligence (ININ) [5], and the algorithm and experiments were later refined for better accuracy and efficiency [6]. This paper reviews the major components of the accent classification system, highlighting the recent improvements in feature generation and in the construction of the baseline and improved classifiers. The remainder of the paper is organized as follows: Sec. II introduces the database used in this work and the process of feature optimization and dimension reduction. In Sec. III, the main concept of creating accent-adapted features based on phonetic vowels is demonstrated. Then, the baseline classifier and the improved version with vowel extraction are described in Sec. IV, followed by the results, conclusion and future work in Sec. V.
Preprocessing, such as data and feature preparation, significantly impacts the performance of classification. In this section, we introduce the database used in this work and discuss feature extraction, including normalization and Gaussianization, as well as feature optimization and dimension reduction with PCA and HLDA. The whole process is illustrated in Fig. 1.
[Figure 1 diagram: speech → Silence Removal → Feature Extraction → PLPs → PCA/HLDA → Improved features]
Fig. 1. Process of data and feature preparation
A. Database
The Foreign Accented English (FAE) corpus from the Linguistic Data Consortium (LDC), catalog number LDC2007S08, is used in this work. It is one of the most comprehensive accented English speech databases currently available, containing 4925 sentences in 23 types of accents, with an average duration of 20 seconds.
Accents are divided into 7 major categories based on their relationships, shown in Table I, and one accent from each group is selected for developing a 7-way accent classifier. Table II summarizes some statistics for each of these accent groups: 1) the number of utterances, 2) their proportion in the entire FAE corpus, 3) the total durations before and after silence removal, and 4) the corresponding compression ratio between these two durations.
TABLE I TYPE OF ACCENTS IN FAE CORPUS
TABLE II SUMMARY OF SELECTED ACCENTS IN FAE CORPUS
Since no transcription comes with the speech in FAE, we also transcribed the audio data of these 7 major accents, in order to perform vowel extraction from speech using phoneme alignment later. The selected subset of FAE was randomly divided into training, development and testing sets with a ratio of 70 : 15 : 15.
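The random 70 : 15 : 15 division can be sketched as follows. This is only an illustration: the utterance IDs, the fixed seed and the rounding rule are assumptions, and the text does not say whether the split was drawn per accent or over the pooled data.

```python
import random

def split_dataset(items, ratios=(0.70, 0.15, 0.15), seed=0):
    """Randomly split items into training, development and test lists."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    n_train = round(ratios[0] * len(shuffled))
    n_dev = round(ratios[1] * len(shuffled))
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]       # remainder goes to the test set
    return train, dev, test

# Hypothetical utterance IDs; the real corpus uses its own file names.
utterances = [f"utt_{i:04d}" for i in range(200)]
train, dev, test = split_dataset(utterances)
```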
B. Silence Removal
In practice, only the high signal-to-noise ratio (SNR) regions of the waveform are retained for classification. Therefore, silence removal or so-called voice activity detection (VAD) is often performed before feature extraction. Here we use the method described in [7], which detects the silence by thresholding on the short-time energy rate and spectral centroids of the speech. One can also use either Auditory Toolbox [8] or Voicebox [9] for the same purpose.
Given $x_i(n)$, $n = 1, \dots, N$, as the audio samples in the $i$-th frame, its short-time energy rate $E_i$ can be formulated as
$$E_i = \frac{1}{N} \sum_{n=1}^{N} |x_i(n)|^2 \tag{1}$$
where N is the number of samples in one frame. The spectral centroid $C_i$ can be defined as
$$C_i = \frac{\sum_{k=1}^{N} k \, |X_i(k)|}{\sum_{k=1}^{N} |X_i(k)|} \tag{2}$$
where $X_i(k)$, $k = 1, \dots, N$, are the Discrete Fourier Transform (DFT) coefficients of $x_i(n)$. $E_i$ is used to discriminate silence from environmental noise, while $C_i$ is used to remove non-speech noise, such as coughing, because of its lower energy concentration in the spectrum relative to that of normal human speech.
Fig. 2. Example of silence removal using short-time energy rate and spectral centroids (FAR00035.wav in FAE)
Fig. 2 shows an example of silence removal using both the short-time energy rate and the spectral centroid on the data file FAR00035.wav from the FAE corpus, which has an Arabic (AE) accent. A frame is considered to be silence if either of these 2 measurements is lower than its threshold. As shown in Table II, the total duration of the recordings for each type of accent was reduced after silence removal.
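The thresholding rule above can be sketched in a few lines of NumPy. This is a simplified illustration, not the implementation of [7]: the fixed thresholds, the non-overlapping 50 ms frames, the one-sided spectrum and the synthetic test signal are all assumptions made here.

```python
import numpy as np

def frame_signal(x, frame_len):
    """Split a 1-D signal into non-overlapping frames (trailing samples dropped)."""
    n_frames = len(x) // frame_len
    return x[:n_frames * frame_len].reshape(n_frames, frame_len)

def short_time_energy(frames):
    """Mean squared amplitude of each frame."""
    return np.mean(frames ** 2, axis=1)

def spectral_centroid(frames):
    """Magnitude-weighted mean DFT bin index of each frame (bins counted from 1)."""
    mag = np.abs(np.fft.rfft(frames, axis=1))   # one-sided spectrum
    bins = np.arange(1, mag.shape[1] + 1)
    return (mag @ bins) / (np.sum(mag, axis=1) + 1e-12)

def speech_mask(x, frame_len, energy_thr, centroid_thr):
    """True for frames kept as speech; a frame is silence if either measure is below its threshold."""
    frames = frame_signal(np.asarray(x, dtype=float), frame_len)
    return (short_time_energy(frames) >= energy_thr) & (spectral_centroid(frames) >= centroid_thr)

# Synthetic example: 0.5 s of quiet noise followed by 0.5 s of a 440 Hz tone.
rng = np.random.default_rng(0)
sr, frame_len = 8000, 400                        # assumed sampling rate and 50 ms frames
t = np.arange(sr // 2) / sr
noise = 0.001 * rng.standard_normal(sr // 2)     # quiet background
tone = 0.5 * np.sin(2 * np.pi * 440 * t)         # stand-in for a voiced region
mask = speech_mask(np.concatenate([noise, tone]), frame_len,
                   energy_thr=1e-3, centroid_thr=5.0)
```

In practice the thresholds would be tuned on the data rather than fixed by hand as they are here.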
C. Feature Extraction and Optimization
After silence removal, the data of the selected accents were transformed into 39-dimensional PLP features computed on windowed frames of 10 milliseconds each, using the method in [10]. Feature Mean and Variance Normalization (MVN) and short-term Gaussianization (a.k.a. feature warping) were applied afterwards using the method in [11]. The latter warps the distribution of the features to a standard normal distribution to mitigate the effects of locally linear channel mismatch. This is especially useful because the feature distribution here can be modeled by Gaussians [12].
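A minimal sketch of the two normalization steps is given below. It is not the toolbox code of [11]: the rank-based warping rule, the 301-frame window and the random stand-in features are assumptions chosen for illustration.

```python
import numpy as np
from statistics import NormalDist

def mean_variance_normalize(feats):
    """Zero-mean, unit-variance normalization of each feature dimension."""
    mu = feats.mean(axis=0)
    sigma = feats.std(axis=0)
    return (feats - mu) / np.maximum(sigma, 1e-12)

def warp_features(feats, win=301):
    """Short-term Gaussianization: map each value's rank inside a sliding
    window onto the corresponding quantile of a standard normal."""
    n_frames, n_dims = feats.shape
    half = win // 2
    inv_cdf = NormalDist().inv_cdf
    warped = np.empty_like(feats, dtype=float)
    for t in range(n_frames):
        lo, hi = max(0, t - half), min(n_frames, t + half + 1)
        window = feats[lo:hi]
        n = hi - lo
        # Rank of the current frame within the window (1 = smallest), per dimension.
        rank = 1 + np.sum(window < feats[t], axis=0)
        for d in range(n_dims):
            warped[t, d] = inv_cdf((rank[d] - 0.5) / n)
    return warped

# Hypothetical skewed features standing in for PLP frames.
rng = np.random.default_rng(0)
feats = rng.exponential(size=(100, 3))
normed = mean_variance_normalize(feats)
warped = warp_features(normed, win=301)
```

Because the mapping depends only on ranks, the warped values keep the ordering of the input within each window while their distribution is pushed toward a standard normal.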
The normalized and warped features were further improved by PCA and HLDA for dimension reduction and optimization. PCA is commonly used for dimension reduction; it preserves the data dimensions with larger variations in the eigenspace. It has been applied to many applications, such as face recognition [13] and speech evaluation [14]. It also helps to regularize the data and avoid over-fitting in HLDA, which is performed afterwards [15]. Here the feature dimension is reduced from 39 to 30 after applying PCA.
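The 39-to-30 reduction can be illustrated with a small eigen-decomposition sketch. The random data below merely stands in for real PLP frames, and the function names are hypothetical.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return the data mean, the top principal directions and their eigenvalues."""
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)             # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]   # keep the largest variances
    return mean, eigvecs[:, order], eigvals[order]

def pca_transform(X, mean, components):
    """Project centered data onto the retained principal directions."""
    return (X - mean) @ components

# Hypothetical 39-dimensional features with unequal variances per dimension.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 39)) * np.linspace(3.0, 0.1, 39)
mean, components, variances = pca_fit(X, n_components=30)
Y = pca_transform(X, mean, components)                 # shape (500, 30)
```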
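Compared with PCA, LDA reduces dimensions by mapping data into a subspace while maximizing the discriminative information. It overcomes the weakness of PCA when the discriminative information actually lies in the dimensions with less variation. It has also been applied to many problems, such as face recognition [16] and speaker recognition [17]. Assume there are $N$ $M$-dimensional data vectors $x_n$ in $S$ classes, where $N_s$ is the number of vectors in class $s$. Let the global mean over all classes be $\mu$ and the local mean for each class $s$ be $\mu_s$. The between-class scatter $S_B$ and within-class scatter $S_W$ can then be defined as
$$S_B = \sum_{s=1}^{S} \frac{N_s}{N} (\mu_s - \mu)(\mu_s - \mu)^T \quad \text{or} \quad S_B = \frac{1}{S} \sum_{s=1}^{S} (\mu_s - \mu)(\mu_s - \mu)^T \tag{3}$$
and
$$S_W = \frac{1}{N} \sum_{s=1}^{S} \sum_{x_n \in s} (x_n - \mu_s)(x_n - \mu_s)^T \quad \text{or} \quad S_W = \frac{1}{S} \sum_{s=1}^{S} \frac{1}{N_s} \sum_{x_n \in s} (x_n - \mu_s)(x_n - \mu_s)^T \tag{4}$$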
The first definitions in Eq. (3) and Eq. (4) take the class weights into account, i.e. the size of each class s, while the second definitions do not. The first definitions are consistent with the LDA definition used in Kumar’s HLDA work [18] and are used in this work; the second definitions are provided for completeness. The within-class scatter $S_W$ is likely to be singular if there is not enough data in a class; however, applying PCA first significantly alleviates this problem.
Define $w$ as a direction in the underlying subspace $W$ to be transformed to, so that $w^T S_B w$ and $w^T S_W w$ are the projections of $S_B$ and $S_W$ onto $w$. Searching for the directions $w$ that give the best class discrimination is equivalent to maximizing the ratio
$$J(w) = \frac{w^T S_B w}{w^T S_W w} \tag{5}$$
subject to $w^T S_W w = 1$. This criterion is called the Fisher discriminant function; it can be converted by Lagrange multipliers and solved by eigen-decomposition of $S_W^{-1} S_B$. By selecting the eigenvectors associated with the $m$ most significant eigenvalues of $S_W^{-1} S_B$, one can map the original $M$-dimensional data into an $m$-dimensional subspace for discriminative feature reduction.
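The eigen-decomposition route to LDA can be sketched as below, using class-size-weighted between-class and within-class scatter matrices. This covers plain LDA only, not the HLDA estimator used in this work, and the three synthetic classes are assumptions for illustration.

```python
import numpy as np

def lda_fit(X, y, m):
    """Return an (M, m) projection from the leading eigenvectors of inv(S_W) @ S_B."""
    N, M = X.shape
    mu = X.mean(axis=0)
    S_B = np.zeros((M, M))
    S_W = np.zeros((M, M))
    for s in np.unique(y):
        Xs = X[y == s]
        mu_s = Xs.mean(axis=0)
        d = (mu_s - mu)[:, None]
        S_B += len(Xs) / N * (d @ d.T)             # class-size-weighted between-class scatter
        S_W += (Xs - mu_s).T @ (Xs - mu_s) / N     # pooled within-class scatter
    # Solve S_W^{-1} S_B without forming the inverse explicitly.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(eigvals.real)[::-1][:m]
    return eigvecs[:, order].real

# Three synthetic classes separated mostly along the first axis.
rng = np.random.default_rng(0)
means = np.array([[0, 0, 0, 0, 0], [5, 0, 0, 0, 0], [10, 1, 0, 0, 0]], dtype=float)
X = np.vstack([rng.normal(loc=mean, scale=1.0, size=(300, 5)) for mean in means])
y = np.repeat([0, 1, 2], 300)
W = lda_fit(X, y, m=2)
Y = X @ W                                          # data in the 2-dimensional subspace
```

With S classes the between-class scatter has rank at most S - 1, so at most S - 1 directions carry discriminative information in this plain form.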
LDA is derived under the assumption that all classes share the same covariance, but in real scenarios there are examples illustrating that LDA may transform data into a suboptimal space when the class covariances differ. Heteroscedastic LDA (HLDA) is a generalization of LDA using Maximum Likelihood Estimation (MLE) on Gaussian distributions, which removes this assumption. An improved version of Kumar’s HLDA algorithm with more flexibility and higher efficiency was developed in MATLAB and used in this work [19]. The context size C is set to 1 and the feature dimension is further reduced from 30 to 20 after applying HLDA.
Minematsu et al. [20] and Suzuki et al. [4] demonstrated that, for a particular speaker, the locations of the 5 fundamental vowels in the feature space of a target language (such as English in this work) are relatively consistent. Therefore, they can be extracted as accent-adapted features and used for identifying the accent of that speaker. Fig. 3 is a simple demonstration of 5 vowels from both accented and non-accented (standard) languages in the reduced 2-dimensional feature space [20]. The center of each pentagon is the weighted average of the five vowels based on their positions in the feature space and their frequency of appearance in the corpus. By matching the centers of the pentagons of the standard and the accented language, as in the overlapped pentagons at the bottom of Fig. 3, the Bhattacharyya distances [21] between each pair of corresponding vowels and their angles can be computed and stored in a vector. This vector represents the difference from the accented language $L'$ to the standard one $L$. To classify the test speech into one of the accent categories $L_1, \dots, L_N$, where N is the number of accent categories, the differences from the test speech to each $L_i$ and to $V$ (the category of the standard language) are computed and compared, and the test speech is classified to the nearest accent category.
Fig. 3. Comparison of 5 vowels locations in standard and accented language
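For intuition, the Bhattacharyya distance between a pair of corresponding vowels can be computed in closed form if each vowel is summarized by a single Gaussian with diagonal covariance. That modeling choice is an assumption made for this sketch; the text does not specify how each vowel distribution is represented.

```python
import numpy as np

def bhattacharyya_diag(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two Gaussians with diagonal covariances."""
    mu1, var1, mu2, var2 = (np.asarray(a, dtype=float) for a in (mu1, var1, mu2, var2))
    var = 0.5 * (var1 + var2)                      # averaged covariance (diagonal)
    term_mean = 0.125 * np.sum((mu1 - mu2) ** 2 / var)
    term_cov = 0.5 * np.sum(np.log(var) - 0.5 * (np.log(var1) + np.log(var2)))
    return float(term_mean + term_cov)

# Hypothetical 2-D positions of the same vowel in standard and accented speech.
standard_vowel = ([0.0, 0.0], [1.0, 1.0])
accented_vowel = ([1.0, 0.5], [1.5, 0.8])
shift = bhattacharyya_diag(*standard_vowel, *accented_vowel)
```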
A. Phoneme Alignment in System Development
In order to extract vowels from the speech data, phoneme alignment was performed during system development, using the in-house transcriptions of the part of the FAE corpus covering the 7 accents, each of which comes from one major accent group. We prepared the dictionary needed for phoneme alignment with HVite in HTK [22] through a procedure including transcription cleaning, word collection, word-to-pronunciation conversion, etc. Fig. 4 demonstrates the process of dictionary preparation and phoneme alignment for the FAE corpus. The dictionary file is a list of word-pronunciation pairs in HTK format, obtained through word collection, word-to-pronunciation conversion with the ININ Lexicon Tester, and HTK dictionary file creation. For phoneme alignment, the HTK configuration file, HMM model definitions and tied list were all trained using the Fisher corpus.
Fig. 4. Dictionary preparation and phoneme alignment for FAE corpus
B. Phoneme Recognition in System Test
During system test, no transcription is available. To find the features corresponding to vowels, phoneme recognition was performed on the accented test speech using HTK. Since the recognition cannot be perfect, only the subset of recognized vowels whose confidence score, based on the n-gram log-likelihood, is higher than a threshold was used. This threshold was predefined using the training and development data.
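The selection step can be pictured as a simple filter over the recognizer output. The segment format, the score values and the threshold below are hypothetical, and the vowel set is the common 15-vowel Arpabet inventory, assumed here to match the set used in this work.

```python
# Assumed 15-vowel Arpabet inventory (lower-case labels).
ARPABET_VOWELS = {"aa", "ae", "ah", "ao", "aw", "ay", "eh", "er",
                  "ey", "ih", "iy", "ow", "oy", "uh", "uw"}

def select_confident_vowels(segments, threshold):
    """Keep recognized vowel segments whose confidence score exceeds the threshold."""
    return [seg for seg in segments
            if seg["phone"] in ARPABET_VOWELS and seg["score"] > threshold]

# Hypothetical recognizer output: phone label, frame range and a log-likelihood-based score.
segments = [
    {"phone": "ah", "start": 0, "end": 12, "score": -3.1},
    {"phone": "t", "start": 12, "end": 18, "score": -1.0},
    {"phone": "iy", "start": 18, "end": 30, "score": -9.4},
    {"phone": "oy", "start": 30, "end": 45, "score": -2.2},
]
kept = select_confident_vowels(segments, threshold=-5.0)
```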
The Gaussian Mixture Model-Universal Background Model (GMM-UBM) framework has been successfully applied to speaker verification and classification systems [23]. The accent classification algorithm developed in this paper treats accents as speakers, and models the attributes of accents using GMM-UBM, which is similar to modeling the attributes of speakers in speaker classification systems. Modern speaker classification systems train a general Gaussian Mixture Model (GMM) with data from all speakers, the so-called Universal Background Model (UBM), and then generate individual GMMs for each speaker by adapting the UBM with features from individual speakers. Subsec. IV-A and IV-B provide an outline of applying a similar framework to the accent classification problem.
A. Universal Background Model (UBM)
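The Universal Background Model (UBM) is a general GMM trained with features from all types of accents. Given a GMM $\lambda$, where N is the number of mixture components, the likelihood function for a feature frame x can be formulated as
$$p(x \mid \lambda) = \sum_{i=1}^{N} w_i \, p_i(x) \tag{6}$$
where
$$p_i(x) = \frac{1}{(2\pi)^{M/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right)$$
The parameters of the GMM $\lambda$, including the weight $w_i$, mean $\mu_i$ and covariance matrix $\Sigma_i$, can be optimized by the Expectation-Maximization (EM) algorithm [24]. Here $\Sigma_i$ is restricted to be diagonal. Usually the feature vectors are assumed independent, so the log-likelihood of a GMM $\lambda$ for a sequence of K feature vectors, $X = \{x_1, \dots, x_K\}$, is computed as
$$\log p(X \mid \lambda) = \sum_{k=1}^{K} \log p(x_k \mid \lambda) \tag{7}$$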
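As a sketch of this computation, the per-frame log-likelihood of a diagonal-covariance GMM can be evaluated with a log-sum-exp over the mixture components and then summed over frames under the independence assumption. The function names and the toy parameters are assumptions for illustration.

```python
import numpy as np

def gmm_frame_loglik(X, weights, means, variances):
    """log p(x_k | lambda) for every row of X under a diagonal-covariance GMM.

    weights: (N,), means: (N, M), variances: (N, M) diagonal entries.
    """
    X = np.atleast_2d(X)
    M = X.shape[1]
    diff = X[:, None, :] - means[None, :, :]                       # (K, N, M)
    log_norm = -0.5 * (M * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    log_comp = log_norm - 0.5 * np.sum(diff ** 2 / variances, axis=2)
    a = np.log(weights) + log_comp                                 # (K, N)
    a_max = a.max(axis=1, keepdims=True)                           # log-sum-exp for stability
    return (a_max + np.log(np.sum(np.exp(a - a_max), axis=1, keepdims=True))).ravel()

def gmm_loglik(X, weights, means, variances):
    """Sum of per-frame log-likelihoods (frames assumed independent)."""
    return float(np.sum(gmm_frame_loglik(X, weights, means, variances)))

# Toy 2-component, 1-dimensional model.
w = np.array([0.5, 0.5])
mu = np.array([[-1.0], [1.0]])
var = np.ones((2, 1))
frames = np.array([[0.0], [0.5], [-2.0]])
total = gmm_loglik(frames, w, mu, var)
```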
B. Adaptation of Accent Model
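In the GMM-UBM system, we derive the individual accent GMMs by adapting the parameters of the UBM using the training speech $X = \{x_1, \dots, x_K\}$ of each accent and a form of Bayesian adaptation. The adaptation is a two-step process. The first step is identical to the E-step of the EM algorithm, where we determine the probabilistic alignment of X into the UBM mixtures. That is, for the $i$-th component of the UBM, we compute
$$\Pr(i \mid x_k) = \frac{w_i \, p_i(x_k)}{\sum_{j=1}^{N} w_j \, p_j(x_k)} \tag{8}$$
Then, the sufficient statistics for the weight, mean and variance can be computed by
$$n_i = \sum_{k=1}^{K} \Pr(i \mid x_k) \tag{9}$$
$$E_i(x) = \frac{1}{n_i} \sum_{k=1}^{K} \Pr(i \mid x_k) \, x_k \tag{10}$$
$$E_i(x^2) = \frac{1}{n_i} \sum_{k=1}^{K} \Pr(i \mid x_k) \, x_k^2 \tag{11}$$
where $x^2$ is shorthand for diag($x x^T$). Finally, these new sufficient statistics from the training data are used to update the UBM statistics for mixture i, to create the adapted parameters for mixture i in the accent GMMs, with the equations:
$$\hat{w}_i = \left[ \alpha_i^w \, n_i / K + (1 - \alpha_i^w) \, w_i \right] \gamma \tag{12}$$
$$\hat{\mu}_i = \alpha_i^m \, E_i(x) + (1 - \alpha_i^m) \, \mu_i \tag{13}$$
$$\hat{\sigma}_i^2 = \alpha_i^v \, E_i(x^2) + (1 - \alpha_i^v)(\sigma_i^2 + \mu_i^2) - \hat{\mu}_i^2 \tag{14}$$
where $\sigma_i^2$ denotes the diagonal of $\Sigma_i$ and $\{\alpha_i^w, \alpha_i^m, \alpha_i^v\}$ are the adaptation coefficients for the weights, means and variances respectively, controlling the balance between old and new estimates. They can be derived from the Maximum a posteriori (MAP) estimation equations for a GMM using constraints on the prior distribution described in [25]. The scale factor $\gamma$ is computed over all adapted mixture weights to ensure they sum to 1.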
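The two-step adaptation can be sketched as follows, in the style of the adapted-GMM recipe of [23]. Two points are assumptions of this sketch rather than statements from the text: a single data-dependent coefficient $n_i / (n_i + r)$ is shared by weights, means and variances, and the relevance factor r = 16 is an illustrative default.

```python
import numpy as np

def posteriors(X, w, mu, var):
    """Pr(i | x_k) for every frame and mixture under a diagonal-covariance GMM."""
    diff = X[:, None, :] - mu[None, :, :]
    log_p = (np.log(w) - 0.5 * np.sum(np.log(2 * np.pi * var), axis=1)
             - 0.5 * np.sum(diff ** 2 / var, axis=2))
    log_p -= log_p.max(axis=1, keepdims=True)        # numerical stability
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)

def map_adapt(X, w, mu, var, relevance=16.0):
    """Adapt UBM weights, means and diagonal variances to the data X."""
    K = X.shape[0]
    post = posteriors(X, w, mu, var)                 # (K, N) probabilistic alignment
    n = post.sum(axis=0)                             # soft counts per mixture
    n_safe = np.maximum(n, 1e-10)
    Ex = post.T @ X / n_safe[:, None]                # first-order statistics
    Ex2 = post.T @ (X ** 2) / n_safe[:, None]        # second-order statistics
    alpha = n / (n + relevance)                      # assumed shared adaptation coefficient
    w_new = alpha * n / K + (1 - alpha) * w
    w_new /= w_new.sum()                             # scale so the weights sum to 1
    mu_new = alpha[:, None] * Ex + (1 - alpha[:, None]) * mu
    var_new = (alpha[:, None] * Ex2
               + (1 - alpha[:, None]) * (var + mu ** 2) - mu_new ** 2)
    return w_new, mu_new, var_new

# Toy UBM with two 1-D mixtures; the "accent" data sits near the second one.
rng = np.random.default_rng(0)
ubm_w = np.array([0.5, 0.5])
ubm_mu = np.array([[-2.0], [2.0]])
ubm_var = np.ones((2, 1))
X_accent = rng.normal(3.0, 1.0, size=(200, 1))
w_new, mu_new, var_new = map_adapt(X_accent, ubm_w, ubm_mu, ubm_var)
```

Mixtures that see little data keep parameters close to the UBM, while well-observed mixtures move toward the accent data.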
C. Baseline Classifier
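After obtaining the adapted GMM parameter set $\lambda_s$ through the GMM-UBM framework for each accent class $s$, the GMM-based classifier, which maximizes the a posteriori probability for K M-dimensional feature vectors $X = \{x_1, \dots, x_K\}$, can be formulated as:
$$\arg\max_{1 \le s \le S} p(\lambda_s \mid X) = \arg\max_{1 \le s \le S} \frac{p(X \mid \lambda_s) \, p(\lambda_s)}{p(X)} \propto \arg\max_{1 \le s \le S} p(X \mid \lambda_s) \propto \arg\max_{1 \le s \le S} \sum_{k=1}^{K} \log p(x_k \mid \lambda_s) \tag{15}$$
The first equation is due to Bayes’ rule. The first proportion assumes that the prior $p(\lambda_s)$ is equal for all accents and that p(X) is the same for all accent models. The second proportion uses the logarithm and the independence between input samples $x_k$, as explained before in Eq. (7). Fig. 5 shows the diagram for baseline classification with accent GMM classifiers, where accent features are provided for UBM adaptation.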
Fig. 5. Baseline GMM classifiers adapted from UBM
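Under equal priors, the decision rule reduces to summing per-frame log-likelihoods for each accent model and picking the largest total. A minimal sketch, with hypothetical accent labels and made-up per-frame scores, is:

```python
def classify_accent(frame_logliks):
    """Pick the accent whose model gives the highest total log-likelihood.

    frame_logliks maps an accent label to the per-frame values log p(x_k | lambda_s).
    """
    totals = {accent: float(sum(values)) for accent, values in frame_logliks.items()}
    best = max(totals, key=totals.get)
    return best, totals

# Hypothetical per-frame scores of one test utterance under two accent GMMs.
frame_logliks = {
    "accent_a": [-2.0, -1.5, -3.0],
    "accent_b": [-1.0, -1.2, -0.9],
}
best, totals = classify_accent(frame_logliks)
```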
D. Improved Classifier
The improvement over the baseline is mainly contributed by the vowel extraction. To construct the classifier with vowel representation, instead of directly measuring the shift of vowels from the standard speech to the accented one, the speech segments of the same vowel from various types of accents were concatenated and used to train vowel-specific UBMs. Fig. 6 shows that each of the T UBMs was then adapted to S separate GMMs using data from the various accent types. Here T is the number of vowels used in this work. Instead of using only the 5 fundamental vowels described in Sec. III, the same concept was generalized and all 15 vowels in Arpabet [26] were used, which are listed in Table III. S is the number of selected accent types from the FAE corpus (Table II), which is 7 in this work.
Fig. 6. Improved GMM classifiers with vowel extraction
TABLE III VOWELS IN ARPABET
Given the extracted features of T types of vowels $\{X_1, \dots, X_T\}$ from the accented test feature X, the improved GMM accent classifier, as the combination of the GMM classifiers of all vowels (shown in the bottom box in Fig. 6), can be formulated as
$$\arg\max_{1 \le s \le S} \sum_{t=1}^{T} v_t \, \log p(X_t \mid \lambda_{s,t}) \tag{16}$$
where $\lambda_{s,t}$ is the GMM for the $s$-th accent and the $t$-th type of vowel, and $v_t$ is the weight of the vowel-specific GMM classifier for the $t$-th vowel. Adding this additional layer on top of the GMM classifier is critical for finding the vowel sets that preserve the accents, and is later shown to improve accent classification.
There are two factors considered in the vowel weight $v_t$ in Eq. (16): 1) the popularity (proportion) $P_t$ of the $t$-th vowel in the whole vowel set, and 2) the discriminativeness $D_t$, i.e. the difference in the distributions of the same vowel extracted from different accents. The first factor is based on the assumption that GMMs trained with more data are more reliable than those trained with less data. Fig. 7 shows the popularity of vowels in descending order. It shows that the vowel ah is much more popular (frequent) than the vowel oy in the selected dataset.
Fig. 7. Popularity of vowels in the descending order of the corresponding feature frames in training data
The second factor is based on the assumption that vowels are more discriminative if the distributions of GMMs of the same vowel but different accents are far apart. For example, Fig. 8 shows the Hellinger distances [27] computed between any 2 GMM distributions of 7 different accent types for the vowel aa. The discriminativeness factor for aa is just the reciprocal of the mean of these distances from the $\binom{7}{2} = 21$ combinations (the smaller the mean, the larger the weight). Here we simply combined $P_t$ and $D_t$ to compute the weights of the vowels, which assumes both factors are equally important.
Fig. 8. Example of distance measurement among GMMs of same vowel (aa) but 7 different accent types
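The two factors can be computed as sketched below, following the definitions in the text: popularity as the share of feature frames per vowel, and the second factor as the reciprocal of the mean pairwise distance over the 21 accent pairs. The frame counts and the distance values are made up, and the sketch stops short of combining the two factors since the exact rule is not given here.

```python
from itertools import combinations

def popularity(frame_counts):
    """Proportion of feature frames belonging to each vowel."""
    total = sum(frame_counts.values())
    return {vowel: count / total for vowel, count in frame_counts.items()}

def discriminativeness(dist, accents):
    """Reciprocal of the mean pairwise distance between accent-specific GMMs of one vowel."""
    pairs = list(combinations(accents, 2))
    mean_dist = sum(dist[a][b] for a, b in pairs) / len(pairs)
    return 1.0 / mean_dist

# Hypothetical frame counts and a hypothetical distance table for one vowel.
frame_counts = {"ah": 6000, "aa": 3000, "oy": 1000}
accents = [f"accent_{i}" for i in range(1, 8)]           # 7 accent types
dist_aa = {a: {b: 0.5 for b in accents} for a in accents}
P = popularity(frame_counts)
D_aa = discriminativeness(dist_aa, accents)
n_pairs = len(list(combinations(accents, 2)))
```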
The baseline classification was based on an accent GMM classifier with 256 mixtures, adapted from the UBM, using normalized and warped 39-dimensional PLPs. The features were then optimized using PCA and HLDA with context size C = 1, and the dimension was reduced from 39 to 20. With the enhanced features, the baseline accuracy increased from 42.3% to 47.9%. The main contribution to the accuracy improvement came from the classifier combining the weighted vowels, which further increased the 7-way classification rate to 53.7%. Table IV shows the performance of all 3 experiments with various models and features.
TABLE IV 7-WAY ACCENT CLASSIFICATION UNDER GMM-UBM FRAMEWORK WITH ACOUSTIC AND PHONETIC FEATURES
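As a rough reading of these accuracies, the absolute and relative gains over the 42.3% baseline can be worked out directly from the three reported figures. The relative values below are derived here by arithmetic and are not separately reported results.

```latex
\begin{align*}
\text{PCA/HLDA features:} \quad & 47.9 - 42.3 = 5.6 \text{ points}, & \frac{5.6}{42.3} &\approx 13.2\% \text{ relative} \\
\text{weighted vowels:} \quad & 53.7 - 42.3 = 11.4 \text{ points}, & \frac{11.4}{42.3} &\approx 27.0\% \text{ relative}
\end{align*}
```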
This work demonstrates that methods from speaker recognition can be used for accent classification. With several feature optimization techniques and phonetic vowel information, the accuracy obtained from accented speech as short as 20 seconds is competitive with the state of the art in [1], [2] and [3]. In the future, more recent classification methods such as i-vectors (eigenvoice components) [28] or neural network classifiers [29] can be explored, used or combined with the current methods. More data-driven techniques can also be tried, such as 1) training distinct UBMs for male and female accented speakers, 2) using a tri-phone vowel set instead of the current mono-phone vowel set for more refined classification, and 3) selecting, by experiment, a subset of vowels rather than using all 15 vowels in Arpabet for better classification results.
[1] G. Choueiter, G. Zweig, and P. Nguyen, “An empirical study of automatic accent classification,” in ICASSP 2008. IEEE International Conference on. IEEE.
[2] M. K. Omar and J. Pelecanos, “A novel approach to detecting non-native speakers and their native language,” in ICASSP, 2010 IEEE International Conference on. IEEE.
[3] J. Macías-Guarasa, “Acoustic adaptation and accent identification in the ICSI MR and FAE corpora,” in ICSI Meeting slides, 2003.
[4] M. Suzuki, L. Dean, N. Minematsu, and K. Hirose, “Improved structure-based automatic estimation of pronunciation proficiency,” Proc. SLaTE, vol. 5, 2009.
[5] Z. Ge, Mispronunciation Detection with Multiple Applications: for Language Learning and Speech Recognition Adaptation and Improvement. Scholars’ Press, 2014.
[6] Z. Ge, Y. Tan, and A. Ganapathiraju, “Accent classification with phonetic vowel representation,” in Pattern Recognition (ACPR), 2015 3rd IAPR Asian Conference on. IEEE, 2015.
[7] T. Giannakopoulos, “A method for silence removal and segmentation of speech signals, implemented in MATLAB,” University of Athens, Athens, 2009.
[8] M. Slaney, “Auditory toolbox,” Interval Research Corporation, Tech. Rep. 1998-010, 1998.
[9] M. Brookes et al., “Voicebox: Speech processing toolbox for MATLAB,” Software, available [Mar. 2011] from www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html, 1997.
[10] D. Ellis, “PLP and RASTA (and MFCC, and inversion) in MATLAB,” http://labrosa.ee.columbia.edu/matlab/rastamat/, accessed 2015-07-01.
[11] S. O. Sadjadi, M. Slaney, and L. Heck, “MSR identity toolbox v1.0: A MATLAB toolbox for speaker recognition research,” Speech and Language Processing Technical Committee Newsletter, 2013.
[12] J. Pelecanos and S. Sridharan, “Feature warping for robust speaker verification,” 2001.
[13] M. Turk, A. P. Pentland et al., “Face recognition using eigenfaces,” in Computer Vision and Pattern Recognition, 1991. Proceedings CVPR’91., IEEE Computer Society Conference on. IEEE, 1991, pp. 586–591.
[14] Z. Ge, S. R. Sharma, and M. J. Smith, “PCA method for automated detection of mispronounced words,” in SPIE Defense, Security, and Sensing. International Society for Optics and Photonics, 2011, pp. 80581D–80581D.
[15] J. Yang and J.-y. Yang, “Why can LDA be performed in PCA transformed space?” Pattern recognition, vol. 36, no. 2, pp. 563–566, 2003.
[16] J. Lu, K. N. Plataniotis, and A. N. Venetsanopoulos, “Face recognition using LDA-based algorithms,” Neural Networks, IEEE Transactions on, vol. 14, no. 1, pp. 195–200, 2003.
[17] Z. Ge, S. R. Sharma, and M. J. Smith, “PCA/LDA approach for text-independent speaker recognition,” in SPIE Defense, Security, and Sensing. International Society for Optics and Photonics, 2012, pp. 840108–840108.
[18] N. Kumar and A. G. Andreou, “Investigation of silicon auditory models and generalization of linear discriminant analysis for improved speech recognition,” Ph.D. dissertation, Johns Hopkins University, 1997.
[19] Z. Ge, “Mispronunciation detection for language learning and speech recognition adaptation,” 2013.
[20] N. Minematsu, “Yet another acoustic representation of speech sounds,” in ICASSP’04. IEEE International Conference on. IEEE.
[21] A. Bhattacharyya, “On a measure of divergence between two multinomial populations,” Sankhyā: The Indian Journal of Statistics (1933-1960), vol. 7, no. 4, pp. 401–406, 1946.
[22] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. A. Liu, G. Moore, J. Odell, D. Ollason, D. Povey et al., “The HTK book (for HTK version 3.4),” 2006.
[23] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker verification using adapted Gaussian mixture models,” Digital signal processing, vol. 10, no. 1, pp. 19–41, 2000.
[24] A. Dempster, N. Laird, and D. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 1–38, 1977.
[25] J.-L. Gauvain and C.-H. Lee, “Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains,” Speech and Audio Processing, IEEE Transactions on, vol. 2, no. 2, pp. 291–298, 1994.
[26] Wikipedia, “Arpabet,” http://en.wikipedia.org/wiki/Arpabet, August 2011, accessed 2015-07-01.
[27] M. Kristan, A. Leonardis, and D. Skočaj, “Multivariate online kernel density estimation with Gaussian kernels,” Pattern Recognition, vol. 44, no. 10, pp. 2630–2642, 2011.
[28] P. Matějka, O. Glembek, F. Castaldo, M. J. Alam, O. Plchot, P. Kenny, L. Burget, and J. H. Černocký, “Full-covariance UBM and heavy-tailed PLDA in i-vector speaker verification,” in Acoustics, Speech and Signal Processing, 2011 IEEE International Conference on. IEEE, 2011.
[29] Z. Ge and Y. Sun, “Sleep stages classification using neural networks with multi-channel neural data,” in Brain Informatics and Health. Springer, 2015, pp. 306–316.