I-vector Based Features Embedding for Heart Sound Classification

2019·arXiv

Abstract

Cardiovascular Disease (CVD) is one of the principal causes of death in the world. In recent years, researchers have increasingly investigated heart sound patterns for disease diagnostics. In this study, an approach is proposed for normal/abnormal heart sound classification on the Physionet challenge 2016 dataset. For the first time, a fixed-length feature vector, called the i-vector, is extracted from each heart sound using Mel Frequency Cepstral Coefficient (MFCC) features. Afterwards, Principal Component Analysis (PCA) and a Variational Autoencoder (VAE) are applied to the i-vector for dimension reduction. Finally, the reduced vector is fed to Gaussian Mixture Models (GMMs) and a Support Vector Machine (SVM) for classification. Experimental results demonstrate that the proposed method achieves a performance improvement of 16% in Modified Accuracy (MAcc) compared with the baseline system on the Physionet2016 dataset.

Keywords: Heart Sound Classification, i-vector, Gaussian Mixture Models, Support Vector Machine, Principal Component Analysis, Variational Autoencoders

1. Introduction

Cardiovascular disease (CVD) is one of the most common causes of death around the world and the leading cause of disability. Based on the information provided by the World Heart Association, 2017, 17.7 million people die every year due to CVD, equal to 31% of all global deaths. The most prevalent CVDs are heart attacks and strokes. In 2013, all 194 members of the World Health Organization agreed to implement the Global Action Plan for the Prevention and Control of Non-communicable Diseases, a plan for 2013 to 2020, to be prepared against CVDs. Implementation of nine global and voluntary goals in this plan led to a significant drop in the number of premature deaths due to non-communicable diseases.

Accordingly, in recent years, researchers have shown considerable interest in detecting heart diseases based on heart sounds [1]. Most approaches in this context rely on sound segmentation and feature extraction. Extracted features are then fed to machine learning methods to assess system performance on real-world datasets. In addition, various studies have addressed normal/abnormal heart sound classification using segmentation methods. Methods in this field can be categorized into three groups: segmentation-based approaches, wavelet-based approaches, and time-frequency based approaches. Methods in the first category focus on using a variable window of the short-time Fourier transform (S-transform), the Hilbert transform, etc., to segment each recording and then apply different classifiers to detect normal/abnormal heart behavior [2][3][4][5]. The second category is based on the wavelet transform; wavelet features are then fed to well-known classifiers such as SVM [6][7]. Finally, recent approaches are concerned with time-frequency features such as the Discrete Fourier Transform (DFT), MFCC, and so on. To determine the class of each test sample, various classifiers such as Convolutional Neural Networks (CNNs) and Artificial Neural Networks (ANNs) have been adopted [8][9][10][11].

Although approaches in this field have made remarkable progress, they still do not yield the desired performance. One reason is the paucity of systems that extract appropriate speech features from heart sounds. Hence, representing appropriate features is the most essential step toward improving performance, although it is not achievable without a suitable classifier. In the proposed framework, the i-vector is used as a feature representation technique. The underlying motivation for using the i-vector in this context is that human heart sounds can be considered physiological traits of a person [12], and only irregular events such as accidents, illnesses, genetic defects, or aging can alter or destroy these traits [12]. As a result, heart sounds lend themselves to being described by representative features like i-vectors. In the classification stage, GMMs and SVM are employed, which are well suited to modeling i-vectors with a mixture nature. In addition, feature reduction techniques such as PCA and VAE are applied to achieve a compact and informative representation of the extracted i-vectors, which leads to discriminative features. Our contributions are summarized as follows:

• While i-vector is employed predominantly in various speech-related tasks, it is less known to the biosignal

• PCA and VAE are employed to reduce the size of i-vector and also extract the most significant features

• Finally, the proposed approach outperforms the previous studies in terms of Specificity (Sp), Sensitivity

In the following sections, the related works are presented first, and then the proposed framework is described. Next, the experimental setup is discussed, and the results are analyzed in the experimental results section. Finally, the conclusion section concludes the study.

2. Related Studies

A review of current approaches is presented in [15]. Each presented approach is systematically reviewed and existing approaches are analyzed based on their performance. In the following section, a brief discussion of approaches in this field is presented, which categorizes them into two subcategories.

2.1. Segmentation Based Approaches

Approaches in this category use features and audio segments to classify normal/abnormal heart sounds. In [2], an approach was proposed for automatic segmentation using the Hilbert transform. Features for this study included envelopes near the peaks of S1 and S2, and the transmission points T12 from S1 to S2 and vice versa. The database for this study consisted of 7730 s of heart sounds from pathological patients, 600 s from normal subjects, and 1496.8 s from the Michigan MHSDB database. The average accuracy was 96.69% for sounds with mixed S1 and S2, and 97.37% for those with separated S1 and S2.

CNN-based segmentation, which is common in image processing tasks, is proposed in [16] to segment heart sounds into their main components. The same concept is used in [17], which incorporates a CNN to extract short segment features from 1-dimensional inputs (e.g. the raw heart sound signal) and 2-dimensional inputs (e.g. a time-frequency representation of the heart sound) to segment these signals. Another envelope extraction method employed for heart sound segmentation is the Cardiac Sound Characteristic Waveform (CSCW). The work presented in [3] used this method on only a small set of heart sounds, comprising 9 sound recordings, and reported 99.0% accuracy. No train-test split was performed for evaluation in this study. The work in [4] achieved an accuracy of 92.4% for S1 and 93.5% for S2 segmentation by using homomorphic filtering and a Hidden Markov Model (HMM) on the PASCAL database [5].

2.2. Wavelet Based Approaches

Wavelet-based approaches employ the wavelet transform to extract features, which are subsequently used for classification. In [18], the Shannon energy envelopes of the local spectrum are calculated by a new method, which uses the S-transform for every sound produced by the heart sound signals. Sensitivity and positive predictivity were evaluated on 80 heart sound recordings (40 normal and 40 pathological), and both were reported to be over 95%. The work in [19] adopted the same approach with wavelet analysis on the same database and reported an accuracy of 90.9% for S1 segmentation and 93.3% for S2 segmentation. The work in [6] classified normal and pathological cases using a Least Square Support Vector Machine (LSSVM) with wavelet-based features. The method was evaluated on a dataset with heart sounds of 64 patients (32 cases for training and 32 for testing), and an accuracy of 86.72% was reported. In [7], with the same classifier, wavelet packets were used and features such as sample entropy and energy fraction were extracted as input. The dataset used for this problem consisted of 40 normal individuals and 67 pathological patients, and the method achieved 97.17% accuracy, 93.48% sensitivity and 98.55% specificity. Another study [20] also used LSSVM as the classifier, with the tunable-Q wavelet transform as input features. Evaluation in this study showed 98.8% sensitivity and 99.3% specificity on a dataset comprising 4628 cycles from 163 heart sound recordings, with an unknown number of patients. The fractional Fourier transform is proposed for feature extraction in [21], and the extracted features are subsequently classified by a stacked autoencoder, which yields an accuracy of 95%.

A study on the expected duration of heart sounds using HMM and the Hidden Semi-Markov Model (HSMM) was introduced in [22]. In this study, the positions of S1 and S2 sounds were initially labeled in 113 recordings. Afterwards, Gaussian distributions were calculated for the expected duration of each of the four states (S1, systole, S2, and diastole), using the average duration of the mentioned sounds and autocorrelation analysis of the systolic and diastolic durations. The homomorphic envelope plus three frequency features (in the 25-50, 50-100 and 100-150 Hz ranges) were among the features used in this study. Gaussian distributions were then calculated for training the HMM states and emission probabilities. Finally, for the decoding process, the backward and forward Viterbi algorithm was used. Sensitivity and specificity were reported as 98.8% and 98.6%, respectively. This work also proposed HSMM alongside logistic regression (for emission probability estimation) to accurately segment noisy, real-world heart sound recordings [23]. This work also used the Viterbi algorithm to decode state sequences. For evaluation, a database of 10172 s of heart sounds recorded from 112 patients was used. The F1 score for this study is reported to be 95.63%, improving over the previous state-of-the-art study with 86.28% on the same test set.

Other studies were developed based on feature extraction and classification using machine learning classifiers such as ANN, SVM, HMM, and K-Nearest Neighbor (KNN). To distinguish normal from pathological recordings by their spectral energy, the work in [24] extracted five frequency bands whose spectral energy was given as input to an ANN. Results on a dataset with 50 recorded sounds show 95% sensitivity and 93.33% specificity. In [25], a discrete wavelet transform together with fuzzy logic was used for a three-class problem: normal, pulmonary stenosis, and mitral stenosis. An ANN was employed to classify a dataset of 120 subjects with a 50/50 split between training and test sets. Reported results were 100% sensitivity, 95.24% specificity, and 98.33% average accuracy. Moreover, time-frequency features were used as input to an ANN in [8]. This work reported 90.4% sensitivity, 97.44% specificity, and 95% accuracy on the same dataset for the same problem (three-class classification including normal, pulmonary and mitral stenosis heart valve diseases).

In [10], an HMM was fitted to the frequency spectrum of the heart cycle, and four HMMs were used to evaluate the posterior probability of the features given the model for classification. For better results, PCA was used as a reduction procedure, and the reported results were 95% sensitivity, 98.8% specificity, and 97.5% accuracy on a dataset with 60 samples. KNN was used in [11] on features from various time-frequency representations. Features were extracted from a subset of 22 persons, including 16 normal participants and 6 pathological patients. An accuracy of 98% was reported for this problem, where the likelihood of over-training was used as parameters for KNN. The work in [11] also chose KNN for clustering the samples into normal and pathological. This study also employed two approaches for dimensionality reduction of the extracted time-frequency features: linear decomposition and tiling partition of the mentioned feature plane. Results were obtained on a total of 45 recordings, including 19 pathological and 26 normal, and an average accuracy of 99% was reported with 11-fold cross-validation.

Table 1 summarizes the research studies cited in this section.

Table 1: Summary of the previous heart sound works, methods, database and results [1].

Although the i-vector was originally used for speaker recognition applications [26], it is currently used in various fields such as language identification [27, 28], accent identification [29], gender recognition, age estimation, emotion recognition [30, 31], audio scene classification [32], spoofing detection in automatic speaker verification systems [33], etc. In this study, the i-vector is adopted for normal/abnormal heart sound classification task. Our approach is categorized in the time-frequency category, where MFCC is extracted and then the i-vector method is employed to extract features based on individuals’ heart sound characteristics.

3. Proposed Framework

In this study, the proposed framework aims at using the i-vector for normal/abnormal heart sound classification. First, MFCC feature vectors are extracted from heart sound records. Then, in order to extract i-vectors, a large GMM (e.g. 2048 components), called the Universal Background Model (UBM), is trained using the MFCC features extracted from all heart sound records (i.e. both normal and abnormal) in the training set. After UBM training, the zero and first-order statistics of the training features are extracted. These statistics are then used to train the i-vector extractor through several iterations of the EM algorithm, which will be explained in Section 3.2.4. After training the i-vector extractor, i-vectors are extracted from all records in the training set. At this stage, a fixed-length i-vector is extracted for each record and is then fed into PCA or VAE in order to reduce its size as well as the intra-class variation. Eventually, there is a representative i-vector for each record, which is used for classification.

Fig. 1 briefly illustrates our proposed system.

3.1. Mel-frequency Cepstral Coefficients

MFCCs have been employed over the years as one of the most salient features for speaker recognition [34]. The MFCC attempts to model human hearing perception by focusing on low frequencies (0-1 kHz) [35]. In other words, the differences of critical bandwidth in the human ear are the basis of what we know as MFCCs. In addition, the Mel frequency scale is applied to extract critical features of speech, especially its pitch.

3.1.1. MFCC Extraction

In the following section, we explain how the MFCC features are extracted. Initially, the given signal s[n] is pre-emphasized. Here, “pre-emphasis” means the reinforcement of high-frequency components by passing the signal through a high-pass filter [34]. The output of the filter is as follows

In the next step, called framing, the pre-emphasized signal is divided into short-time frames of equal length (e.g. 25 ms) in order to achieve stationarity. Subsequently, the Hamming window is applied as

Figure 1: Block diagram of the proposed system.

where N is the number of samples in each frame. Heart sounds are sampled at 2 kHz and each frame is 25 ms long, so each frame contains N = 50 samples.

To analyze h[n] in the frequency domain, an N-point Fast Fourier Transform (FFT) is applied according to

A logarithmic power spectrum is obtained by log energy computation block, on a Mel-scale using a filter bank that consists of L filters.

where ) is the absolute value of complex Fourier transform, ) is the lth triangular filter, and are the lower limit and upper limit of the lth filter, respectively. In our experiment, the number of filters in the filter bank; L, was set to 20.

The given frequency f in hertz can be converted to the Mel scale as follows

$$\mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$

Figure 2: Block diagram of MFCC feature extraction [27].

Eventually, the MFCC coefficients are obtained by applying the Discrete Cosine Transform (DCT) to X[l]

where m is the index of obtained MFCC components and M is the number of MFCC features, which was set to 12. The steps for extracting the MFCC features are depicted in Fig. 2.
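The extraction steps of this subsection can be sketched in Python with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the pre-emphasis coefficient `alpha = 0.97`, the non-overlapping frames, the zero-padded FFT length `n_fft = 256` (chosen so that the 20 filters do not collapse onto the same bins), the common Mel formula 2595 log10(1 + f/700), and all function names are assumptions. Only the 2 kHz sampling rate, the 25 ms frames, L = 20 filters, and M = 12 coefficients come from the text.

```python
import numpy as np

def hz_to_mel(f):
    # Common Mel-scale formula (assumed form).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    # Triangular filters spaced evenly on the Mel scale between 0 and fs/2.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for l in range(n_filters):
        lo, mid, hi = bins[l], bins[l + 1], bins[l + 2]
        for k in range(lo, mid):          # rising edge of the lth triangle
            fb[l, k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):          # falling edge of the lth triangle
            fb[l, k] = (hi - k) / (hi - mid)
    return fb

def mfcc(signal, fs=2000, frame_ms=25, n_filters=20, n_ceps=12,
         n_fft=256, alpha=0.97):
    # Pre-emphasis: first-order high-pass filter (alpha is an assumed value).
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Framing: 25 ms at 2 kHz gives 50 samples per frame (no overlap assumed).
    frame_len = int(fs * frame_ms / 1000)
    n_frames = len(emphasized) // frame_len
    frames = emphasized[: n_frames * frame_len].reshape(n_frames, frame_len)
    # Hamming window, then magnitude spectrum of each frame.
    frames = frames * np.hamming(frame_len)
    spectrum = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    # Log energy in each Mel filter (small constant avoids log(0)).
    log_energy = np.log(spectrum @ mel_filterbank(n_filters, n_fft, fs).T + 1e-10)
    # DCT over the filter index l = 1..L, keeping coefficients m = 1..M.
    l = np.arange(1, n_filters + 1)
    m = np.arange(1, n_ceps + 1)[:, None]
    dct = np.cos(np.pi * m * (l - 0.5) / n_filters)
    return log_energy @ dct.T             # shape: (n_frames, n_ceps)
```

For a one-second recording at 2 kHz, `mfcc(x)` returns a 40 x 12 matrix, one 12-dimensional MFCC vector per frame.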

3.2. i-Vector

The aim of this paper is to propose a framework for the heart sound classification using i-vector which was first proposed for speaker recognition application [26] and later was adopted in other applications such as language identification [27], emotion recognition [36], music genre classification [37], and online signature verification [38], etc. i-vector can be considered as a technique to map a sequence of feature vectors for a given sample into a low-dimensional vector space, referred to as the total variability space, based on a factor analysis technique. In other words, it is a technique to extract a compact fixed-length representation given a sequence of feature vectors with arbitrary length. Then, the extracted compact feature vector can be either used for vector distance-based similarity measuring or as input to any further feature transform or modelling.

The i-vector is extracted from a heart sound record in a fixed sequence of steps. First, MFCC feature vectors are extracted from the input signal; then the Baum–Welch statistics are extracted from the features; and finally the i-vector is computed from these statistics. In the following subsections, we go through these steps in detail.

3.2.1. Universal Background Model Training

The first step in implementing the i-vector extraction pipeline is to create a global model, called the UBM, which is used to map the features to a high-dimensional space to give a better representation. Gaussian mixture models (GMMs) have been frequently used for building a UBM, especially in the text-independent speaker verification task [26, 39]. A GMM estimates the distribution of the extracted MFCC features using a finite number of Gaussian distributions. Here, the GMM is trained on MFCC features from all heart sound records in the training set, which is supposed to be large enough to cover the whole feature space.

3.2.2. Extraction of Baum–Welch Statistics

Here, for each MFCC feature sequence, the zero and first-order Baum–Welch statistics are extracted using the UBM, which is modeled by a GMM [40, 41].

Suppose as the whole feature vectors collected to train ith heart sound; then the zero and first-order statistics for the cth component of UBM named and are calculated as follows:

where is the tth MFCC feature vector for heart sound ith, indicates mean of cth component, and
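As a rough illustration of how such statistics are commonly computed, the sketch below accumulates the component posteriors (zero-order) and the posterior-weighted, mean-centered features (first-order) for one record. It assumes a diagonal-covariance UBM and centered first-order statistics; the function and argument names are hypothetical.

```python
import numpy as np

def baum_welch_stats(feats, weights, means, variances):
    """feats: (T, D) MFCC frames of one record.
    weights: (C,), means: (C, D), variances: (C, D) of a diagonal UBM."""
    # Log-density of every frame under every Gaussian component.
    diff = feats[:, None, :] - means[None, :, :]                  # (T, C, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2 * np.pi * variances), axis=1))
    # Posterior probability of component c given frame t (softmax over c).
    log_post = np.log(weights) + log_gauss                        # (T, C)
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)
    # Zero-order statistics: soft frame counts per component.
    N = post.sum(axis=0)                                          # (C,)
    # First-order statistics, centered on the UBM means.
    F = post.T @ feats - N[:, None] * means                       # (C, D)
    return N, F
```

Because the posteriors of each frame sum to one, the zero-order statistics of a record sum to its number of frames.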

3.2.3. i-vector Extraction

Suppose M is a mean-supervector which represents the feature vectors of a heart sound record. The supervector of each record is a DC-dimensional vector obtained by concatenating the D-dimensional mean vectors of its GMM. The GMM for each record is obtained by MAP adaptation. The supervector of a record is modelled as follows [26]:

$$M = m + Tw \qquad (10)$$

where m is an independent mean-supervector extracted from the UBM, T is a low-rank matrix, and w represents a random latent variable with a standard normal distribution for the record. M is assumed to have a Gaussian distribution with mean m and covariance matrix $TT^t$, where $T^t$ denotes the transpose of T. The i-vector is the MAP point estimate of the variable w, which is equal to the mean of the posterior probability of w given the record.

In Eq. 10, the parameters m and T should be estimated. The mean-supervector m is obtained by concatenating the means of the UBM components [41]. To obtain T, expectation maximization (EM) is applied. Assume the UBM has C components (in this work, C is set to 2048), and the dimension of the feature vectors is D,

the matrix is described as

where Σis the covariance matrix of the component of of the UBM. Let be all feature vectors of

record and Σ) indicates the likelihood of computed with the GMM specified by the supervector and the super-covariance matrix Σ, then the EM optimization can be performed by iterating the following two steps. First, the current value of matrix T is used to estimate the vector that maximize the likelihood as follows:

Then, T is updated by maximizing the following relation:

By taking the logarithm of Eq. 13, log-likelihood of each record can be computed as:

where c iterates over all components of the UBM, t iterates over all feature vectors, and $T_c$ is the submatrix of T related to the cth component. Given the zero and first-order statistics calculated by Eq. 7 and Eq. 8, respectively, the posterior covariance matrix $\mathrm{Cov}(w, w)$, mean $E[w]$, and second moment $E[ww^t]$ are computed for w as:

Ultimately, by maximizing Eq. 13, the updated value of T can be calculated as

As said before, the i-vector is the mean of the posterior probability of w given the input record, where w is a random hidden variable with a standard normal distribution. To extract the i-vector, the MAP point for w is estimated; its formula is given in Eq. 16.
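A minimal sketch of this final step is given below, using the closed-form posterior mean that is standard in the i-vector literature, w = (I + T^t Σ^{-1} N T)^{-1} T^t Σ^{-1} F. It assumes a diagonal super-covariance, centered first-order statistics, and rows of T ordered component by component; the function name and argument layout are hypothetical, and training of T itself is not shown.

```python
import numpy as np

def extract_ivector(N, F, T, sigma):
    """N: (C,) zero-order stats; F: (C, D) centered first-order stats;
    T: (C*D, R) total variability matrix; sigma: (C*D,) diagonal of the
    super-covariance matrix (diagonal covariance is an assumption)."""
    C, D = F.shape
    R = T.shape[1]
    # Expand N_c so that each component count covers its D dimensions.
    n_expanded = np.repeat(N, D)                       # (C*D,)
    T_scaled = T / sigma[:, None]                      # Sigma^{-1} T
    # Posterior precision: I + T^t Sigma^{-1} N T
    precision = np.eye(R) + T_scaled.T @ (n_expanded[:, None] * T)
    # Posterior mean (the i-vector): precision^{-1} T^t Sigma^{-1} F
    return np.linalg.solve(precision, T_scaled.T @ F.reshape(-1))
```

Solving the linear system instead of inverting the precision matrix explicitly is a numerical convenience; both give the same point estimate.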

3.3. Techniques for reducing the feature dimension and the effects of intra-class variations

There are several techniques for reducing the feature dimension and the effects of intra-class variations. In the i-vector based applications, various techniques such as nuisance attribute projection (NAP) [26, 41, 42, 43], within-class covariance normalization (WCCN) [26, 44, 45], principal component analysis (PCA) [45], and linear discriminant analysis (LDA) [46] are extensively employed. In this work, PCA and a new emerging technique called Variational Autoencoders (VAE) [47] are employed which will be explained in the following subsections.

Figure 3: Block diagram of VAE.

3.3.1. Principal Component Analysis

In this method, important information is extracted from the data as new orthogonal variables, which are referred to as the principal components [48]. To achieve this objective, assume a given zero-mean data matrix $X_{n \times p}$, where n and p indicate the number of feature vectors and the feature size, respectively. To define the PCA transformation, consider a row vector $x_{(i)}$ of X, which is mapped by a set of p-dimensional weight vectors $w_{(k)} = (w_1, \ldots, w_p)_{(k)}$ to a new vector of principal components $t_{(i)} = (t_1, \ldots, t_l)_{(i)}$, as follows

$$t_{k(i)} = x_{(i)} \cdot w_{(k)}, \qquad i = 1, \ldots, n, \quad k = 1, \ldots, l$$

where l is the number of retained components and the vector t (consisting of $t_1, \ldots, t_l$) inherits the maximum variance from x, with each weight vector w constrained to be a unit vector [49].
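The projection can be illustrated with a short NumPy sketch that obtains the unit-norm weight vectors from a singular value decomposition of the centered data. The function name and the choice of SVD (rather than an eigendecomposition of the covariance matrix) are assumptions made for illustration.

```python
import numpy as np

def pca_fit_transform(X, n_components):
    """X: (n, p) matrix of i-vectors, one per row."""
    mean = X.mean(axis=0)
    Xc = X - mean                              # enforce the zero-mean assumption
    # Rows of Vt are orthonormal directions sorted by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:n_components].T                    # (p, n_components) weight vectors
    return Xc @ W, W, mean                     # principal component scores first
```

New i-vectors are reduced with the same transform by computing `(x - mean) @ W`.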

3.3.2. Variational Autoencoder

The VAE, a generative model, is one of the most prominent approaches for extracting valuable information. This model attempts to reconstruct its input. Consider x as the input to a VAE, which encodes the input into latent variables z and then produces a reconstruction of the input from these latent variables. To this end, the training process aims to minimize a cost function (the Mean Square Error (MSE) between input and output). In the optimal situation, the input and output are the same. The architecture of a VAE comprises an odd number of hidden layers [47] with d nodes. The weights are shared between the top and bottom layers, which both have D nodes. A schematic of a VAE is depicted in Fig. 3. As shown in Fig. 3, the encoded variable z can be used as an enhanced feature that better describes the input x. To obtain the vector z, a probability function on x, called p(x), is defined, and its log-likelihood log p(x) is maximized [50]. In what follows, $E_{q(z|x)}$ denotes the expectation over the random variable z under the distribution q(z|x). Since we have no information about p(z|x), an approximation of p(z|x), called q(z|x), is computed. Thus, based on Bayes' rule we have [50]

Here, we multiply and divide the term by q(z|x) as an approximation for p(z|x)

So, it can be concluded that

Finally

The term B is intractable and non-negative. As a result, the term A, which is a tractable lower bound on log p(x), is maximized instead. The log-likelihood measure is a good indicator of how well samples from q(z|x) can describe the data x.
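For reference, the standard form of this derivation is sketched below. It is the textbook evidence lower bound decomposition rather than a recovery of the original equations, and the assignment of the labels A and B to the two groups of terms is an assumption based on the description above.

```latex
\begin{align}
\log p(x)
  &= E_{q(z|x)}\left[\log p(x)\right]
   = E_{q(z|x)}\left[\log \frac{p(x|z)\,p(z)}{p(z|x)}\right]
     && \text{(Bayes' rule)} \\
  &= E_{q(z|x)}\left[\log \frac{p(x|z)\,p(z)}{p(z|x)} \cdot \frac{q(z|x)}{q(z|x)}\right]
     && \text{(multiply and divide by } q(z|x)\text{)} \\
  &= E_{q(z|x)}\left[\log p(x|z)\right]
     - E_{q(z|x)}\left[\log \frac{q(z|x)}{p(z)}\right]
     + E_{q(z|x)}\left[\log \frac{q(z|x)}{p(z|x)}\right] \\
  &= \underbrace{E_{q(z|x)}\left[\log p(x|z)\right]
     - \mathrm{KL}\left(q(z|x)\,\|\,p(z)\right)}_{A}
     + \underbrace{\mathrm{KL}\left(q(z|x)\,\|\,p(z|x)\right)}_{B \,\geq\, 0}
\end{align}
```

Since B is a Kullback-Leibler divergence, it cannot be negative, so A never exceeds log p(x); this is why A can serve as a training objective even though B itself cannot be evaluated.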

It is noteworthy that VAEs are a good solution for different problems such as missing data imputation and so forth [47].

3.4. Gaussian Mixture Models

In this study, GMMs are used as a classifier for the features extracted from heart sound records. GMMs are probabilistic models suitable for general distributions consisting of sub-populations [51]. GMMs use an iterative process to determine which subpopulation each data point belongs to, without any knowledge of the data point labels. Hence, GMMs are considered unsupervised learning models.

The GMM is described by two types of parameters: the weights of the Gaussian mixture components, and the means and covariances of the components. The Probability Density Function (PDF) of a K-component GMM, with mean $\mu_k$ and covariance matrix $\Sigma_k$ for the kth component, is defined as

$$p(x) = \sum_{k=1}^{K} w_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

where x is a feature vector and $w_k$ is the weight of the kth mixture component, with $\sum_{k=1}^{K} w_k = 1$.

If the number of components is fixed, Expectation Maximization (EM) is the method most often used to estimate the parameters of the mixture model. In frequentist probability theory, models are usually learned using maximum likelihood estimation, which seeks the model parameters that maximize the probability, or likelihood, of the observed data [52]. EM is a numerical method for maximum likelihood estimation. It is an iterative algorithm with the property that the likelihood of the data never decreases from one iteration to the next, which means that it converges to the maximum or to a local maximum [52].

Maximum likelihood estimation of a Gaussian mixture model alternates between two steps. The first step, known as “expectation”, computes for each data point the expected assignment to each component, given the current model parameters (weights $w_k$, means $\mu_k$, and covariances $\Sigma_k$). The second step, known as “maximization”, maximizes the expectation calculated in the previous step with respect to the model parameters; this step updates the values of $w_k$, $\mu_k$, and $\Sigma_k$. The entire process is repeated until the algorithm converges, giving a maximum likelihood estimate. More details are available in [52].
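The two steps can be made concrete with a small EM loop for a one-dimensional GMM. This is only an illustrative sketch: the one-dimensional setting, the random initialization from data points, the fixed iteration count, and the function name are all assumptions, and a practical implementation would add a variance floor and a convergence test.

```python
import numpy as np

def em_gmm_1d(x, k, n_iter=50, seed=0):
    """Fit a k-component 1-D GMM to samples x with plain EM."""
    rng = np.random.default_rng(seed)
    w = np.full(k, 1.0 / k)                        # mixture weights
    mu = rng.choice(x, size=k, replace=False)      # means start at data points
    var = np.full(k, np.var(x))                    # variances start at data variance
    log_liks = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        total = dens.sum(axis=1, keepdims=True)
        resp = dens / total
        log_liks.append(float(np.log(total).sum()))
        # M-step: re-estimate weights, means, and variances.
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var, log_liks
```

The returned `log_liks` list records the data log-likelihood at each iteration, which makes the non-decreasing behavior of EM easy to inspect.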

3.5. Support Vector Machine

We also employed SVM as a second classifier to compare the obtained results with those of the GMM. Thus, a brief overview of this classifier is presented below. In this method, a hyperplane is used to discriminate between samples by solving a two-class classification problem:

$$y(x) = w^T \phi(x) + b \qquad (26)$$

where w is an unknown weight vector to learn, $\phi(x)$ denotes a fixed feature-space transformation, and b is the bias parameter. Consider N training data $x_1, \ldots, x_N$ with target values $t_n \in \{-1, 1\}$. Each data point is classified based on the sign of y(x). Therefore, both $t_n$ and $y(x_n)$ should have the same sign, and $t_n y(x_n) > 0$. There will be multiple solutions for each SVM problem, but the one with the smallest generalization error is desirable. “Margin” is the term for the smallest distance between the decision boundary and any of the samples. The solution with the maximum margin is chosen as the best solution. Considering the margin definition, the distance from a point $x_n$ to the decision surface is calculated by:

$$\frac{t_n y(x_n)}{\|w\|} = \frac{t_n \left(w^T \phi(x_n) + b\right)}{\|w\|}$$

We seek to optimize the parameters w and b to maximize the distance. This can be achieved by:

$$\arg\max_{w, b} \left\{ \frac{1}{\|w\|} \min_n \left[ t_n \left(w^T \phi(x_n) + b\right) \right] \right\} \qquad (28)$$

This problem can be converted to a less complex problem for easier solving. Since the scale has no effect on the solution, there is freedom to consider the relation in Eq. 28 as follows:

$$t_n \left(w^T \phi(x_n) + b\right) \geq 1, \qquad n = 1, \ldots, N$$

The optimization problem in Eq. 28 then requires maximizing $\|w\|^{-1}$, which is equivalent to minimizing $\|w\|^2$. So we have to solve the following optimization problem:

$$\arg\min_{w, b} \frac{1}{2} \|w\|^2 \qquad (30)$$

To solve Eq. 30, a Lagrange multiplier $a_n \geq 0$ is introduced for each constraint. So the Lagrange equation would be:

$$L(w, b, a) = \frac{1}{2} \|w\|^2 - \sum_{n=1}^{N} a_n \left\{ t_n \left(w^T \phi(x_n) + b\right) - 1 \right\}$$

Then the derivatives of L(w, b, a) with respect to w and b are set to zero, which gives:

$$w = \sum_{n=1}^{N} a_n t_n \phi(x_n) \qquad (32)$$

$$0 = \sum_{n=1}^{N} a_n t_n$$

Considering Eq. 32, Eq. 26 can be rewritten as follows:

$$y(x) = \sum_{n=1}^{N} a_n t_n \phi(x_n)^T \phi(x) + b \qquad (34)$$

The form of Eq. 34 allows the model to be reformulated with kernels. Thus, a solution for problems with an infinite-dimensional feature space can be obtained by:

$$y(x) = \sum_{n=1}^{N} a_n t_n k(x, x_n) + b$$

where $k(x, x') = \phi(x)^T \phi(x')$ is a positive definite kernel.

For this work, the Radial Basis Function (RBF) was used as the kernel for SVM. The formulation of this kernel is given below:

$$k(x, x') = \exp\left(-\gamma \|x - x'\|^2\right)$$
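As an illustration, taking the RBF kernel in its common form k(x, x') = exp(-γ ||x - x'||²), the kernelized decision function y(x) = Σ a_n t_n k(x, x_n) + b can be evaluated as sketched below. The γ parameterization, the function names, and the hand-picked multipliers in the usage note are assumptions; solving the dual problem for the multipliers is not shown.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Pairwise RBF kernel between rows of A (n, d) and rows of B (m, d)."""
    sq_dist = (np.sum(A ** 2, axis=1)[:, None]
               + np.sum(B ** 2, axis=1)[None, :]
               - 2.0 * A @ B.T)
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def svm_decision(x_new, support_x, support_t, a, b, gamma):
    """Evaluate y(x) = sum_n a_n t_n k(x, x_n) + b for each row of x_new."""
    return rbf_kernel(x_new, support_x, gamma) @ (a * support_t) + b
```

For example, with two support vectors at (0, 0) and (2, 2), targets -1 and +1, multipliers a = (1, 1), b = 0, and γ = 0.5, the decision value is negative at the first point and positive at the second, so the sign of y(x) separates the two classes.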

4. Experimental Setup

4.1. Dataset

The 2016 Physionet/CinC challenge was introduced to provide a standard database containing normal and abnormal heart sounds [1]. The dataset presented in this challenge is a set of heart sound recordings of subjects/patients, collected under a variety of environmental conditions (including noisy conditions with low signal quality), as described in [1]. Therefore, the majority of heart sounds were affected by different noises during recording, such as speech, stethoscope motion, breathing and intestinal activity [1]. These noises complicate the classification of normal and abnormal heart sounds. Accordingly, the organizers allowed the participants to classify some of the recordings as “unsure” [1], which indicates the difficulty level of the challenge. This dataset consists of three subsets: training, validation, and test. For training purposes, six labeled databases (names with the prefix a to f) contain 3153 sound recordings from 764 subjects/patients, with durations of 5-120 s.

The validation subset comprises 150 normal and 151 abnormal heart sounds (with file names prefixed alphabetically, a through e), and the test data includes 1277 heart sound trials from 308 subjects/patients. It should be noted that 301 recordings selected from the training set were used as a test set for validation.

The Challenge test set consisted of six databases labeled from b to e, g, and i with 1277 heart sound recordings from 308 participants. The statistics of each subset are summarized and illustrated in Table 2. More details about the dataset and the 2016 Physionet/CinC challenge can be found in [1].

In this study, the results are reported on the publicly available part of the Physionet/CinC 2016 dataset. It is worth mentioning that the training set is divided into two parts in five phases. In each phase, we randomly assigned 80% of the training set as our training set, and the remaining 20% was assigned as our validation set, which is used for tuning the parameters. In addition, we used the Physionet/CinC 2016 validation set as our test set. Details of the dataset are presented in Table 2.
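The five random 80/20 partitions described here could be generated as in the following sketch. The function name, the fixed seed, and the use of record indices rather than file names are assumptions; the source does not say whether the splits were stratified by class or by subject, so neither is done here.

```python
import random

def repeated_holdout(n_records, n_phases=5, train_frac=0.8, seed=0):
    """Return n_phases independent random (train, validation) index splits."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_phases):
        idx = list(range(n_records))
        rng.shuffle(idx)                       # new random order in each phase
        cut = int(train_frac * n_records)      # 80% for training
        splits.append((idx[:cut], idx[cut:]))  # remaining 20% for validation
    return splits
```

For example, `repeated_holdout(1000)` yields five splits of 800 training and 200 validation indices each.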

Table 2: Statistics of the 2016 Physionet/CinC dataset [1].

4.2. Evaluation Metrics

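In this task, evaluation is based on the Modified Accuracy (MAcc) introduced by the Physionet 2016 challenge. For MAcc computation, data is categorized into three categories: normal, abnormal, or unsure, with two references in each category. The modified sensitivity (Se) and specificity (Sp) can be computed according to:

$$Se = \frac{wa_1 \cdot Aa_1}{Aa_1 + Aq_1 + An_1} + \frac{wa_2 \cdot (Aa_2 + Aq_2)}{Aa_2 + Aq_2 + An_2}$$

$$Sp = \frac{wn_1 \cdot Nn_1}{Na_1 + Nq_1 + Nn_1} + \frac{wn_2 \cdot (Nn_2 + Nq_2)}{Na_2 + Nq_2 + Nn_2}$$

where $wa_1$ and $wa_2$ are the percentages of the abnormal recordings with good and poor signal quality, respectively, and $wn_1$ and $wn_2$ are those of the normal recordings with good and poor signal quality, respectively. In each count, the first letter gives the reference label (A: abnormal, N: normal), the second letter gives the classifier output (a: abnormal, q: unsure, n: normal), and the subscript gives the signal quality (1: good, 2: poor). For all 3153 training set recordings, the weight parameters $wa_1$, $wa_2$, $wn_1$ and $wn_2$ are equal to 0.8602, 0.1398, 0.9252 and 0.0748, respectively. These parameters were also calculated for the validation set and were 0.7881, 0.2119, 0.9467 and 0.0533, respectively. The “Score” for this challenge is computed using the following equation

$$MAcc = \frac{Se + Sp}{2}$$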
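The score, taken here as MAcc = (Se + Sp) / 2, can be computed with a few lines of Python. The sketch follows the challenge scoring rule as it is commonly stated; the count names (for example `Aa1` for abnormal reference, abnormal output, good quality) and the function name are assumptions used for illustration.

```python
def modified_accuracy(counts, wa1, wa2, wn1, wn2):
    """counts: dict of confusion counts keyed by reference label (A/N),
    classifier output (a/q/n), and signal quality (1 good, 2 poor)."""
    c = counts
    # Modified sensitivity: 'unsure' counts as correct only for poor quality.
    se = (wa1 * c["Aa1"] / (c["Aa1"] + c["Aq1"] + c["An1"])
          + wa2 * (c["Aa2"] + c["Aq2"]) / (c["Aa2"] + c["Aq2"] + c["An2"]))
    # Modified specificity, built the same way for normal references.
    sp = (wn1 * c["Nn1"] / (c["Na1"] + c["Nq1"] + c["Nn1"])
          + wn2 * (c["Nn2"] + c["Nq2"]) / (c["Na2"] + c["Nq2"] + c["Nn2"]))
    return se, sp, (se + sp) / 2.0
```

With the training-set weights quoted above, a classifier that makes no errors obtains Se = Sp = MAcc = 1, since each pair of weights sums to one.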

4.3. Scoring and Decision Making

To assign a score to a given heart sound with the GMM classifier, we proceed as follows. First, an i-vector is extracted from each recording in the training set and is projected to the new space using PCA or VAE. Afterwards, the projected vectors are used to train two GMMs (one for normal heart sounds and the other for abnormal heart sounds) with different numbers of components by EM iterations. In the next step, the score for each trial is obtained by computing the log-likelihood ratio:

$$\mathrm{score}(S) = \log p(S \mid \lambda_{normal}) - \log p(S \mid \lambda_{abnormal})$$

where $S$ is the i-vector corresponding to the test record, while $\lambda_{normal}$ and $\lambda_{abnormal}$ denote the GMMs for normal and abnormal heart sounds, respectively. Once the score is found, a simple global threshold is applied to it to make the final normal/abnormal decision. If the score is higher than the threshold, the test heart sound is labeled as normal; otherwise, it is labeled as abnormal. In this study, the global threshold was varied to plot the detection error trade-off (DET) and detection accuracy trade-off (DAT) curves.
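The scoring and decision rule can be sketched as below. This is a minimal illustration and not the authors' implementation: it assumes diagonal-covariance GMMs whose parameters have already been fitted, and the function names, the toy parameters and the default threshold of 0 are all hypothetical.

```python
import math


def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of vector x under a diagonal-covariance GMM (assumed form)."""
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        log_density = 0.0
        for xi, mi, vi in zip(x, mu, var):
            log_density += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
        component_logs.append(math.log(w) + log_density)
    peak = max(component_logs)  # log-sum-exp trick for numerical stability
    return peak + math.log(sum(math.exp(c - peak) for c in component_logs))


def llr_score(x, gmm_normal, gmm_abnormal):
    """Log-likelihood ratio: normal model over abnormal model."""
    return gmm_log_likelihood(x, *gmm_normal) - gmm_log_likelihood(x, *gmm_abnormal)


def decide(score, threshold=0.0):
    """Label the recording as normal when the score exceeds the global threshold."""
    return "normal" if score > threshold else "abnormal"


# Toy 2-D models given as (weights, means, variances); purely illustrative values.
toy_normal = ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]])
toy_abnormal = ([1.0], [[4.0, 4.0]], [[1.0, 1.0]])

label_near_normal = decide(llr_score([0.5, 0.5], toy_normal, toy_abnormal))
label_near_abnormal = decide(llr_score([4.0, 4.0], toy_normal, toy_abnormal))
```

Sweeping `threshold` over the range of observed scores and recording the error rates at each value is one way to trace trade-off curves such as DET.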

5. Experimental Results

In this section, we first briefly introduce the baseline system and then, to verify the performance of the proposed framework, report various experiments on the Physionet 2016 dataset. In Section 5.2, we investigate the effect of the number of GMM components and the i-vector dimensionality. Section 5.3 examines the effect of the i-vector dimension on performance, and Section 5.4 examines the effect of the training set size.

5.1. Baseline System

In this study, the approach proposed in [53] is taken as the baseline system. The baseline uses the Physionet 2016 dataset in the same manner as the proposed system. It is based on Mel-Spectrogram, MFCC and sub-band envelope features and different configurations of a CNN classifier. Accordingly, 103228 frames were extracted from the Physionet 2016 dataset. The authors repeated their experiments five times and reported the average of the obtained results. The resulting sensitivity, specificity and mean accuracy were 0.845, 0.785 and 0.815, respectively.

5.2. Effects of the Number of the GMM Components and i-vectors Dimensionality

The first part of our experiments investigated the effect of the number of GMM components, the effect of the i-vector dimension without applying VAE or PCA, and finally the effect of i-vector dimension reduction through PCA and VAE on the proposed system. Fig. 4 and Fig. 5 show the MAcc on the test set for these approaches. It is worth mentioning that we did not label any data as “unsure”; the label “normal” or “abnormal” is assigned to all test data.

Figure 4: MAcc comparison based on different GMM components, raw i-vector dimension and dimension of i-vector using “PCA” for the proposed method. Here the acronym “W.A” means “Without Applying” PCA.

Figure 5: MAcc comparison based on different GMM components, raw i-vector dimension and dimension of i-vector using “VAE” for the proposed method. Here the acronym “W.A” means “Without Applying” VAE.

In Fig. 4 and Fig. 5, the number of components used in the GMMs is specified separately in each plot. Both figures show that the i-vector generally performs better after the application of VAE or PCA. The best results are achieved with higher i-vector dimensions and after the application of VAE. Fig. 4 shows the results for the i-vector and its PCA projection; the values obtained with PCA are not as good as those obtained with VAE.

Discussion: The higher performance of VAE is due to the fact that it minimizes a cost function defined as the MSE between the input features and the output (reconstructed) features. PCA merely seeks to extract important information, whereas VAE attempts to extract features capable of reproducing the original data. As a result, VAE extracts information that can reproduce the original data as closely as possible, which explains its higher MAcc. On the other hand, increasing the dimension of the raw i-vector might add useless, sparse features to the feature vector, which leads to classification errors and reduced accuracy. Generally, the best MAcc values are obtained by GMMs trained with 128 components. In the proposed system, the GMMs are not well trained with 64 components. Conversely, using 256 components causes over-fitting because of the limited amount of training data.

5.3. Effects of i-vector Dimension on Performance

In Fig. 6, the red point-line represents the best values achieved by different dimensions of i-vectors without applying PCA or VAE. The blue and green point-lines represent the best values obtained for different i-vector dimensions after applying PCA and VAE, respectively. According to Fig. 6, after employing VAE or PCA, the MAcc values increase consistently with the dimension. However, this pattern does not hold for raw i-vectors, whose MAcc results vary.

Discussion: A higher-dimensional i-vector includes more detailed information. On the other hand, this information may include useless details and common information. Therefore, PCA and VAE are adopted to make this information more effective. Applying PCA and VAE can significantly improve the results compared with using raw i-vectors. In addition, it can be seen that although GMM works better for higher dimensions, SVM has a better improvement rate than GMM. This is because SVM works through feature space transformation; therefore, a change in the dimension of the features affects it more than GMM, which is a more data-driven classifier. It will be shown in Section 5.4 that GMM displays a better improvement rate when the amount of data is increased, while SVM has a smooth improvement curve.

Figure 6: DAT curve comparison for the raw i-vector and its PCA and VAE projections. In each case, results are reported using the best parameter configuration.

5.4. Effect of Training the System using Different Size of Training Set

This section evaluates the effect of different training set sizes on the proposed method. To this end, the training data was randomly divided into 5 folds (each fold includes 20% of the training set). The training set was then enlarged fold by fold, and the impact on MAcc was observed. Table 3 shows the influence of the training set size on our system, with the number of GMM components fixed to the value that gave the best results in the first part of our experiments. The reported values in this table are the best results obtained over the different sizes of raw i-vectors and after applying PCA and VAE to them (in each case, results are reported using the best parameter configuration).
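The fold-by-fold growth of the training set can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the seed and the round-robin fold assignment are hypothetical choices, not details given in the paper.

```python
import random


def cumulative_training_sets(n_recordings, n_folds=5, seed=0):
    """Return nested training subsets covering 1, 2, ..., n_folds folds.

    The recordings are shuffled once, split into folds of (almost) equal
    size, and the folds are then accumulated one at a time.
    """
    rng = random.Random(seed)  # seed is an assumption, for reproducibility
    indices = list(range(n_recordings))
    rng.shuffle(indices)
    folds = [indices[k::n_folds] for k in range(n_folds)]
    subsets, current = [], []
    for fold in folds:
        current = current + fold
        subsets.append(sorted(current))
    return subsets


# Toy example with 100 recordings: subsets of 20%, 40%, 60%, 80% and 100%.
subsets = cumulative_training_sets(100)
```

Each subset would be used to retrain the system from scratch, so that the MAcc values obtained at 20%, 40%, 60%, 80% and 100% of the data can be compared directly.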

As summarized in Table 3, classification performance improved as the amount of training data increased. The results suggest that increasing the training data beyond 80% yields less improvement than increases at smaller training set sizes. According to Table 3, the performance of the proposed system is similar to that of the baseline system when only 60% of the training set is used for training. In addition, SVM performs better than GMM for smaller amounts of data. However, with more training data, the GMM system works better than SVM. Fig. 7 depicts the impact of varying training set size on the MAcc values of the proposed system.

Table 3: The effect of using different sizes of the training set on the performance of the proposed system with GMM and SVM-RBF.

Discussion: As shown in these figures, the MAcc improves as the training set size gradually increases. The number of samples is crucial for improving the results, since it improves generalization and helps the system adapt to new samples. Nevertheless, the main point of this discussion is the comparison of the three approaches we used to examine whether feature reduction is beneficial. First, as can be seen, using a larger dataset with the raw i-vector yields a lower improvement than with PCA. Dimension reduction yields better performance than the raw i-vector for both large and small datasets. The most important point is that VAE has the best performance. The main reason is that VAE, as a Deep Neural Network (DNN), requires more data to generalize, so its results improve as the amount of data increases. Hence, it yields the best results among all approaches.

It is also worth noting that GMM has a better improvement rate than SVM. The explanation lies in the mechanism of each classifier. GMM uses the data distribution to perform classification, while SVM depends mostly on the feature space rather than the data. Thus, increasing the data yields a more accurate estimate of the data distribution and consequently a larger benefit for GMM. In contrast, the performance of SVM improves smoothly and slowly, because adding data to SVM results in a better transformation through the kernel function, but not necessarily a better feature space.

Figure 7: DAT curve comparison for raw i-vector and its PCA and VAE for using different size of training set. In each case, results are reported using the best parameters configuration.

5.5. Best Results Analysis

Table 4 presents the results obtained by the baseline system and the best results obtained by our proposed systems in this study. The best MAcc, 97.34%, is achieved by our proposed system when i-vector, VAE and GMM are employed. This improves on the accuracy of the baseline system (81.50%) by 15.84 percentage points.
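The reported improvement can be checked from the numbers given in the text. The short sketch below recomputes the baseline mean accuracy from the sensitivity and specificity quoted in Section 5.1 and shows that the 15.84 figure is an absolute difference; the relative gain is computed only for comparison and is not a figure reported by the authors.

```python
# Values quoted in the text.
baseline_se, baseline_sp = 0.845, 0.785
proposed_macc = 0.9734

# Mean of sensitivity and specificity for the baseline system.
baseline_macc = (baseline_se + baseline_sp) / 2

# Absolute gain in percentage points and relative gain in percent.
absolute_gain_points = (proposed_macc - baseline_macc) * 100
relative_gain_percent = (proposed_macc - baseline_macc) / baseline_macc * 100
```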

Table 4: Best results for our proposed approaches and the baseline system.

Discussion: In the baseline system presented in [53], the extracted features are mostly frequency and sub-band features, such as MFCC and Mel-Spectrogram. These features are suitable for robust speech or sound detection. However, in applications such as heart sound classification, it is essential to extract features that capture identity. This is because heart sound has specific characteristics that are unique to every individual. As a result, i-vectors can be better features for heart sound identification, and they can reduce classification error and improve accuracy more than approaches based on robust feature extraction. In addition, GMM is superior to SVM here, since GMM describes the samples well without changing the feature space. SVM, on the other hand, uses a kernel to map the current feature space to a different one, which can cause a problem: the task may already be solvable in the current feature space, and changing the feature space can increase complexity.

6. Conclusions

This study proposes a novel method for automatic heart sound classification based on embedding MFCC features into i-vectors. In this method, MFCC features are extracted from heart sounds, and i-vectors are then obtained from these features. The resulting i-vectors represent the characteristics of each participant's heart sound, given that a heart sound is unique to each individual. The method is based on fixed-size i-vectors and is therefore insensitive to the length of the input sounds. In addition, the i-vector is a more suitable feature for describing the characteristics of a heart sound than variable-length features, since the whole sound is considered when producing it. The i-vectors are fed to PCA or VAE in order to obtain a more discriminative representation. Finally, these features are given to GMM and SVM classifiers for final labeling. The experiments on a public dataset demonstrate the effectiveness of the proposed method. The combination of MFCC and i-vector is stable and captures the key features needed to discriminate accurately between the two classes. The proposed method also works well with a limited amount of data. In conclusion, the proposed method outperforms the state-of-the-art baseline.

7. Acknowledgment

We thank Mr. Mohammad Elmi and Mr. Majid Osati for comments that greatly improved the manuscript.

References

[1] C. Liu, D. Springer, Q. Li, B. Moody, R. A. Juan, F. J. Chorro, F. Castells, J. M. Roig, I. Silva, A. E. Johnson, et al., An open access database for the evaluation of heart sound algorithms, Physiological Measurement 37 (12) (2016) 2181.

[2] S. Sun, Z. Jiang, H. Wang, Y. Fang, Automatic moment segmentation and peak detection analysis of heart sound pattern via short-time modified hilbert transform, Computer methods and programs in biomedicine 114 (3) (2014) 219–230.

[3] Z. Yan, Z. Jiang, A. Miyamoto, Y. Wei, The moment segmentation analysis of heart sound pattern, Computer methods and programs in biomedicine 98 (2) (2010) 140–150.

[4] P. Sedighian, A. W. Subudhi, F. Scalzo, S. Asgari, Pediatric heart sound segmentation using hidden markov model, in: Engineering in Medicine and Biology Society (EMBC), 2014 36th Annual International Conference of the IEEE, IEEE, 2014, pp. 5490–5493.

[5] P. Bentley, G. Nordehn, M. Coimbra, S. Mannor, R. Getz, The pascal classifying heart sounds challenge 2011 (chsc2011) results, See http://www.peterjbentley.com/heartchallenge/index.html.

[6] S. Ari, K. Hembram, G. Saha, Detection of cardiac abnormality from pcg signal using lms based least square svm classifier, Expert Systems with Applications 37 (12) (2010) 8019–8026.

[7] Y. Zheng, X. Guo, X. Ding, A novel hybrid energy fraction and entropy-based approach for systolic heart murmurs identification, Expert Systems with Applications 42 (5) (2015) 2710–2721.

[8] H. Uğuz, A biomedical system based on artificial neural network and principal component analysis for diagnosis of the heart valve diseases, Journal of medical systems 36 (1) (2012) 61–72.

[9] A. Gharehbaghi, I. Ekman, P. Ask, E. Nylander, B. Janerot-Sjoberg, Assessment of aortic valve stenosis severity using intelligent phonocardiography, International journal of cardiology 198 (2015) 58–60.

[10] R. Saraçoğlu, Hidden markov model-based classification of heart valve disease with pca for dimension reduction, Engineering Applications of Artificial Intelligence 25 (7) (2012) 1523–1528.

[11] A. Quiceno-Manrique, J. Godino-Llorente, M. Blanco-Velasco, G. Castellanos-Dominguez, Selection of dynamic features based on time–frequency representations for heart murmur detection from phonocardiographic signals, Annals of biomedical engineering 38 (1) (2010) 118–137.

[12] R. Wahid, N. I. Ghali, H. S. Own, T.-h. Kim, A. E. Hassanien, A gaussian mixture models approach to human heart signal verification using different feature extraction algorithms, in: Computer Applications for Bio-technology, Multimedia, and Ubiquitous City, Springer, 2012, pp. 16–24.

[13] G. D. Clifford, C. Liu, B. Moody, J. Millet, S. Schmidt, Q. Li, I. Silva, R. G. Mark, Recent advances in heart sound analysis, Physiological Measurement 38 (8) (2017) E10–E25. doi:10.1088/1361-6579/aa7ec8. URL https://doi.org/10.1088%2F1361-6579%2Faa7ec8

[14] C. Potes, S. Parvaneh, A. Rahman, B. Conroy, Ensemble of feature-based and deep learning-based classifiers for detection of abnormal heart sounds, 2016 Computing in Cardiology Conference (CinC) (2016) 621–624.

[15] A. K. Dwivedi, S. A. Imtiaz, E. Rodríguez-Villegas, Algorithms for automatic analysis and classification of heart sounds–a systematic review, IEEE Access 7 (2019) 8316–8345.

[16] F. Renna, J. Oliveira, M. T. Coimbra, Convolutional neural networks for heart sound segmentation, in: 2018 26th European Signal Processing Conference (EUSIPCO), IEEE, 2018, pp. 757–761.

[17] F. Noman, C.-M. Ting, S.-H. Salleh, H. Ombao, Short-segment heart sound classification using an ensemble of deep convolutional neural networks, in: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2019, pp. 1318–1322.

[18] A. Moukadem, A. Dieterlen, N. Hueber, C. Brandt, Localization of heart sounds based on s-transform and radial basis function neural network, in: 15th Nordic-Baltic Conference on Biomedical Engineering and Medical Physics (NBC 2011), Springer, 2011, pp. 168–171.

[19] A. Castro, T. T. Vinhoza, S. S. Mattos, M. T. Coimbra, Heart sound segmentation of pediatric auscultations using wavelet analysis, in: Engineering in Medicine and Biology Society (EMBC), 2013 35th Annual International Conference of the IEEE, IEEE, 2013, pp. 3909–3912.

[20] S. Patidar, R. B. Pachori, N. Garg, Automatic diagnosis of septal defects based on tunable-q wavelet transform of cardiac sound signals, Expert Systems with Applications 42 (7) (2015) 3315–3326.

[21] Z. Abduh, E. A. Nehary, M. A. Wahed, Y. M. Kadah, Classification of heart sounds using fractional fourier transform based mel-frequency spectral coefficients and stacked autoencoder deep neural network, Journal of Medical Imaging and Health Informatics 9 (1) (2019) 1–8.

[22] S. E. Schmidt, C. Holst-Hansen, C. Graff, E. Toft, J. J. Struijk, Segmentation of heart sound recordings by a duration-dependent hidden markov model, Physiological measurement 31 (4) (2010) 513.

[23] D. B. Springer, L. Tarassenko, G. D. Clifford, Logistic regression-hsmm-based heart sound segmentation, IEEE Transactions on Biomedical Engineering 63 (4) (2016) 822–832.

[24] A. A. Sepehri, J. Hancq, T. Dutoit, A. Gharehbaghi, A. Kocharian, A. Kiani, Computerized screening of children congenital heart diseases, Computer methods and programs in biomedicine 92 (2) (2008) 186–192.

[25] H. Uğuz, Adaptive neuro-fuzzy inference system for diagnosis of the heart valve diseases using wavelet transform with entropy, Neural Computing and applications 21 (7) (2012) 1617–1628.

[26] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, P. Ouellet, Front-end factor analysis for speaker verification, IEEE Transactions on Audio, Speech, and Language Processing 19 (4) (2011) 788–798.

[27] N. Dehak, P. A. Torres-Carrasquillo, D. Reynolds, R. Dehak, Language recognition via i-vectors and dimensionality reduction, in: Twelfth annual conference of the international speech communication association, 2011.

[28] D. Martinez, O. Plchot, L. Burget, O. Glembek, P. Matějka, Language recognition in ivectors space, in: Twelfth Annual Conference of the International Speech Communication Association, 2011.

[29] M. H. Bahari, R. Saeidi, D. van Leeuwen, et al., Accent recognition using i-vector, gaussian mean supervector and gaussian posterior probability supervector for spontaneous telephone speech.

[30] R. Xia, Y. Liu, Using i-vector space model for emotion recognition, in: Thirteenth Annual Conference of the International Speech Communication Association, 2012.

[31] H. Khaki, E. Erzin, Continuous emotion tracking using total variability space, in: Sixteenth Annual Conference of the International Speech Communication Association, 2015.

[32] H. Eghbal-Zadeh, B. Lehner, M. Dorfer, G. Widmer, Cp-jku submissions for dcase-2016: A hybrid approach using binaural i-vectors and deep convolutional neural networks, IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE).

[33] M. Adiban, H. Sameti, N. Maghsoodi, S. Shahsavari, Sut system description for anti-spoofing 2017 challenge, in: Proceedings of the 29th Conference on Computational Linguistics and Speech Processing (ROCLING 2017), 2017, pp. 264–275.

[34] M. R. Hasan, M. Jamil, M. Rahman, et al., Speaker identification using mel frequency cepstral coefficients, variations 1 (4).

[35] V. Tiwari, Mfcc and its applications in speaker recognition, International journal on emerging technologies 1 (1) (2010) 19–22.

[36] M. El Ayadi, M. S. Kamel, F. Karray, Survey on speech emotion recognition: Features, classification schemes, and databases, Pattern Recognition 44 (3) (2011) 572–587.

[37] X. Wei, L. Wenju, et al., Multilingual i-vector based statistical modeling for music genre classification.

[38] H. Zeinali, B. BabaAli, H. Hadian, Online signature verification using i-vector representation, IET Biometrics.

[39] H. Zeinali, A. Mirian, H. Sameti, B. BabaAli, Non-speaker information reduction from cosine similarity scoring in i-vector based speaker verification, Computers & Electrical Engineering 48 (2015) 226–238.

[40] D. A. Reynolds, T. F. Quatieri, R. B. Dunn, Speaker verification using adapted gaussian mixture models, Digital signal processing 10 (1-3) (2000) 19–41.

[41] W. M. Campbell, D. E. Sturim, D. A. Reynolds, A. Solomonoff, Svm based speaker verification using a gmm supervector kernel and nap variability compensation, in: Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on, Vol. 1, IEEE, 2006, pp. I–I.

[42] A. Solomonoff, C. Quillen, W. M. Campbell, Channel compensation for svm speaker recognition., in: Odyssey, Vol. 4, Citeseer, 2004, pp. 219–226.

[43] A. Solomonoff, W. M. Campbell, I. Boardman, Advances in channel compensation for svm speaker recognition, in: Acoustics, Speech, and Signal Processing, 2005. Proceedings.(ICASSP’05). IEEE International Conference on, Vol. 1, IEEE, 2005, pp. I–629.

[44] A. O. Hatch, S. Kajarekar, A. Stolcke, Within-class covariance normalization for svm-based speaker recognition, in: Ninth international conference on spoken language processing, 2006.

[45] N. Dehak, P. Kenny, R. Dehak, O. Glembek, P. Dumouchel, L. Burget, V. Hubeika, F. Castaldo, Support vector machines and joint factor analysis for speaker verification.

[46] C. R. Rao, The utilization of multiple measurements in problems of biological classification, Journal of the Royal Statistical Society. Series B (Methodological) 10 (2) (1948) 159–203.

[47] L. Van Der Maaten, E. Postma, J. Van den Herik, Dimensionality reduction: a comparative, J Mach Learn Res 10 (2009) 66–71.

[48] H. Abdi, L. J. Williams, Principal component analysis, Wiley interdisciplinary reviews: computational statistics 2 (4) (2010) 433–459.

[49] D. Jang, H. Park, G. Choi, Estimation of leakage ratio using principal component analysis and artificial neural network in water distribution systems, Sustainability 10 (3) (2018) 750.

[50] D. P. Kingma, M. Welling, Auto-encoding variational bayes, arXiv preprint arXiv:1312.6114.

[51] D. Reynolds, Gaussian mixture models, Encyclopedia of biometrics (2015) 827–832.

[52] D. A. Reynolds, Automatic speaker recognition using gaussian mixture speaker models, in: The Lincoln Laboratory Journal, Citeseer, 1995.

[53] B. Bozkurt, I. Germanakis, Y. Stylianou, A study of time-frequency features for cnn-based automatic heart sound classification for pathology detection, Computers in biology and medicine.

development of a large vocabulary speech recognition system for Persian language. His research interests span the areas of statistical machine learning and pattern recognition, sequential pattern labeling, and deep learning and he has authored and co-authored more than 70 scientific papers in these fields.
