Rank-1 Constrained Multichannel Wiener Filter for Speech Recognition in Noisy Environments

2017·Arxiv

Abstract

Multichannel linear filters, such as the Multichannel Wiener Filter (MWF) and the Generalized Eigenvalue (GEV) beamformer, are popular signal processing techniques which can improve speech recognition performance. In this paper, we present an experimental study on these linear filters in a specific speech recognition task, namely the CHiME-4 challenge, which features real recordings in multiple noisy environments. Specifically, the rank-1 MWF is employed for noise reduction and a new constant residual noise power constraint is derived which enhances the recognition performance. To fulfill the underlying rank-1 assumption, the speech covariance matrix is reconstructed based on eigenvectors or generalized eigenvectors. Then the rank-1 constrained MWF is evaluated with alternative multichannel linear filters under the same framework, which involves a Bidirectional Long Short-Term Memory (BLSTM) network for mask estimation. The proposed filter outperforms alternative ones, leading to a 40% relative Word Error Rate (WER) reduction compared with the baseline Weighted Delay and Sum (WDAS) beamformer on the real test set, and a 15% relative WER reduction compared with the GEV-BAN method. The results also suggest that the speech recognition accuracy correlates more with the Mel-frequency cepstral coefficients (MFCC) feature variance than with the noise reduction or the speech distortion level.

Keywords: rank-1 multichannel Wiener filter, speech recognition, residual noise power, deep neural network.

1. Introduction

Robust machine speech recognition in real environments is a common interest for the signal processing and speech recognition communities [1]. It has been a challenging task for decades. One main reason is that the target speech is corrupted by various background noises. Signal processing methods are able to extract the desired source from corrupted measurements and to improve the recognition accuracy. For this purpose, multichannel techniques improve over single-channel techniques by exploiting information not only in the time-frequency domain but also in the spatial domain.

Multichannel linear filters, also known as beamformers, have been amply investigated in the literature [2, 3]. Nevertheless, only a few approaches have found widespread use in the speech recognition community until recently; these include the Weighted Delay and Sum (WDAS) beamformer in BeamformIt [4] and the Minimum Variance Distortionless Response (MVDR) beamformer in BTK. Recent works have explored more extensive beamforming implementations in the scope of speech recognition [5, 6, 7], and the outcomes of these works indeed benefit both signal processing and speech recognition communities. On the one hand, multichannel algorithms designed to suppress noise [8], reverberation [9] or competing speech, can be used as preprocessing steps for speech recognition. Though they are in general intended for improving the speech perceptual quality [10], some improvements are typically also achieved in terms of speech recognition performance. On the other hand, the speech recognition application inspires many new beamforming architectures [11, 12]. The recognition accuracy metric can also highlight an algorithm from a different perspective [13].

Remarkably, Deep Neural Network (DNN) based linear filtering has gained popularity with its success in recent speech recognition challenges [14, 15, 16, 17]. A regression DNN can be used to predict the speech spectra and combined with the classical multichannel Gaussian model to derive a Multichannel Wiener Filter (MWF) [14, 15]. Alternatively, a Bi-directional Long Short-Term Memory (BLSTM) network can be applied as a classification model to predict a spectral mask and combined with the MVDR beamformer or the Generalized Eigenvalue (GEV) beamformer [16, 17]. The mask is used in the calculation of the source covariance matrix, from which the linear filter coefficients are obtained. Deep neural networks have proved to be more capable of estimating the speech second-order statistics or the speech presence probability than traditional methods.

Among the above linear filters, the MVDR beamformer is theoretically designed to be distortionless [18], while the GEV beamformer is targeted to achieve maximum output Signal-to-Noise Ratio (SNR) [19]. MWF [20] is a Minimum Mean Square Error (MMSE) solution which allows for a given amount of noise reduction at the expense of some speech distortion. There exist other linear filter variants, such as the Speech Distortion Weighted MWF (SDW-MWF) [21, 22, 23] and the Variable Span (VS) linear filter [24]. The SDW-MWF involves a trade-off parameter which tunes the speech distortion versus the noise reduction. In the case of a single target source, it can be expressed in the form of a spatial-prediction MWF [25] or a rank-1 MWF [26]. Note that these linear filters are all equivalent up to a scaling factor if formulated in a unified framework [24, 27, 28]. While the speech quality performance of these filters has been well studied, the comparison in terms of speech recognition performance is lacking. An interesting question is whether the already known speech quality performance can be related to the speech recognition accuracy.

In this paper, we provide an extensive experimental study of the relative performance of these multichannel linear filters, considering the real-world speech recognition task in multiple noisy environments of the CHiME-4 challenge [29]. In particular, we focus on a family of rank-1 MWF variants. We propose a new constraint of constant residual noise power along both time and frequency, which links the rank-1 MWF and the GEV beamformer. This constraint is shown to enhance the speech recognition performance. To fulfill the underlying rank-1 assumption, we introduce a speech covariance matrix reconstruction process. The reconstruction is based on eigenvectors or generalized eigenvectors. In the experiments, all linear filters are supported by the same BLSTM network, which is used for mask estimation. An overview of the system is given in Fig. 1. We also introduce a novel feature variance metric that correlates well with the Word Error Rate (WER) and helps to explain the benefit of the proposed constant residual noise power constraint.

Figure 1: System illustration with BLSTM supported linear filters. $\boldsymbol{\Phi}_{xx}$ is the speech covariance matrix and $\boldsymbol{\Phi}_{nn}$ is the noise covariance matrix.

The rest of this paper is organized as follows. The multichannel signal processing problem is formulated in Section 2. In Section 3, the rank-1 MWF solution is first introduced. Three filter variants, including the novel constant residual noise power filter, are then derived separately. To fulfill the rank-1 assumption in practice, the eigenvector based speech covariance matrix reconstruction is discussed in Section 4. The speech recognition experiments, the BLSTM network for mask estimation, the results and the analysis are presented in Section 5. Conclusions are drawn in Section 6.

2. Problem formulation

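The multichannel signal processing problem is formulated as follows. A target speech source $s$ propagates in the acoustic space and impinges on an array of $M$ microphones. The observations at time $t$ are given by

$$y_m(t) = g_m(t) \star s(t) + n_m(t), \quad m = 1, \ldots, M, \qquad (1)$$

where $\star$ denotes convolution, $g_m(t)$ is the time-invariant acoustic impulse response from the source to the $m$th microphone and $n_m(t)$ is the undesired noise at microphone $m$. Under the narrowband assumption [28], the above model can be written in the frequency domain as

$$Y_m(l,k) = G_m(k)\,S(l,k) + N_m(l,k) = X_m(l,k) + N_m(l,k), \qquad (2)$$

where $l$ and $k$ are respectively the frame index and the frequency index. $Y_m(l,k)$, $S(l,k)$ and $N_m(l,k)$ denote the Short-Time Fourier Transform (STFT) coefficients of $y_m(t)$, $s(t)$ and $n_m(t)$, respectively, and $G_m(k)$ is the Fourier transform of $g_m(t)$. $X_m(l,k) = G_m(k)\,S(l,k)$ is the narrowband approximation of the reverberated source.

Linear filtering techniques aim to design an optimal filter $\mathbf{h}(l,k) = [H_1(l,k), \ldots, H_M(l,k)]^T$ which extracts the desired source and suppresses the other components, where superscript $T$ denotes transposition. This filter is applied to the observation vector $\mathbf{y}(l,k) = [Y_1(l,k), \ldots, Y_M(l,k)]^T$, and the filter output is

$$Z(l,k) = \mathbf{h}^H(l,k)\,\mathbf{y}(l,k) = \mathbf{h}^H(l,k)\,\mathbf{x}(l,k) + \mathbf{h}^H(l,k)\,\mathbf{n}(l,k), \qquad (3)$$

where superscript $H$ denotes Hermitian transpose, $\mathbf{x}(l,k) = [X_1(l,k), \ldots, X_M(l,k)]^T$ and $\mathbf{n}(l,k) = [N_1(l,k), \ldots, N_M(l,k)]^T$.

The filter coefficients can be derived by setting certain constraints on the filtered output, for instance, to achieve MMSE with respect to an arbitrary channel of the reverberated source, say $X_1(l,k)$. This is expressed as the optimization problem:

$$\mathbf{h}_{\text{MWF}} = \arg\min_{\mathbf{h}} E\{|X_1 - \mathbf{h}^H \mathbf{y}|^2\}, \qquad (4)$$

where $E\{\cdot\}$ means expectation. Assuming speech and noise are uncorrelated, we can rewrite (4) as

$$\mathbf{h}_{\text{MWF}} = \arg\min_{\mathbf{h}} E\{|X_1 - \mathbf{h}^H \mathbf{x}|^2\} + E\{|\mathbf{h}^H \mathbf{n}|^2\}, \qquad (5)$$

where the first term is the speech distortion and the second term is the residual noise power. A weight $\mu$ can be introduced to control the contribution of the second term:

$$\mathbf{h}_{\text{SDW-MWF}} = \arg\min_{\mathbf{h}} E\{|X_1 - \mathbf{h}^H \mathbf{x}|^2\} + \mu\,E\{|\mathbf{h}^H \mathbf{n}|^2\}. \qquad (6)$$

The solution of this weighted optimization problem is known as the SDW-MWF [21]

$$\mathbf{h}_{\text{SDW-MWF}} = (\boldsymbol{\Phi}_{xx} + \mu\,\boldsymbol{\Phi}_{nn})^{-1}\,\boldsymbol{\Phi}_{xx}\,\mathbf{u}_1, \qquad (7)$$

where $\boldsymbol{\Phi}_{xx} = E\{\mathbf{x}\mathbf{x}^H\}$ is the speech covariance matrix, $\boldsymbol{\Phi}_{nn} = E\{\mathbf{n}\mathbf{n}^H\}$ is the noise covariance matrix and $\mathbf{u}_1 = [1, 0, \ldots, 0]^T$ is an $M$-dimensional vector that projects on the first channel. The hyperparameter $\mu$ in the SDW-MWF controls the trade-off between speech distortion and noise reduction. A larger value of $\mu$ leads to more noise reduction at the expense of more speech distortion. In particular, the plain MWF is obtained with $\mu = 1$.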
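As an illustration of the trade-off described above, the following minimal sketch evaluates the standard SDW-MWF expression $(\boldsymbol{\Phi}_{xx} + \mu\boldsymbol{\Phi}_{nn})^{-1}\boldsymbol{\Phi}_{xx}\mathbf{u}_1$ for a single time-frequency bin. The function names `sdw_mwf` and `toy_covariances`, the 4-microphone setup and the random toy statistics are assumptions made for this example only; they are not taken from the paper.

```python
import numpy as np

def sdw_mwf(phi_xx, phi_nn, mu=1.0, ref=0):
    """SDW-MWF weights (Phi_xx + mu * Phi_nn)^{-1} Phi_xx u_ref for one bin."""
    u_ref = np.zeros(phi_xx.shape[0], dtype=complex)
    u_ref[ref] = 1.0  # selects the reference channel
    return np.linalg.solve(phi_xx + mu * phi_nn, phi_xx @ u_ref)

def toy_covariances(num_mics=4, seed=0):
    """Hypothetical test data: rank-1 speech covariance, full-rank noise covariance."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal(num_mics) + 1j * rng.standard_normal(num_mics)
    phi_xx = 2.0 * np.outer(g, g.conj())
    a = rng.standard_normal((num_mics, num_mics)) + 1j * rng.standard_normal((num_mics, num_mics))
    phi_nn = a @ a.conj().T + num_mics * np.eye(num_mics)
    return phi_xx, phi_nn

phi_xx, phi_nn = toy_covariances()
residual_noise = {}
for mu in (0.5, 1.0, 5.0):
    h = sdw_mwf(phi_xx, phi_nn, mu)
    # Residual noise power h^H Phi_nn h shrinks as mu grows.
    residual_noise[mu] = float(np.real(h.conj() @ phi_nn @ h))
    print(f"mu={mu}: residual noise power {residual_noise[mu]:.4f}")
```

Solving the linear system with `np.linalg.solve` avoids forming an explicit matrix inverse, which is the usual numerically safer choice for small covariance matrices.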

3. Rank-1 MWF variants

In the following, we first review the rank-1 MWF solution [26]. Then three filter variants, namely the minimum distortion filter, the plain rank-1 MWF and the new constant residual noise power filter, are derived separately by finding the proper trade-off parameter values. While the first two variants were discussed in [26], the last one is obtained here by analysing the GEV beamformer [19], a filter that maximizes the output SNR. We show that the GEV beamformer also features a constant residual noise power property over both time and frequency. The new rank-1 MWF variant is then derived following this constraint.

3.1. Rank-1 MWF

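Under the narrowband approximation (2), the speech covariance matrix $\boldsymbol{\Phi}_{xx}$ can be decomposed as

$$\boldsymbol{\Phi}_{xx}(l,k) = \phi_{ss}(l,k)\,\mathbf{g}(k)\,\mathbf{g}^H(k), \qquad (8)$$

where $\phi_{ss}(l,k) = E\{|S(l,k)|^2\}$ denotes the speech power spectral density and $\mathbf{g}(k) = [G_1(k), \ldots, G_M(k)]^T$ is the vector of acoustic transfer functions. This matrix is of rank-1. Thus

$$\lambda(l,k) = \mathrm{tr}\{\boldsymbol{\Phi}_{nn}^{-1}\boldsymbol{\Phi}_{xx}\} = \phi_{ss}\,\mathbf{g}^H \boldsymbol{\Phi}_{nn}^{-1}\,\mathbf{g}, \qquad (9)$$

where $\mathrm{tr}\{\cdot\}$ is the trace operation. With Woodbury's identity and the fact that

$$\boldsymbol{\Phi}_{nn}^{-1}\boldsymbol{\Phi}_{xx}\,\boldsymbol{\Phi}_{nn}^{-1}\boldsymbol{\Phi}_{xx} = \lambda\,\boldsymbol{\Phi}_{nn}^{-1}\boldsymbol{\Phi}_{xx}, \qquad (10)$$

the SDW-MWF solution ends up in the rank-1 MWF

$$\mathbf{h}_{\text{r1MWF}} = \frac{\boldsymbol{\Phi}_{nn}^{-1}\boldsymbol{\Phi}_{xx}}{\mu + \lambda}\,\mathbf{u}_1. \qquad (11)$$

Similarly, the trade-off parameter $\mu$ controls the speech distortion and noise reduction performance. With different parameter values, the corresponding filter variants exhibit different properties.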
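The equivalence between the SDW-MWF and the rank-1 MWF can be checked numerically. The sketch below assumes the usual closed form $\boldsymbol{\Phi}_{nn}^{-1}\boldsymbol{\Phi}_{xx}\mathbf{u}_1/(\mu+\lambda)$ with $\lambda = \mathrm{tr}\{\boldsymbol{\Phi}_{nn}^{-1}\boldsymbol{\Phi}_{xx}\}$; the function name `r1_mwf` and the toy statistics are hypothetical and serve only as an illustration.

```python
import numpy as np

def r1_mwf(phi_xx, phi_nn, mu=1.0, ref=0):
    """Rank-1 MWF: Phi_nn^{-1} Phi_xx u_ref / (mu + lambda)."""
    ratio = np.linalg.solve(phi_nn, phi_xx)   # Phi_nn^{-1} Phi_xx
    lam = np.real(np.trace(ratio))            # lambda = tr(Phi_nn^{-1} Phi_xx)
    return ratio[:, ref] / (mu + lam)         # column `ref` equals ratio @ u_ref

# Hypothetical toy statistics for a 4-microphone array.
rng = np.random.default_rng(1)
g = rng.standard_normal(4) + 1j * rng.standard_normal(4)    # transfer functions
phi_xx = 1.5 * np.outer(g, g.conj())                        # exactly rank-1
a = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
phi_nn = a @ a.conj().T + 4.0 * np.eye(4)                   # Hermitian, positive definite

mu = 2.0
h_rank1 = r1_mwf(phi_xx, phi_nn, mu)
h_sdw = np.linalg.solve(phi_xx + mu * phi_nn, phi_xx[:, 0])  # SDW-MWF form
print("rank-1 MWF matches SDW-MWF:", np.allclose(h_rank1, h_sdw))

# With mu = 0 the filter passes the reference-channel speech unchanged: h^H g = g[0].
h_md = r1_mwf(phi_xx, phi_nn, mu=0.0)
print("mu = 0 is distortionless:", np.allclose(h_md.conj() @ g, g[0]))
```

The two forms agree only when the speech covariance matrix is exactly rank-1, which is why the toy matrix is built as an outer product.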

3.2. Minimum distortion filter and plain rank-1 MWF

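These two filter variants match the cases of $\mu = 0$ and $\mu = 1$, respectively. With $\lambda = \mathrm{tr}\{\boldsymbol{\Phi}_{nn}^{-1}\boldsymbol{\Phi}_{xx}\}$, they read

$$\mathbf{h}_{\text{r1MWF-0}} = \frac{\boldsymbol{\Phi}_{nn}^{-1}\boldsymbol{\Phi}_{xx}}{\lambda}\,\mathbf{u}_1, \qquad (12)$$

$$\mathbf{h}_{\text{r1MWF-1}} = \frac{\boldsymbol{\Phi}_{nn}^{-1}\boldsymbol{\Phi}_{xx}}{1 + \lambda}\,\mathbf{u}_1. \qquad (13)$$

Under the rank-1 model, the filter with $\mu = 0$ leaves the reference-channel speech unchanged, so it is indeed distortionless in theory.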

3.3. Constant residual noise power filter

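To derive the new filter variant, we first investigate the maximum SNR filter that is defined as

$$\mathbf{h}_{\text{SNR}} = \arg\max_{\mathbf{h}} \frac{\mathbf{h}^H \boldsymbol{\Phi}_{xx}\,\mathbf{h}}{\mathbf{h}^H \boldsymbol{\Phi}_{nn}\,\mathbf{h}}. \qquad (14)$$

This is a generalized Rayleigh quotient and the GEV solution is

$$\mathbf{h}_{\text{GEV}} = \mathcal{P}\{\boldsymbol{\Phi}_{nn}^{-1}\boldsymbol{\Phi}_{xx}\}, \qquad (15)$$

where $\mathcal{P}\{\cdot\}$ takes the eigenvector corresponding to the largest eigenvalue, which is defined up to an arbitrary scale. An additional Blind Analytical Normalization (BAN) post-filter can be applied to control the speech distortion [19]. The output SNR of the GEV beamformer is equal to the largest eigenvalue of $\boldsymbol{\Phi}_{nn}^{-1}\boldsymbol{\Phi}_{xx}$, which is exactly $\lambda = \mathrm{tr}\{\boldsymbol{\Phi}_{nn}^{-1}\boldsymbol{\Phi}_{xx}\}$ in the rank-1 case.

Meanwhile, the two Hermitian matrices $\boldsymbol{\Phi}_{xx}$ and $\boldsymbol{\Phi}_{nn}$ can be jointly diagonalized as

$$\mathbf{B}^H \boldsymbol{\Phi}_{xx}\,\mathbf{B} = \boldsymbol{\Lambda}, \qquad \mathbf{B}^H \boldsymbol{\Phi}_{nn}\,\mathbf{B} = \mathbf{I}, \qquad (16)$$

where $\mathbf{B}$ and $\boldsymbol{\Lambda}$ are respectively the eigenvector and eigenvalue matrices of $\boldsymbol{\Phi}_{nn}^{-1}\boldsymbol{\Phi}_{xx}$, and $\mathbf{I}$ is the identity matrix [24]. If the diagonal elements of $\boldsymbol{\Lambda}$ are in descending order, then the GEV beamformer (15) can be chosen as the first column vector of $\mathbf{B}$. This is the usual choice made in the literature [19] and the one we also make in the following. We denote it by $\mathbf{b}_1$. By defining the residual noise as $\mathbf{h}^H\mathbf{n}$, we see that the residual noise power of the GEV is given by

$$E\{|\mathbf{b}_1^H \mathbf{n}|^2\} = \mathbf{b}_1^H \boldsymbol{\Phi}_{nn}\,\mathbf{b}_1 = 1, \qquad (17)$$

which indicates constant residual noise power over both frequency and time.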
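The joint diagonalization and the resulting unit residual noise power can be reproduced with a few lines of NumPy. The sketch below is one possible implementation based on Cholesky whitening of the noise covariance matrix; the function name `joint_diagonalize` and the toy matrices are assumptions for illustration, and other generalized eigenvalue solvers would serve equally well.

```python
import numpy as np

def joint_diagonalize(phi_xx, phi_nn):
    """Return (B, eigvals) with B^H Phi_nn B = I and B^H Phi_xx B = diag(eigvals),
    eigenvalues sorted in descending order."""
    chol = np.linalg.cholesky(phi_nn)              # Phi_nn = chol @ chol^H
    chol_inv = np.linalg.inv(chol)
    whitened = chol_inv @ phi_xx @ chol_inv.conj().T
    whitened = 0.5 * (whitened + whitened.conj().T)  # enforce Hermitian symmetry
    eigvals, eigvecs = np.linalg.eigh(whitened)      # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    b = chol_inv.conj().T @ eigvecs[:, order]        # map back from the whitened domain
    return b, eigvals[order]

# Hypothetical toy statistics (rank-1 speech, full-rank noise).
rng = np.random.default_rng(2)
g = rng.standard_normal(4) + 1j * rng.standard_normal(4)
phi_xx = 0.8 * np.outer(g, g.conj())
a = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
phi_nn = a @ a.conj().T + 4.0 * np.eye(4)

b, eigvals = joint_diagonalize(phi_xx, phi_nn)
h_gev = b[:, 0]                                      # GEV beamformer: first column of B
gev_noise_power = float(np.real(h_gev.conj() @ phi_nn @ h_gev))
trace_value = float(np.real(np.trace(np.linalg.solve(phi_nn, phi_xx))))
print("residual noise power:", gev_noise_power)
print("largest eigenvalue vs. trace:", eigvals[0], trace_value)
```

Because the eigenvectors are mapped back through the inverse Cholesky factor, the normalization $\mathbf{B}^H\boldsymbol{\Phi}_{nn}\mathbf{B} = \mathbf{I}$ holds by construction, independently of the signal level in each bin.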

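Going back to the rank-1 MWF, it can be proved that the rank-1 MWF solution also satisfies (14) with an arbitrary trade-off parameter. The general expectation of the residual noise power is

$$E\{|\mathbf{h}_{\text{r1MWF}}^H \mathbf{n}|^2\} = \mathbf{h}_{\text{r1MWF}}^H \boldsymbol{\Phi}_{nn}\,\mathbf{h}_{\text{r1MWF}} = \frac{\mathbf{u}_1^H \boldsymbol{\Phi}_{xx}\boldsymbol{\Phi}_{nn}^{-1}\boldsymbol{\Phi}_{xx}\,\mathbf{u}_1}{(\mu + \lambda)^2} = \frac{\lambda\,\phi_{x_1x_1}}{(\mu + \lambda)^2}, \qquad (18)$$

in which $\phi_{x_1x_1} = \mathbf{u}_1^H \boldsymbol{\Phi}_{xx}\,\mathbf{u}_1$ is the speech power at the reference channel, $\lambda = \mathrm{tr}\{\boldsymbol{\Phi}_{nn}^{-1}\boldsymbol{\Phi}_{xx}\}$, and the final step makes use of equation (10). Setting the residual noise power to the constant value 1 in equation (18), we obtain

$$\mu_G(l,k) = \sqrt{\lambda(l,k)\,\phi_{x_1x_1}(l,k)} - \lambda(l,k), \qquad (19)$$

which has become frame and frequency dependent. Thus a rank-1 MWF filter which is similar to the GEV in terms of maximizing the output SNR and leading to constant residual noise power, but different in terms of projection direction, is given by

$$\mathbf{h}_{\text{r1MWF-}\mu_G} = \frac{\boldsymbol{\Phi}_{nn}^{-1}\boldsymbol{\Phi}_{xx}}{\mu_G + \lambda}\,\mathbf{u}_1 = \frac{\boldsymbol{\Phi}_{nn}^{-1}\boldsymbol{\Phi}_{xx}}{\sqrt{\lambda\,\phi_{x_1x_1}}}\,\mathbf{u}_1. \qquad (20)$$

This choice of $\mu$ is new in the context of rank-1 MWF. Although it has been known that linear filters are equivalent up to a scaling factor [24, 27, 28], the factor that specifically relates the rank-1 MWF and GEV is given here by $\mu_G$ for the first time.

In [30], the residual noise power was chosen as constant over time. Here we restrict it to be constant along frequency too. Note that, under this constraint, the signal can be amplified in some noise-dominated frequency bins and weakened in some speech-dominated frequency bins, which induces speech distortion as does the GEV beamformer. Nevertheless, the derived three rank-1 MWF variants differ only by the spectral shape of the filtered signal. They all project in the spatial direction of $\boldsymbol{\Phi}_{nn}^{-1}\boldsymbol{\Phi}_{xx}\,\mathbf{u}_1$, but with different spectral gains.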
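The constant residual noise power constraint can be illustrated as follows. The sketch assumes that the target constant is 1 (the value obtained for the GEV beamformer when its eigenvectors are normalized against the noise covariance matrix) and that the speech covariance matrix is exactly rank-1; under these assumptions, solving $\lambda\,\phi_{x_1x_1}/(\mu+\lambda)^2 = 1$ gives $\mu = \sqrt{\lambda\,\phi_{x_1x_1}} - \lambda$. The function name `constant_noise_r1_mwf` and the toy data are hypothetical.

```python
import numpy as np

def constant_noise_r1_mwf(phi_xx, phi_nn, ref=0):
    """Rank-1 MWF whose residual noise power h^H Phi_nn h equals 1 (assumed target)."""
    ratio = np.linalg.solve(phi_nn, phi_xx)    # Phi_nn^{-1} Phi_xx
    lam = np.real(np.trace(ratio))             # lambda
    phi_ref = np.real(phi_xx[ref, ref])        # speech power at the reference channel
    mu = np.sqrt(lam * phi_ref) - lam          # solves lam * phi_ref / (mu + lam)^2 = 1
    return ratio[:, ref] / (mu + lam), mu

# Hypothetical toy bins with very different speech levels.
rng = np.random.default_rng(3)
noise_powers = []
for speech_psd in (0.1, 1.0, 10.0):
    g = rng.standard_normal(4) + 1j * rng.standard_normal(4)
    phi_xx = speech_psd * np.outer(g, g.conj())
    a = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
    phi_nn = a @ a.conj().T + 4.0 * np.eye(4)
    h, mu = constant_noise_r1_mwf(phi_xx, phi_nn)
    noise_powers.append(float(np.real(h.conj() @ phi_nn @ h)))
    print(f"speech PSD {speech_psd}: mu = {mu:.3f}, residual noise = {noise_powers[-1]:.6f}")
```

The trade-off parameter changes from bin to bin while the residual noise power stays fixed, which is the behaviour the constraint is meant to enforce.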

4. Rank-1 constraint on the speech covariance matrix

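The above linear filters are specified as functions of the covariance matrices: $\boldsymbol{\Phi}_{xx}(l,k)$ and $\boldsymbol{\Phi}_{nn}(l,k)$. In practice, the covariance matrices need to be estimated either by recursive smoothing

$$\hat{\boldsymbol{\Phi}}_{xx}(l,k) = \alpha\,\hat{\boldsymbol{\Phi}}_{xx}(l-1,k) + (1-\alpha)\,M_x(l,k)\,\mathbf{y}(l,k)\,\mathbf{y}^H(l,k), \qquad (21)$$

$$\hat{\boldsymbol{\Phi}}_{nn}(l,k) = \alpha\,\hat{\boldsymbol{\Phi}}_{nn}(l-1,k) + (1-\alpha)\,M_n(l,k)\,\mathbf{y}(l,k)\,\mathbf{y}^H(l,k), \qquad (22)$$

or by the arithmetic mean

$$\hat{\boldsymbol{\Phi}}_{xx}(k) = \frac{\sum_l M_x(l,k)\,\mathbf{y}(l,k)\,\mathbf{y}^H(l,k)}{\sum_l M_x(l,k)}, \qquad (23)$$

$$\hat{\boldsymbol{\Phi}}_{nn}(k) = \frac{\sum_l M_n(l,k)\,\mathbf{y}(l,k)\,\mathbf{y}^H(l,k)}{\sum_l M_n(l,k)}, \qquad (24)$$

where $\alpha$ is a forgetting factor, and $M_x(l,k)$ and $M_n(l,k)$ represent the speech and noise masks or the speech and noise presence probabilities, respectively. Due to estimation errors or to the fact that the narrowband assumption (8) does not hold perfectly, the estimated speech covariance $\hat{\boldsymbol{\Phi}}_{xx}(l,k)$ is not rank-1. In [23], using a low-rank approximation of the speech covariance matrix in the SDW-MWF effectively delivered better noise reduction performance. This motivates us to constrain the estimated speech covariance matrix to be rank-1 as follows.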
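A mask-weighted arithmetic mean of this kind can be sketched as below. The normalization by the sum of the mask values, the array layout `(mics, frames, bins)` and the function name `masked_covariance` are assumptions of this example; the toy masks are random numbers rather than network outputs.

```python
import numpy as np

def masked_covariance(stft, mask, eps=1e-10):
    """Mask-weighted covariance per frequency bin.

    stft: complex array of shape (mics, frames, bins)
    mask: real array of shape (frames, bins) with values in [0, 1]
    returns: complex array of shape (bins, mics, mics)
    """
    # Sum over frames of mask * y y^H, computed for every frequency bin at once.
    weighted = np.einsum('lk,mlk,nlk->kmn', mask, stft, stft.conj())
    return weighted / (mask.sum(axis=0)[:, None, None] + eps)

# Hypothetical toy input: 3 microphones, 50 frames, 5 frequency bins.
rng = np.random.default_rng(4)
stft = rng.standard_normal((3, 50, 5)) + 1j * rng.standard_normal((3, 50, 5))
speech_mask = rng.uniform(size=(50, 5))
noise_mask = 1.0 - speech_mask          # toy complement, for illustration only

phi_xx_hat = masked_covariance(stft, speech_mask)
phi_nn_hat = masked_covariance(stft, noise_mask)
print(phi_xx_hat.shape, phi_nn_hat.shape)
```

Averaging over the whole utterance yields one pair of matrices per frequency bin, which leads to time-invariant filters.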

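The matrix $\hat{\boldsymbol{\Phi}}_{xx}(l,k)$ can be decomposed into a rank-1 part and a remainder part:

$$\hat{\boldsymbol{\Phi}}_{xx}(l,k) = \boldsymbol{\Phi}_{x_{r1}}(l,k) + \boldsymbol{\Phi}_{x_{z}}(l,k), \qquad (25)$$

where $\boldsymbol{\Phi}_{x_{r1}}(l,k) = \sigma(l,k)\,\mathbf{a}(l,k)\,\mathbf{a}^H(l,k)$ with a real-valued scaling factor $\sigma(l,k)$, and $\mathbf{a}(l,k)$ is defined as the reconstruction vector. The remainder matrix $\boldsymbol{\Phi}_{x_{z}}(l,k)$ can be either treated as noise or simply ignored, leading to different interpretations of the filter [23]. We choose to ignore the remainder part here. $\mathbf{a}(l,k)$ is chosen from the eigenvector and the generalized eigenvector, that are defined as:

$$\mathbf{a}_{\text{evd}} = \mathcal{P}\{\hat{\boldsymbol{\Phi}}_{xx}\}, \qquad (26)$$

$$\mathbf{a}_{\text{gevd}} = \hat{\boldsymbol{\Phi}}_{nn}\,\mathcal{P}\{\hat{\boldsymbol{\Phi}}_{nn}^{-1}\hat{\boldsymbol{\Phi}}_{xx}\}, \qquad (27)$$

where $\mathcal{P}\{\cdot\}$ takes the eigenvector corresponding to the largest eigenvalue. Note that $\mathbf{a}_{\text{gevd}}$ is interpreted as the desired source relative transfer function in [31]. These two expressions result in new EVD and GEVD based filters, respectively, that fulfill the rank-1 assumption used for deriving the rank-1 MWF.

The new filters are given by

$$\mathbf{h}_{\text{r1MWF-}\mu\text{-(g)evd}} = \frac{\hat{\boldsymbol{\Phi}}_{nn}^{-1}\boldsymbol{\Phi}_{x_{r1}}}{\mu + \lambda_{r1}}\,\mathbf{u}_1, \qquad (28)$$

where $\lambda_{r1} = \mathrm{tr}\{\hat{\boldsymbol{\Phi}}_{nn}^{-1}\boldsymbol{\Phi}_{x_{r1}}\}$ and $\boldsymbol{\Phi}_{x_{r1}}$ is built from $\mathbf{a}_{\text{evd}}$ or $\mathbf{a}_{\text{gevd}}$.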
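A minimal sketch of the rank-1 reconstruction is given below. The reconstruction vector is taken as the principal eigenvector of the speech covariance matrix (EVD case) or as the noise covariance matrix applied to the principal generalized eigenvector (GEVD case). The scaling of the rank-1 part is an assumption of this example: it is chosen so that the trace of the estimated speech covariance matrix is preserved. The function name `rank1_reconstruct` and the toy data are hypothetical.

```python
import numpy as np

def rank1_reconstruct(phi_xx, phi_nn=None):
    """Rank-1 approximation scale * a a^H of a speech covariance matrix.

    phi_nn is None -> EVD-based reconstruction vector
    phi_nn given   -> GEVD-based reconstruction vector
    """
    if phi_nn is None:
        _, vecs = np.linalg.eigh(phi_xx)            # ascending eigenvalues
        a = vecs[:, -1]                             # principal eigenvector
    else:
        vals, vecs = np.linalg.eig(np.linalg.solve(phi_nn, phi_xx))
        a = phi_nn @ vecs[:, np.argmax(vals.real)]  # Phi_nn times principal gen. eigenvector
    outer = np.outer(a, a.conj())
    # Assumed scaling: keep the trace (total speech power) of the original estimate.
    scale = np.real(np.trace(phi_xx)) / np.real(np.trace(outer))
    return scale * outer

# Hypothetical toy data: a rank-1 speech covariance and a noisy full-rank estimate of it.
rng = np.random.default_rng(5)
g = rng.standard_normal(4) + 1j * rng.standard_normal(4)
phi_true = 1.2 * np.outer(g, g.conj())
a_mat = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
phi_nn = a_mat @ a_mat.conj().T + 4.0 * np.eye(4)
phi_est = phi_true + 0.05 * phi_nn                 # estimation error makes it full rank

phi_evd = rank1_reconstruct(phi_est)
phi_gevd = rank1_reconstruct(phi_est, phi_nn)
print(np.linalg.matrix_rank(phi_est), np.linalg.matrix_rank(phi_evd), np.linalg.matrix_rank(phi_gevd))
```

When the input matrix is already rank-1, both variants return it unchanged, so the reconstruction only acts on the part of the estimate that violates the rank-1 model.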

5. Experiments and analysis

5.1. The recognition task

The experiments are conducted on the CHiME-4 challenge data [29]. This dataset features real recordings in four daily noise environments: bus, cafeteria, street junction and pedestrian area. Sentences from the Wall Street Journal (WSJ0) 5k corpus are read from a tablet device. Then the audio signals are captured by a 6-channel microphone array embedded in the tablet frame. For subsequent processing, the signals are downsampled to 16kHz. Besides the real recordings, there are also artificially generated sentences. Clean WSJ0 samples are mixed with the environment noises at similar SNRs as the real data. The whole dataset is divided into disjoint training, development and evaluation sets. In the training set, there are 1600 real and 7138 simulated sentences, about 20 hours in total. In the development set and the test set, there are 1640 and 1320 sentences for each kind of data.

The recognition system is the official challenge baseline built with the Kaldi toolkit. The inputs to the DNN acoustic model are Mel-frequency Cepstral Coefficient (MFCC) features processed by feature space Maximum Likelihood Linear Regression (fMLLR) transformation. The outputs are 1979 Hidden Markov Model (HMM) probability states. The acoustic model has 7 layers that are trained under the state level Minimum Bayes Risk (sMBR) criterion. In the decoding phase, a 3-gram Language Model (LM) is used. Recurrent neural network (RNN) LM rescoring is not applied in our experiments: this is the only difference with respect to the official baseline. The results obtained here are not meant to be compared to the best CHiME-4 results, where advanced acoustic and RNN language models are applied.

5.2. Evaluation setup

Table 1: Linear filters involved in the evaluation. They are organized in terms of the projection direction and spectral gain in order to highlight their differences or similarities. The filter h is given by the product of the projection direction and the spectral gain. Note that depend on time and frequency.

The WDAS beamformer [4] is provided as the official baseline for CHiME-4. The linear filters involved in the evaluation are listed in Table 1. They are organized in terms of the projection direction and the spectral gain. GEV-BAN was the method used in the best CHiME-4 submissions [29].

The linear filters are based on the same BLSTM network which simultaneously predicts the speech mask $M_x(l,k)$ and the noise mask $M_n(l,k)$. In [13], the network was combined with MVDR and GEV. We extend the process here to other linear filters. The STFT is performed in 1024 points with 256 points shift. The magnitude spectrum vector of one frame is used as input. The network consists of one recurrent BLSTM layer with 256 nodes and two feed-forward hidden layers with 512 nodes each. The outputs are 1026 nodes for the speech mask and the noise mask (513 frequency bins each). The target ideal masks are defined as

$$M_x(l,k) = \begin{cases} 1, & 10\log_{10}\dfrac{|X_m(l,k)|^2}{|N_m(l,k)|^2} > th_x, \\ 0, & \text{otherwise}, \end{cases} \qquad (29)$$

$$M_n(l,k) = \begin{cases} 1, & 10\log_{10}\dfrac{|X_m(l,k)|^2}{|N_m(l,k)|^2} < th_n, \\ 0, & \text{otherwise}, \end{cases} \qquad (30)$$

where the thresholds for speech and noise detection $th_x$ and $th_n$ are set to be 0 dB and -10 dB, respectively. The thresholds are chosen to favor a speech/noise decision with low false acceptance rate. This results in more reliable covariance matrix estimation at the cost of discarding some time-frequency bins [16]. The ReLU activation function is used for all the hidden layers while the sigmoid function is chosen for the output layer. The network is totally single-channel based, i.e., it operates on each microphone signal independently. An illustration of the network architecture is shown in Fig. 2.

Figure 2: Illustration of the BLSTM network for mask prediction. The numbers in brackets indicate the number of nodes per layer.
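The thresholding of the ideal masks can be sketched as follows. The example assumes that the decision is made on the local SNR in dB, computed per time-frequency bin from the clean speech and noise STFT coefficients of one channel; the function name `ideal_masks` and the toy values are hypothetical.

```python
import numpy as np

def ideal_masks(speech_stft, noise_stft, th_speech_db=0.0, th_noise_db=-10.0, eps=1e-12):
    """Binary speech and noise masks from the local SNR (assumed definition)."""
    snr_db = 10.0 * np.log10((np.abs(speech_stft) ** 2 + eps) / (np.abs(noise_stft) ** 2 + eps))
    speech_mask = (snr_db > th_speech_db).astype(float)   # clearly speech-dominated bins
    noise_mask = (snr_db < th_noise_db).astype(float)     # clearly noise-dominated bins
    return speech_mask, noise_mask

# Three toy bins with local SNRs of about +6 dB, -6 dB and -20 dB.
speech = np.array([2.0, 0.5, 0.1], dtype=complex)
noise = np.array([1.0, 1.0, 1.0], dtype=complex)
speech_mask, noise_mask = ideal_masks(speech, noise)
print(speech_mask, noise_mask)
```

The middle bin (about -6 dB) is assigned to neither class, which illustrates how the two thresholds discard ambiguous bins in exchange for fewer false acceptances.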

In the training stage, the network is trained with all the simulated training utterances from the 6 channels. The simulated data from the development set is used for cross validation and early stopping. The weights of the BLSTM layer are initialized from a uniform distribution ranging from -0.05 to 0.05. The other layers are initialized with samples from a normal distribution with zero mean and a variance that depends on the number of input units. The Adam method [32] is employed to tune the network and the learning rate is adjusted adaptively. Cross-entropy loss is used as the optimization criterion. For better generalization performance, dropout is applied to all the hidden layers. The dropout rate is fixed to 0.5. Batch normalization [33] is applied to speed up the training process and help the network converge to a better local optimum.

In the test phase, the magnitude spectrum vector of the test signal is fed to the trained model and the output masks are in the [0, 1] range. The masks are obtained separately for each channel, and the median value is taken across channels. The median operation is robust to outliers in the case of microphone failure in the real recordings [13]. This value is then used to obtain the speech and noise covariance matrices using (23) and (24). The statistics are averaged over the whole sentence, which leads to time-invariant filters per utterance; such filters have been shown to be more advantageous than time-varying ones for this speech recognition task [29]. For the rank-1 MWF, the reference channel is decided by cross-channel correlations. The channel which has the highest average correlation score with the other channels is selected as the reference.
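The two test-time steps described above, median pooling of the per-channel masks and correlation-based reference channel selection, can be sketched as below. The use of the Pearson correlation between time-domain channel signals is an assumption of this example (the text only states that cross-channel correlations are used), and the function names and toy signals are hypothetical.

```python
import numpy as np

def pool_masks(channel_masks):
    """Median across channels; channel_masks has shape (channels, ...)."""
    return np.median(channel_masks, axis=0)

def select_reference(signals):
    """Index of the channel with the highest average correlation with the others.

    signals: real array of shape (channels, samples)
    """
    corr = np.abs(np.corrcoef(signals))
    np.fill_diagonal(corr, 0.0)                      # ignore self-correlation
    return int(np.argmax(corr.sum(axis=1) / (corr.shape[0] - 1)))

# A failed microphone (third row) produces outlier mask values that the median ignores.
channel_masks = np.array([[0.9, 0.1], [0.8, 0.2], [0.0, 1.0]])
pooled = pool_masks(channel_masks)

# Toy channels: a common source plus channel-dependent noise levels.
rng = np.random.default_rng(6)
source = rng.standard_normal(20000)
noise_levels = np.array([0.5, 0.1, 0.5, 2.0])
signals = source + noise_levels[:, None] * rng.standard_normal((4, 20000))
reference = select_reference(signals)
print(pooled, reference)
```

In this toy setup the least noisy channel is selected as the reference, and the mask values of the simulated failed microphone do not affect the pooled mask.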

The experimental setup follows the CHiME-4 challenge instructions: no extra information, such as the environment label, is exploited. The source code is

5.3. Recognition results - Acoustic model trained on noisy data

In the first experiment, two acoustic models are trained with the noisy data: one with utterances from the official channel 5 (20h) and the other with utterances from all 6 channels (120h). The involved linear filters are only applied to the development data and the test data. The WER results are given in Table 2.

From an overall perspective, the results of the linear filters follow the same trends for both acoustic models. The filters consistently enhance the recognition performance and lower WERs are achieved as expected with more training data. The performance difference between simulated data and real data is small on the development set. The overall higher error rates on the test set are due to the fact that the speakers of the test set speak in a less intelligible way [34]. The following discussions concentrate on the results achieved on the test set with the acoustic model trained on utterances from all 6 channels.

Table 2: WERs (%) achieved by the DNN-sMBR system trained on noisy data. The best result for each dataset is in bold.

Compared with the noisy baseline, it is obvious that all the multichannel methods improve the speech recognition performance. The WDAS beamformer is a simple but effective technique, which delivers 35% relative WER reduction on the real data. The MWF achieves less reduction here partly due to its sensitivity to mask estimation errors [35]. The MVDR filter is theoretically speech distortionless and further improvement is achieved over the WDAS filter. For instance, the WER is reduced from 12.86% to 8.89% on the real data. The GEV and GEV-BAN surprisingly lead to comparable results, despite the fact that BAN is believed to be crucial to the speech perceptual quality [19]. There is around 1% absolute difference on the simulated data though. The VS filter gets the lowest WER among the above ones. It is especially effective on the simulated data with an average 25% relative improvement over the MVDR filter. The recognition performance is clearly influenced by the projection direction of the beamformers as shown by the GEV, MWF filters and the VS, r1MWF-1 filters.

Figure 3: Illustration of the r1MWF-$\mu$ filter for an example sentence (M06 440C0201 BUS) from the real test set. (a) Spectral gain along frequency for different values of $\mu$. (b) Corresponding log-magnitude of one frame of the filtered signals. (c) and (d) Log-magnitude spectrograms of the filtered signals for two settings of $\mu$, respectively.

Regarding the rank-1 MWF variants without speech covariance matrix reconstruction, the distortionless r1MWF-0 works best on the simulated data while the residual noise power constrained r1MWF-$\mu_G$ works best on the real data. By changing the trade-off parameter from 0 to {1, 5, 10}, more noise reduction is achieved in the processed signal at the expense of more speech distortion. This results in worse recognition performance in this specific task: WERs increase as $\mu$ increases. Note that for the r1MWF-$\mu_G$, this trade-off parameter is frequency dependent. In Fig. 3, the spectral gain along frequency and the filtered signals are shown for different parameter values. The r1MWF-$\mu_G$ has small gain in the low frequencies and puts more weight in the high frequencies, leading to a relatively stable level of log-magnitudes as shown in Fig. 3(b) and Fig. 3(d). The differences in the (time-varying) spectral gain result in different recognition accuracies.

Additional improvement is observed with the speech covariance matrix reconstruction process. On the real test data, the WER is reduced from 8.89% to 8.09% for the r1MWF-$\mu_G$-evd and 7.71% for the r1MWF-$\mu_G$-gevd. Overall, the r1MWF-$\mu_G$-gevd gives the best result. It achieves a 40% relative WER reduction compared with the baseline WDAS beamformer on the real test set and a 15% relative WER reduction compared with the GEV-BAN method.
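The relative reductions quoted here follow from the absolute WERs given in the text. The short sketch below recomputes them, reading 12.86% as the WDAS result, 8.89% as the WER before covariance reconstruction and 7.71% as the best result on the real test set, as stated in the surrounding paragraphs; the function name is hypothetical.

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

wdas_to_best = relative_reduction(12.86, 7.71)            # WDAS baseline vs. best filter
reconstruction_gain = relative_reduction(8.89, 7.71)      # effect of GEVD reconstruction
print(round(wdas_to_best, 1))          # 40.0
print(round(reconstruction_gain, 1))
```

The first value reproduces the 40% relative reduction reported against the WDAS baseline.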

An interesting experiment is to check the performance of these filters using the correct masks instead of the predicted ones. This presumably would help to partially discriminate the error rate caused by covariance estimation errors and limitations of the multichannel linear filters themselves. The correct masks for the simulated data are well defined; however, the ground truth underlying the real data is not readily available. The method in [29] is adopted for the ground truth estimation for real data and then the masks are calculated using (29) and (30). The recognition results are summarized in Table 3, with the percentages in brackets denoting the relative WER changes from the results in the left half of Table 2, which are obtained with the BLSTM predicted masks.

Table 3: WERs (%) achieved by the DNN-sMBR system trained on noisy data from channel 5. These filters are computed from the correct masks. The percentages in brackets denote the relative WER changes from the results obtained with the BLSTM predicted masks. The best result for each dataset is in bold.

The relative performance between the linear filters is generally consistent with the previous results, though a reduction of WERs on the simulated data is observed and an overall increase of WERs is observed on the real data. For instance, the WERs of GEV-BAN on the test set decrease by 29% relative on simulated data and increase by 14% relative on real data. This indicates that GEV-BAN would benefit from better estimated masks on simulated data. This also indicates that the ground truth estimation process is not perfect and GEV-BAN is prone to covariance estimation errors. In comparison, the VS filter is more robust to mask misestimation and achieves the lowest WERs on the development set. Comparing r1MWF-$\mu_G$-evd and r1MWF-$\mu_G$-gevd to r1MWF-$\mu_G$, the rank-1 constraint on the speech covariance matrix still leads to lower error rates. On the real test data, r1MWF-$\mu_G$-gevd achieves a 16% relative WER reduction compared with the GEV-BAN method in this case.

5.4. Recognition results - Acoustic model trained on enhanced data

Table 4: WERs (%) achieved by the DNN-sMBR system trained on enhanced data. The best result for each dataset is in bold.

In the second experiment, the acoustic model is retrained with the filtered training data. The WERs are shown in Table 4. They are comparable to the left half of Table 2 in the sense that the amount of training data is the same.

On the real data, all linear filters generally achieve higher error rates than in the first experiment, except for the GEV filter. On the simulated data, the WERs are generally lower. The proposed r1MWF-$\mu_G$-gevd is still the best on real data. Note that retraining the acoustic model every time is rather time-consuming and not efficient in practice. The results here provide a strong argument for noisy training, which extends the argument made specifically for the GEV-BAN in [13].

5.5. Analysis

The above results suggest that neither speech distortion nor noise reduction is straightforwardly correlated with the speech recognition performance. Indeed, the GEV introduces more speech distortion than the theoretically distortionless MVDR but it performs better in the second experiment. The r1MWF-5/10 are supposed to deliver more noise reduction than the r1MWF-0 but they give higher WERs.

In the following, we investigate the rank-1 MWF variants and their WERs achieved on the noisy acoustic model trained on utterances from all 6 channels. In Fig. 4, the relation between the WERs and the speech distortion scores is shown. The frequency-weighted log-spectral Signal Distortion (SD) metric [36]

is defined as

where L is the number of frames, the two power spectra involved are those of the processed speech and of the clean speech, and ERB(k) is the frequency-weighting factor giving equal weight to each auditory critical band. The SD scores are computed and averaged on the simulated test data. We observe that the r1MWF-$\mu_G$ introduces much larger distortion than the r1MWF-0/1, from about 9 dB to 16 dB. But the WER only increases slightly. Clearly, there is no strong correlation between the two.

Figure 4: WERs achieved on the acoustic model trained on utterances from all 6 channels and SD scores of the r1MWF variants. WERs are represented by white bars and SD scores are marked by circles.

Figure 5: Relation between FV and WER for the real (solid black markers) and simulated (hollow red markers) data. The dashed lines show the linear regression results separately. A line with positive slope means positive correlation.

In order to explain the recognition performance, we investigate the variance of the input features corresponding to each HMM state in Fig. 5. The intuition is that smaller Feature Variance (FV) implies an easier classification task for the neural network acoustic model. We expect the constant residual noise power property of the r1MWF-$\mu_G$ to translate into a smaller FV for the processed speech. The HMM state corresponding to each feature vector is first obtained by forced alignment on enhanced data separately. Note that the alignments of the simulated data can be obtained using the clean speech; nevertheless, similar results are observed here. The FV is calculated over all the feature vectors belonging to each HMM state for each method

$$\mathrm{FV}(j) = \sum_i \mathrm{Var}(i, j), \qquad (32)$$

where Var(i, j) means the variance of the ith feature in the jth state. We pick the FV of the r1MWF-0 as a baseline and define the metric

$$P = \frac{\sum_j N_j\,\mathbb{1}\big(\mathrm{FV}(j) > \mathrm{FV}_{\text{r1MWF-0}}(j)\big)}{\sum_j N_j} \times 100\%, \qquad (33)$$

that is the weighted percentage of states for which the FV is larger than the baseline. $N_j$ denotes the number of occurrences of the jth state. $\mathbb{1}(\cdot)$ is an indicator function the value of which is 1 for true arguments and 0 for false. For a comparable method, the value is expected to be around 50%. On the real data, the r1MWF-1 and r1MWF-5 have higher percentages (62.7% and 68.4%) and corresponding higher WERs. For the r1MWF-$\mu_G$-evd and r1MWF-$\mu_G$-gevd, lower percentages (23.1% and 23.6%) correlate with lower WERs. However, the correlation is not always valid on the simulated data as shown by the r1MWF-$\mu_G$: it has 43.7% states with smaller FV and yet a higher WER than the baseline.
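The FV-based metric can be computed directly from frame-level features and their state alignments. In the sketch below, the per-state FV is taken as the sum of the per-dimension variances, which is an assumption of this example, and the metric is the occurrence-weighted percentage of states whose FV exceeds that of the baseline. The function names and the toy numbers are hypothetical.

```python
import numpy as np

def state_feature_variance(features, states, num_states):
    """Per-state feature variance and state occurrence counts.

    features: array of shape (frames, dims); states: int array of shape (frames,)
    """
    fv = np.zeros(num_states)
    counts = np.zeros(num_states)
    for j in range(num_states):
        frames = features[states == j]
        counts[j] = len(frames)
        if len(frames) > 0:
            fv[j] = frames.var(axis=0).sum()   # assumed: sum of per-dimension variances
    return fv, counts

def fv_larger_percentage(fv_method, fv_baseline, counts):
    """Occurrence-weighted percentage of states with FV larger than the baseline."""
    larger = fv_method > fv_baseline
    return 100.0 * np.sum(counts * larger) / np.sum(counts)

# Toy alignment: two states with two frames each.
features = np.array([[0.0, 0.0], [2.0, 2.0], [1.0, 1.0], [1.0, 1.0]])
states = np.array([0, 0, 1, 1])
fv, counts = state_feature_variance(features, states, num_states=2)

# Toy comparison against a baseline over three states.
percentage = fv_larger_percentage(np.array([1.0, 3.0, 2.0]),
                                  np.array([2.0, 2.0, 2.5]),
                                  np.array([10.0, 30.0, 60.0]))
print(fv, counts, percentage)
```

Only the second state has a larger FV than the baseline in the toy comparison, so the metric equals that state's share of the frames (30%). No decoding pass is needed to obtain this number.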

The FV metric provides another view from the feature side to explain the performance of the constant residual noise filter. Note that a global scale factor only results in a shift in the 0th MFCC value and will not affect the feature variance. The computation of FV also avoids the time-consuming decoding procedure that is required for WER.

6. Conclusion

Multichannel linear filters are generally designed to improve the speech perceptual quality but not specifically to improve the speech recognition accuracy. As a matter of fact, the choice of the optimal filter may be different for different tasks. In the scenario of a single target source, the popular SDW-MWF can be formulated as the rank-1 MWF. We derived a family of rank-1 MWF variants and evaluated their performance for speech recognition in multiple noisy environments. We defined a constant residual noise power constraint to find the trade-off parameter which links the rank-1 MWF filter and the GEV beamformer. We showed that this constraint brings more speech distortion; however, it benefits the speech recognition performance on the real data. To fulfill the underlying rank-1 assumption, speech covariance matrix reconstruction is proposed. The reconstruction based on eigenvectors or generalized eigenvectors subsequently improves the recognition accuracy. With experiments conducted on the CHiME-4 dataset, the final r1MWF-$\mu_G$-gevd filter achieved a 40% relative WER reduction compared with the baseline WDAS beamformer on the real test set and a 15% relative WER reduction compared with the GEV-BAN method. For future research, we would like to see how the performance is impacted for corpora with higher reverberation time where the narrowband approximation becomes more erroneous.

In the speech recognition task, it is observed that multi-condition noisy training works well and sometimes outperforms retraining with enhanced data. So when new signal processing methods are applied, a reasonable practice is to process only the test data. Another finding is that the speech perceptual quality is not straightforwardly related to the speech recognition performance. An investigation from the perspective of feature variance is provided. The work puts forward the need for novel signal or feature metrics that correlate better with the WER.

7. Acknowledgements

We would like to thank the anonymous reviewers for their constructive comments. This work was supported by the China Scholarship Council (No. 201604910623). Experiments presented in this paper were carried out using the Grid’5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well as other

References

[10] T. Van den Bogaert, S. Doclo, J. Wouters, M. Moonen, Speech enhance-

[11] X. Xiao, S. Watanabe, H. Erdogan, L. Lu, J. Hershey, M. L. Seltzer,

[12] B. Li, T. N. Sainath, R. J. Weiss, K. W. Wilson, M. Bacchiani, Neural

[13] J. Heymann, L. Drude, R. Haeb-Umbach, Neural network based spectral

[14] S. Sivasankaran, A. A. Nugraha, E. Vincent, J. A. Morales-Cordovilla,

[15] A. A. Nugraha, A. Liutkus, E. Vincent, Multichannel audio source sep-

[16] J. Heymann, L. Drude, A. Chinaev, R. Haeb-Umbach, BLSTM supported

[17] H. Erdogan, T. Hayashi, J. R. Hershey, T. Hori, C. Hori, W.-N. Hsu,

[18] H. Cox, R. M. Zeskind, M. Owen, Robust adaptive beamforming, IEEE

[19] E. Warsitz, R. Haeb-Umbach, Blind acoustic beamforming based on gen-

[20] S. Doclo, M. Moonen, GSVD-based optimal filtering for single and multi-

[21] A. Spriet, M. Moonen, J. Wouters, Spatially pre-processed speech distor-

[22] S. Doclo, A. Spriet, J. Wouters, M. Moonen, Frequency-domain criterion

[23] R. Serizel, M. Moonen, B. Van Dijk, J. Wouters, Low-rank approximation

[24] J. R. Jensen, J. Benesty, M. G. Christensen, Noise reduction with optimal

[25] J. Benesty, J. Chen, Y. Huang, Noncausal (frequency-domain) optimal fil-

[26] M. Souden, J. Benesty, S. Affes, On optimal frequency-domain multichannel

[27] J. Benesty, M. Souden, J. Chen, A perspective on multichannel noise re-

[28] S. Gannot, E. Vincent, S. Markovich-Golan, A. Ozerov, A consolidated

[29] E. Vincent, S. Watanabe, A. A. Nugraha, J. Barker, R. Marxer, An analy-

[30] S. Braun, K. Kowalczyk, E. A. Habets, Residual noise control using a

[31] S. Markovich, S. Gannot, I. Cohen, Multichannel eigenspace beamforming

[32] D. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv

[33] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network train-

[34] J. Barker, R. Marxer, E. Vincent, S. Watanabe, The third CHiME speech

[35] B. Cornelis, M. Moonen, J. Wouters, Performance analysis of multichan-

[36] C. H. Taal, R. C. Hendriks, R. Heusdens, J. Jensen, An evaluation of
