Recently, with the advancement of deep learning, great progress has been made in end-to-end (E2E) automatic speech recognition (ASR). With the goal of directly mapping a sequence of speech frames to a sequence of output tokens, an E2E ASR system incorporates the acoustic model, language model and pronunciation model of a conventional ASR system into a single deep neural network (DNN). The most dominant approaches for E2E ASR include connectionist temporal classification (CTC) [1, 2], recurrent neural network transducer (RNNT) [3] and attention-based encoder-decoder (AED) models [4, 5, 6].
However, the performance of E2E ASR degrades significantly when an acoustic mismatch exists between training and test conditions. An intuitive solution is domain adaptation, where a well-trained source-domain E2E model is adapted to the data in the target domain. Different from speaker adaptation, domain adaptation allows for the usage of a large amount of adaptation data in both source and target domains.
There are plenty of domain adaptation methods for hybrid systems that we can leverage for adapting E2E systems. One popular approach is adversarial learning, in which an intermediate deep feature [7, 8, 9] or a front-end speech feature [10, 11] is learned to be invariant to the shifts between source and target domains. Adversarial domain adaptation is suitable for the situation where no transcription or parallel adaptation data in both domains is available. It can also effectively suppress the environment [12, 13, 14] and speaker [15, 16] variability during domain adaptation. However, in the speech area, a parallel sequence of target-domain data can be easily simulated from the source-domain data such that the speech from both domains is frame-by-frame synchronized. To take advantage of this, teacher-student (T/S) learning [17] was proposed for the unsupervised domain adaptation of acoustic models in DNN-hidden Markov model (HMM) hybrid systems [18]. In T/S learning, the Kullback-Leibler (KL) divergence between the output senone distributions of the teacher and student acoustic models, given parallel source and target domain data at the input, is minimized by updating only the student model parameters. T/S training was shown to outperform the cross-entropy training directly using the hard label in the target domain [18, 19, 20, 21, 22].
One drawback of unsupervised T/S learning is that the teacher model is not perfect and will sometimes make inaccurate predictions that mislead the student model toward suboptimal directions. To overcome this, one-hot ground-truth labels are used to compensate for the teacher’s imperfections. Hinton et al. proposed interpolated T/S (IT/S) learning [23] to interpolate the teacher’s soft class posteriors with the one-hot ground truth using a pair of globally fixed weights. However, the optimal weights are data-dependent and can only be determined through careful tuning on a dev set. More recently, conditional T/S (CT/S) learning was proposed in [21], where the student model selectively chooses to learn from either the teacher or the ground truth depending on whether the teacher’s prediction is correct or not. CT/S does not disturb the statistical relationships among classes naturally embedded in the class posteriors and achieves significant word error rate (WER) improvement over T/S for domain adaptation on the CHiME-3 dataset [24].
In this work, we focus on the domain adaptation of AED models for E2E ASR by using T/S learning which was previously applied to learn small-footprint AED models in [25, 26, 27] by distilling knowledge from a large powerful teacher AED. For unsupervised domain adaptation, we extend T/S learning to AED models by introducing a two-level knowledge transfer: in addition to learning from the teacher’s soft token posteriors, the student AED also conditions its decoder on the one-best token sequence decoded by the teacher AED.
We further propose an adaptive T/S (AT/S) learning method to improve T/S learning using ground-truth labels. By taking advantage of both IT/S and CT/S, AT/S adaptively assigns a pair of weights to the teacher’s soft token posteriors and the one-hot ground-truth label at each decoder step depending on the confidence scores on each of the labels. The confidence scores are dynamically estimated as a function of soft and one-hot labels. The student AED learns from an adaptive linear combination of both labels. AT/S inherits the linear interpolation of soft and one-hot labels from IT/S and borrows from CT/S the judgement on the credibility of both knowledge sources before merging them. It is expected to achieve improved performance over the other T/S methods for domain adaptation. As a general deep learning method, AT/S can be widely applied to the domain adaptation or model compression of any DNN.
With 3400 hours close-talk and far-field Microsoft Cortana data for domain adaptation, T/S learning achieves up to 24.9% and 6.3% relative WER gains over close-talk and far-field baseline AEDs, respectively. AT/S improves the close-talk and far-field AEDs by 28.2% and 10.3%, respectively, consistently outperforming IT/S and CT/S.
In this work, we perform domain adaptation on AED models [4, 5, 6]. The AED model was first introduced in [28, 29] for neural machine translation. Without any conditional independence assumption as in CTC [1], AED was successfully applied to E2E ASR in [4, 5, 6] and has recently achieved superior performance to conventional hybrid systems in [30].
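AED directly models the conditional probability distribution $P(Y|X)$ over sequences of output tokens $Y = \{y_1, \ldots, y_L\}$ given a sequence of input speech frames $X = \{x_1, \ldots, x_N\}$ as below:

$$P(Y|X) = \prod_{l=1}^{L} P(y_l \mid Y_{0:l-1}, X) \tag{1}$$

where $Y_{0:l-1}$ denotes the tokens preceding step $l$.

To achieve this, the AED model incorporates an encoder, a decoder and an attention network. The encoder maps a sequence of input speech frames $X$ into a sequence of high-level features $H = \{h_1, \ldots, h_N\}$ through an RNN. An attention network is used to determine which encoded features in $H$ should be attended to predict the output label $y_l$ and to generate a context vector $z_l$ as a linear combination of $H$ [4]. A decoder is used to model $P(Y|H)$, which is equivalent to $P(Y|X)$. At each decoder step $l$, the decoder RNN takes the sum of the previous token embedding $e_{l-1}$ and the context vector $z_{l-1}$ as the input to predict the conditional probability of each token, i.e., $P(u \mid Y_{0:l-1}, H), u \in U$, where $U$ is the set of all the output tokens:

$$s_l = \text{RNN}^{\text{dec}}(s_{l-1}, e_{l-1} + z_{l-1}) \tag{2}$$

$$\left[P(u \mid Y_{0:l-1}, H)\right]_{u \in U} = \text{softmax}\left[W_y (s_l + z_l) + b_y\right] \tag{3}$$

where $s_l$ is the hidden state of the decoder RNN. The bias $b_y$ and the matrix $W_y$ are learnable parameters.

An AED model is trained to minimize the following cross-entropy (CE) loss on the training corpus $\mathbb{T}_r$:

$$\mathcal{L}_{\text{CE}}(\theta^{\text{AED}}) = -\sum_{(X, Y^G) \in \mathbb{T}_r} \sum_{l=1}^{|Y^G|} \log P(y^G_l \mid Y^G_{0:l-1}, H; \theta^{\text{AED}}) \tag{4}$$

where $Y^G = \{y^G_1, \ldots, y^G_{|Y^G|}\}$ is the sequence of ground-truth tokens, $|Y^G|$ represents the number of elements in $Y^G$ and $\theta^{\text{AED}}$ denotes all the model parameters in AED.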
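As a minimal illustration of the softmax output layer and the CE criterion described above, the following Python sketch computes the loss of one utterance from decoder output scores. The function names, the three-token vocabulary and the score values are hypothetical, and the encoder, attention network and decoder RNN are replaced by precomputed scores, so this is only a sketch under those assumptions and not the implementation used in the experiments.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ce_loss(step_logits, target_ids):
    """Sum over decoder steps of -log P(ground-truth token)."""
    loss = 0.0
    for logits, y in zip(step_logits, target_ids):
        loss -= math.log(softmax(logits)[y])
    return loss


# Hypothetical output-layer scores for a 2-step utterance, 3-token vocabulary.
step_logits = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
target_ids = [0, 1]  # ground-truth token index at each decoder step
loss = ce_loss(step_logits, target_ids)
```

In a full system the scores at each step would come from the decoder state and context vector, and the loss would be accumulated over the whole training corpus.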
For unsupervised domain adaptation, we want to make use of a large amount of unlabeled data that is widely available. As shown in Fig. 1, with T/S learning, only two sequences of parallel data are required: an input sequence of source-domain speech frames to the teacher AED, $X^T = \{x^T_1, \ldots, x^T_N\}$, and an input sequence of target-domain speech frames to the student model, $X^S = \{x^S_1, \ldots, x^S_N\}$, where the superscripts $T$ and $S$ stand for teacher and student. $X^T$ and $X^S$ are parallel to each other, i.e., each pair of $x^T_n$ and $x^S_n$ is frame-by-frame synchronized. For most domain adaptation tasks in ASR, such as adapting from clean to noisy speech, close-talk to far-field speech, or wide-band to narrow-band speech, the parallel data in the target domain can be easily simulated from the data in the source domain [18, 20].

Fig. 1. T/S learning for unsupervised domain adaptation of AED model for E2E ASR. The two orange lines signify the two-level knowledge transfer.

Our goal is to train a student AED that can accurately predict the tokens of the target-domain data by forcing the student to emulate the behaviors of the teacher. To achieve this, we minimize the Kullback-Leibler (KL) divergence between the token-level output distributions of the teacher and the student AEDs given that the parallel data $X^T$ and $X^S$ are fed as the input to the AEDs. The KL divergence between the token-level output distributions of the teacher and student AEDs is formulated below:

$$\sum_{l=1}^{|Y^T|} \mathrm{KL}\left[P(\cdot \mid Y^T_{0:l-1}, X^T; \theta^T) \,\|\, P(\cdot \mid Y^T_{0:l-1}, X^S; \theta^S)\right] = \sum_{l=1}^{|Y^T|} \sum_{u \in U} P(u \mid Y^T_{0:l-1}, X^T; \theta^T) \log \frac{P(u \mid Y^T_{0:l-1}, X^T; \theta^T)}{P(u \mid Y^T_{0:l-1}, X^S; \theta^S)} \tag{5}$$

where $Y^T = \{y^T_1, \ldots, y^T_{|Y^T|}\}$ is the one-best token sequence decoded by the teacher AED as follows:

$$Y^T = \arg\max_{Y} P(Y \mid X^T; \theta^T) = \arg\max_{Y} \prod_{l=1}^{|Y|} P(y_l \mid Y_{0:l-1}, X^T; \theta^T) \tag{6}$$

where $|Y^T|$ is the number of tokens in $Y^T$, and $\theta^T$, $\theta^S$ denote all the parameters in the teacher and student AED models, respectively. Note that, for unsupervised domain adaptation, the teacher AED can only condition its decoder on the token $y^T_{l-1}$ predicted at the previous step since the ground-truth labels $Y^G$ are not available. We minimize the KL divergence with respect to $\theta^S$ while keeping $\theta^T$ fixed on the adaptation data corpus $\mathbb{A}$, which is equivalent to minimizing the token-level T/S loss function below:

$$\mathcal{L}_{\text{TS}}(\theta^S) = -\sum_{(X^T, X^S) \in \mathbb{A}} \sum_{l=1}^{|Y^T|} \sum_{u \in U} P(u \mid Y^T_{0:l-1}, X^T; \theta^T) \log P(u \mid Y^T_{0:l-1}, X^S; \theta^S) \tag{7}$$
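The equivalence claimed above holds because the KL divergence equals the soft cross-entropy between teacher and student posteriors minus the teacher's entropy, and the latter does not depend on the student parameters. The Python sketch below checks this numerically for one decoder step; the function names and the posterior values are hypothetical, and the posteriors are given directly instead of being produced by AED models.

```python
import math


def soft_cross_entropy(teacher, student):
    """-sum_u P_teacher(u) * log P_student(u) for one decoder step."""
    return -sum(p * math.log(q) for p, q in zip(teacher, student) if p > 0.0)


def entropy(dist):
    """Entropy of the teacher distribution; constant w.r.t. the student."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def kl_divergence(teacher, student):
    """KL(teacher || student) for one decoder step."""
    return sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0.0)


def token_level_ts_loss(teacher_steps, student_steps):
    """Token-level T/S loss of one utterance: soft CE summed over steps."""
    return sum(soft_cross_entropy(t, s)
               for t, s in zip(teacher_steps, student_steps))


teacher = [0.7, 0.2, 0.1]  # hypothetical teacher posteriors
student = [0.5, 0.3, 0.2]  # hypothetical student posteriors
gap = kl_divergence(teacher, student) - (
    soft_cross_entropy(teacher, student) - entropy(teacher))
```

Here `gap` is zero up to floating-point error, so minimizing either quantity with respect to the student gives the same solution.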
The steps of token-level T/S learning for unsupervised domain adaptation of AED model are summarized as follows:
1. Clone the student AED from a teacher AED well-trained with transcribed source-domain data by minimizing Eq. (4).
2. Forward-propagate the source-domain data $X^T$ through the teacher AED, generate the teacher’s one-best token sequence $Y^T$ using Eq. (6) and the teacher’s soft posteriors for each decoder step, $P(u \mid Y^T_{0:l-1}, X^T; \theta^T), u \in U$, by Eqs. (2) and (3).
3. Forward-propagate the target-domain data $X^S$ (parallel to $X^T$) through the student AED, generate the student’s soft posteriors for each teacher’s decoder step, $P(u \mid Y^T_{0:l-1}, X^S; \theta^S), u \in U$, by Eqs. (2) and (3).
4. Compute the error signal of the T/S loss function in Eq. (7), back-propagate the error through the student AED and update the parameters of the student AED.
5. Repeat Steps 2 to 4 until convergence.
After T/S learning, only the adapted student AED is used for testing and the teacher AED is discarded.
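The loop formed by the steps above can be illustrated with a deliberately small toy problem. In the Python sketch below, the teacher is reduced to a fixed posterior vector for a single decoder step and the student to a vector of trainable output scores, which is an assumption made purely for illustration; the names `ts_step`, `teacher_post` and `student_logits`, the learning rate and the iteration count are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ts_step(student_logits, teacher_post, lr):
    """One gradient step on -sum_u P_teacher(u) log P_student(u).

    For a softmax output the gradient w.r.t. the student scores is
    P_student - P_teacher, because the teacher posteriors sum to one.
    """
    student_post = softmax(student_logits)
    grad = [q - p for q, p in zip(student_post, teacher_post)]
    return [z - lr * g for z, g in zip(student_logits, grad)]


teacher_post = [0.7, 0.2, 0.1]    # frozen teacher output (Step 2)
student_logits = [0.0, 0.0, 0.0]  # toy stand-in for the student parameters
for _ in range(500):              # Steps 3 to 5: forward, update, repeat
    student_logits = ts_step(student_logits, teacher_post, lr=0.5)
student_post = softmax(student_logits)
```

After the loop the student posteriors closely match the teacher posteriors. In actual T/S adaptation the gradient is back-propagated through the whole student AED and the teacher posteriors change with every utterance and decoder step.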
From Eqs. (6) and (7), to extend T/S learning to AED-based E2E models, two levels of knowledge transfer are involved: 1) the student learns from the teacher’s soft token posteriors at each decoder step; 2) the student AED conditions its decoder on the previous token $y^T_{l-1}$ predicted by the teacher to make the current prediction.
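Sequence-level T/S learning [25, 31] is another method for unsupervised domain adaptation in which a KL divergence between the sequence-level output distributions of the teacher and student AEDs is minimized. Equivalently, we minimize the sequence-level T/S loss function below with respect to $\theta^S$:

$$\begin{aligned} \mathcal{L}_{\text{SEQ-TS}}(\theta^S) &= -\sum_{(X^T, X^S) \in \mathbb{A}} \sum_{Y \in V} P(Y \mid X^T; \theta^T) \log P(Y \mid X^S; \theta^S) \\ &\approx -\sum_{(X^T, X^S) \in \mathbb{A}} \sum_{Y \in V} \mathbb{1}[Y = Y^T] \log P(Y \mid X^S; \theta^S) \\ &= -\sum_{(X^T, X^S) \in \mathbb{A}} \sum_{l=1}^{|Y^T|} \log P(y^T_l \mid Y^T_{0:l-1}, X^S; \theta^S) \end{aligned} \tag{8}$$

where $V$ is the set of all possible token sequences and the teacher’s sequence-level output distribution $P(Y \mid X^T; \theta^T)$ is approximated by $\mathbb{1}[Y = Y^T]$ for easy implementation. $\mathbb{1}[\cdot]$ is an indicator function which equals 1 if the condition in the square brackets is satisfied and 0 otherwise.

From Eq. (8), we see that only one level of knowledge transfer exists in sequence-level T/S, i.e., the one-best token sequence decoded by the teacher AED. The student AED learns from $Y^T$ and conditions its decoder on it at each step. Different from token-level T/S, in sequence-level T/S, the one-hot labels in $Y^T$ are used as training targets of the student AED instead of the soft token posteriors.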
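To make the contrast with token-level T/S concrete, the Python sketch below computes the sequence-level loss of one utterance, where only the teacher's one-best tokens (and not its full posteriors) reach the student. The function names and posterior values are hypothetical, and greedy decoding is approximated by a per-step argmax over teacher posteriors that are assumed to have been computed along the greedy path.

```python
import math


def one_best(teacher_steps):
    """Greedy one-best token index at each decoder step."""
    return [max(range(len(p)), key=p.__getitem__) for p in teacher_steps]


def sequence_level_ts_loss(teacher_steps, student_steps):
    """-sum_l log P_student(teacher's one-best token at step l)."""
    hyp = one_best(teacher_steps)
    return -sum(math.log(s[y]) for s, y in zip(student_steps, hyp))


# Hypothetical per-step posteriors over a 3-token vocabulary.
teacher_steps = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
student_steps = [[0.5, 0.3, 0.2], [0.25, 0.25, 0.5]]
hyp = one_best(teacher_steps)
loss = sequence_level_ts_loss(teacher_steps, student_steps)
```

The teacher's second and third choices at each step have no influence on this loss, which is the information that token-level T/S retains through the soft posteriors.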
In this section, we want to make good use of the ground-truth labels of the adaptation data to further improve the T/S domain adaptation. Note that, different from unsupervised T/S in Section 3, in supervised domain adaptation the teacher AED conditions its decoder on the ground-truth token $y^G_{l-1}$ instead of its previous decoding result $y^T_{l-1}$, because the token transcription $Y^G$ is available in addition to $X^T$ and $X^S$.
One shortcoming of unsupervised T/S learning is that the teacher model can sporadically predict inaccurate token posteriors, which misleads the student AED towards suboptimal performance. One-hot ground-truth labels can be utilized to alleviate this issue. One possible solution is the interpolated T/S (IT/S) learning [23], in which a weighted sum of the teacher’s soft posteriors and the one-hot ground truth is used as the target to train the student AED. A pair of global weights summing to one is applied to each pair of soft and one-hot labels. However, the optimal global weights are hard to determine because they are data-dependent and need to be carefully tuned on a dev set.
To address this issue, conditional T/S learning (CT/S) [21] was proposed recently, in which the student selectively chooses to learn from either the teacher AED or the ground truth conditioned on whether the teacher AED can correctly predict the ground-truth labels. CT/S has shown significant WER improvements over T/S and IT/S for both domain and speaker adaptation on the CHiME-3 dataset. However, in CT/S, the student is still not “smart” enough because, for each token, the student AED solely relies on either the teacher’s posteriors or the ground truth instead of dynamically extracting useful knowledge from both.
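To further improve the effectiveness of knowledge transfer, we propose an adaptive teacher-student (AT/S) learning method by taking advantage of both CT/S and IT/S. As shown in Fig. 2, instead of assigning a fixed pair of soft weight $w$ and one-hot weight $(1 - w)$ for all the decoder steps, we adaptively weight the teacher’s soft posteriors at the $l$-th decoder step, $P(u \mid Y^G_{0:l-1}, X^T; \theta^T), u \in U$, by $w_l \in [0, 1]$ and the one-hot vector of the $l$-th token in the ground-truth sequence $Y^G$ by $(1 - w_l)$. In order to quantify the value of the knowledge to be transferred, $w_l$ should be positively correlated with a confidence score $c_l$ on the teacher’s prediction on token posteriors, while $(1 - w_l)$ should be positively correlated with a confidence score on the ground truth $d_l$. To achieve this, we compute $w_l$ by normalizing $c_l$ against its summation with $d_l$:

$$w_l = \frac{c_l}{c_l + d_l} \tag{9}$$

Fig. 2. Adaptive T/S (AT/S) learning for supervised domain adaptation of AED model for E2E ASR.

It is in general true that the higher the posterior a teacher assigns to the correct (ground-truth) token $y^G_l$, the more accurate the teacher’s soft posteriors are at this decoder step. Therefore, the confidence score $c_l$ on the teacher’s soft posteriors can be any monotonically increasing function of the correct token posterior predicted by the teacher, $P(y^G_l \mid Y^G_{0:l-1}, X^T; \theta^T)$, while the confidence score $d_l$ on the one-hot ground truth can be any monotonically increasing function of $1 - P(y^G_l \mid Y^G_{0:l-1}, X^T; \theta^T)$, as follows:

$$c_l = f_1\left[P(y^G_l \mid Y^G_{0:l-1}, X^T; \theta^T)\right] \tag{10}$$

$$d_l = f_2\left[1 - P(y^G_l \mid Y^G_{0:l-1}, X^T; \theta^T)\right] \tag{11}$$

where both $f_1$ and $f_2$ are any monotonically increasing functions on the interval [0, 1]. In this work, we simply assume that $f_1$ and $f_2$ are both power functions of the same form, i.e., $f_1(x) = x^{\lambda_1}$ and $f_2(x) = x^{\lambda_2}$ with $\lambda_1, \lambda_2 > 0$. Note that $w_l$ equals $P(y^G_l \mid Y^G_{0:l-1}, X^T; \theta^T)$ when $\lambda_1 = \lambda_2 = 1$.

In AT/S, a linear combination of the teacher’s soft posteriors and the one-hot ground truth weighted by $w_l$ and $(1 - w_l)$, respectively, is used as the training target for the student AED at each decoder step. The AT/S loss function is formulated as

$$\mathcal{L}_{\text{ATS}}(\theta^S) = -\sum_{(X^T, X^S, Y^G) \in \mathbb{A}} \sum_{l=1}^{|Y^G|} \left\{ (1 - w_l) \log P(y^G_l \mid Y^G_{0:l-1}, X^S; \theta^S) + w_l \sum_{u \in U} P(u \mid Y^G_{0:l-1}, X^T; \theta^T) \log P(u \mid Y^G_{0:l-1}, X^S; \theta^S) \right\} \tag{12}$$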
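A minimal Python sketch of the adaptive weighting for a single decoder step is given below, assuming power-function confidence scores with two positive exponents. The names `adaptive_weight`, `ats_step_loss`, `lam1` and `lam2`, as well as the posterior values, are hypothetical and chosen only for illustration.

```python
import math


def adaptive_weight(p_correct, lam1=1.0, lam2=1.0):
    """Weight on the teacher's soft posteriors at one decoder step.

    p_correct is the teacher posterior of the ground-truth token; the
    confidence scores are assumed to be power functions with positive
    exponents lam1 and lam2.
    """
    c = p_correct ** lam1          # confidence in the teacher
    d = (1.0 - p_correct) ** lam2  # confidence in the one-hot label
    return c / (c + d)


def ats_step_loss(teacher_post, student_post, gt_index, lam1=1.0, lam2=1.0):
    """AT/S loss contribution of one decoder step."""
    w = adaptive_weight(teacher_post[gt_index], lam1, lam2)
    one_hot_term = math.log(student_post[gt_index])
    soft_term = sum(p * math.log(q)
                    for p, q in zip(teacher_post, student_post))
    return -((1.0 - w) * one_hot_term + w * soft_term)


teacher_post = [0.6, 0.3, 0.1]  # hypothetical teacher posteriors
student_post = [0.5, 0.3, 0.2]  # hypothetical student posteriors
step_loss = ats_step_loss(teacher_post, student_post, gt_index=0)
```

With both exponents equal to 1 the weight is the teacher's posterior of the correct token itself. With both exponents equal to 0.5, a correct-token posterior of 0.9 gives a weight of 0.75, so exponents below 1 pull the weights toward an even split between the two knowledge sources.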
The steps of AT/S learning for supervised domain adaptation of AED model are summarized as follows:
1. Perform token-level unsupervised T/S adaptation by following the steps in Section 3 as the initialization.
2. Forward-propagate the parallel source and target domain data $X^T$ and $X^S$ through the teacher and student AEDs, generate the teacher’s and student’s soft posteriors $P(u \mid Y^G_{0:l-1}, X^T; \theta^T)$ and $P(u \mid Y^G_{0:l-1}, X^S; \theta^S)$, $u \in U$, for each decoder step by Eqs. (2) and (3).
3. Compute the confidence scores $c_l$ and $d_l$ for the teacher’s soft posteriors and the one-hot vector of the ground truth $y^G_l$ by Eqs. (10) and (11), compute the adaptive weight $w_l$ by Eq. (9).
4. Compute the error signal of the AT/S loss function in Eq. (12), back-propagate the error through the student AED and update the parameters of the student AED.
5. Repeat Steps 2 to 4 until convergence.
AT/S is superior to IT/S in that the combination weights for soft and one-hot labels at each decoder step are adaptively assigned according to the confidence score on both labels. AT/S will degenerate to IT/S if the combination weights are fixed globally. Compared to CT/S, in AT/S, the student always adaptively learns from both the teacher’s soft posteriors and the one-hot ground truth rather than choosing either of them depending on the correctness of teacher’s prediction.
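The comparison above can be summarized by noting that all three methods build the student's target as an interpolation of the teacher's posteriors and the one-hot ground truth, and differ only in how the weight on the teacher is chosen at each decoder step. The Python sketch below illustrates this view; the function names, the default values and the example posteriors are hypothetical, and the AT/S weight assumes power-function confidence scores.

```python
def interpolate(teacher_post, gt_index, w):
    """Target = w * teacher posteriors + (1 - w) * one-hot ground truth."""
    target = [w * p for p in teacher_post]
    target[gt_index] += 1.0 - w
    return target


def its_weight(teacher_post, gt_index, w=0.5):
    """IT/S: one global weight, independent of the decoder step."""
    return w


def cts_weight(teacher_post, gt_index):
    """CT/S: trust the teacher fully if its top token is correct, else not at all."""
    best = max(range(len(teacher_post)), key=teacher_post.__getitem__)
    return 1.0 if best == gt_index else 0.0


def ats_weight(teacher_post, gt_index, lam1=1.0, lam2=1.0):
    """AT/S: weight grows smoothly with the teacher's correct-token posterior."""
    p = teacher_post[gt_index]
    c = p ** lam1
    d = (1.0 - p) ** lam2
    return c / (c + d)


teacher_post = [0.5, 0.3, 0.2]  # hypothetical step where the teacher is wrong
gt_index = 1                    # ground truth is the teacher's second choice
targets = {
    "IT/S": interpolate(teacher_post, gt_index, its_weight(teacher_post, gt_index)),
    "CT/S": interpolate(teacher_post, gt_index, cts_weight(teacher_post, gt_index)),
    "AT/S": interpolate(teacher_post, gt_index, ats_weight(teacher_post, gt_index)),
}
```

In this example CT/S falls back to the pure one-hot label, while AT/S still keeps part of the teacher's distribution over the remaining tokens.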
We adapt a close-talk AED model to the far-field data through various T/S learning methods with parallel close-talk and far-field Microsoft Cortana data for E2E ASR.
5.1. Data Preparation
For both training and adaptation, close-talk data consisting of 3400 hours of Microsoft live US English Cortana utterances are collected through a number of deployed speech services including voice search and SMD. We simulate 3400 hours of far-field Microsoft Cortana data by convolving the close-talk signal with different room impulse responses and adding various environmental noise for both training and adaptation. The 3400 hours far-field data is parallel with the 3400 hours close-talk data. We collect 17.5k far-field utterances (about 19 hours) from Harman Kardon (HK) speaker as the test set.
80-dimensional log Mel filter bank features are extracted from the training, adaptation and test speech every 10 ms over a 25 ms window. We stack 3 consecutive frames and stride the stacked frame by 30 ms to form a sequence of 240-dimensional input speech frames. We first generate 34k mixed units consisting of words and multi-letter units as in [32] based on the training transcription and then tokenize the training and adaptation transcriptions correspondingly. We insert a special token <space> between every two adjacent words to indicate the word boundary and add <sos>, <eos> to the beginning and end of each utterance, respectively.
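The frame stacking described above can be sketched in Python as follows. The function name `stack_frames` and the handling of leftover frames at the end of an utterance (they are dropped here) are assumptions, since the text does not specify them.

```python
def stack_frames(frames, n_stack=3, stride=3):
    """Concatenate n_stack consecutive frames, advancing by stride frames.

    With a 10 ms frame shift, stride=3 corresponds to a 30 ms shift of
    the stacked frames. Trailing frames that do not fill a complete
    stack are dropped (an assumption of this sketch).
    """
    stacked = []
    for start in range(0, len(frames) - n_stack + 1, stride):
        merged = []
        for frame in frames[start:start + n_stack]:
            merged.extend(frame)
        stacked.append(merged)
    return stacked


# Ten dummy 80-dimensional frames standing in for log Mel features.
frames = [[float(i)] * 80 for i in range(10)]
stacked = stack_frames(frames)
```

Each stacked frame has 3 x 80 = 240 dimensions, and the sequence length is reduced by roughly a factor of three.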
5.2. AED Baseline System
We first train an AED model predicting 34k mixed units with 3400 hours close-talk training data and its ground-truth labels for E2E ASR as in [33, 34, 35]. The encoder is a bi-directional gated recurrent unit (GRU)-recurrent neural network (RNN) [28, 36] with 6 hidden layers, each with 512 hidden units. We use GRU instead of long short-term memory (LSTM) [37, 38] for RNN because it has fewer parameters and is trained faster than LSTM with no loss of performance. Layer normalization [39] is applied for each encoder hidden layer. Each mixed unit is represented as a 512-dimensional embedding vector. The decoder is a uni-directional GRU-RNN with 2 hidden layers, each with 512 hidden units. The 34k-dimensional output layer of the decoder predicts the posteriors of all the mixed units in the vocabulary. During training, scheduled sampling [40] is applied to the decoder with a sampling probability starting at 0.0 and gradually increasing to 0.4 [30]. Dropout [41] with a probability of 0.1 is used in both encoder and decoder. A label-smoothed cross-entropy [42] loss is minimized during training. Greedy decoding is performed to generate the ASR transcription. We use the PyTorch [43] toolkit for the experiments. Table 1 shows that the close-talk AED model achieves 7.58% and 17.39% WERs on a close-talk Cortana test set used in [34] and the far-field HK speaker test set, respectively.
Using the well-trained close-talk AED as the initialization, we then train a far-field AED with 3400 hours far-field data and its ground-truth labels by following the same procedure. When evaluated on the HK speaker test set, the baseline far-field AED achieves 13.93% WER for ASR as in Table 1.
Table 1. The ASR WER (%) of far-field AEDs trained with CE and AED models adapted by various T/S learning methods to 3400 hours far-field Microsoft Cortana data for E2E ASR on HK speaker test set. “Seq T/S” stands for sequence-level T/S and WERR (%) represents relative WER reduction.
5.3. Unsupervised Domain Adaptation with T/S Learning
We adapt the close-talk baseline AED to the 3400 hours far-field data using token and sequence level T/S learning as discussed in Section 3. To achieve this, we feed the 3400 hours close-talk adaptation data as the input to the teacher AED and the 3400 hours parallel far-field adaptation data as the input to the student AED. The student AED conditions its decoder on one-best token sequences generated by the teacher AED through greedy decoding. In token-level T/S, the soft posteriors generated by the teacher serve as the training targets of the student, while in sequence-level T/S, the one-best sequences decoded by the teacher are used as the targets.
As shown in Table 1, the token-level T/S achieves 13.06% WER on the HK speaker test set, which is 24.9% and 6.25% relative improvements over the close-talk and far-field AED models, respectively. The sequence-level T/S achieves 14.00% WER, which is 19.5% relative improvement over the close-talk AED model. The sequence-level T/S performs slightly worse than the far-field AED trained with ground-truth labels because the one-best decoding from the teacher AED is not always reliable enough to serve as the training target for the student model. The sequence-level T/S can be improved by using multiple decoded hypotheses generated by the teacher AED as the training targets as in [26, 27]. We did not perform N-best decoding because it would drastically increase the computational cost and consume much more adaptation time than the other T/S methods. The 6.7% relative WER gain obtained by token-level T/S over sequence-level T/S shows the benefit of using soft posteriors generated by the teacher AED as the training target at each decoder step when a reliable ground-truth transcription is not available.
The 6.3% relative WER gain of token-level T/S over the far-field AED baseline shows that the unsupervised T/S learning with no ground-truth labels can significantly outperform the supervised domain adaptation with such information available. Compared to the one-hot labels, the soft posteriors accurately model the inherent statistical relationships among different token classes in addition to the token identity encoded by a one-hot vector. They prove to be a more powerful target for the student to learn from, which is consistent with what was observed in [18, 19, 20, 21, 22].
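The relative gains quoted in this subsection follow from the absolute WERs by the usual relative WER reduction formula, (baseline WER - new WER) / baseline WER. The short Python sketch below reproduces them; the function name `werr` is a hypothetical helper and the WER values are taken from the text.

```python
def werr(baseline_wer, new_wer):
    """Relative WER reduction in percent."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


close_talk, far_field = 17.39, 13.93  # baseline AED WERs (%)
token_ts, seq_ts = 13.06, 14.00       # T/S-adapted AED WERs (%)

gains = {
    "token T/S vs close-talk": werr(close_talk, token_ts),
    "token T/S vs far-field": werr(far_field, token_ts),
    "seq T/S vs close-talk": werr(close_talk, seq_ts),
    "token T/S vs seq T/S": werr(seq_ts, token_ts),
}
```

Rounded to one decimal place, these values are 24.9, 6.2, 19.5 and 6.7; the far-field figure is 6.25 at two decimals, which the text also quotes as 6.3%.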
5.4. Supervised Domain Adaptation with AT/S Learning
As discussed in Section 4, we want to further improve the T/S learning by using one-hot ground-truth labels when they are available. As in [23], we perform IT/S learning for supervised domain adaptation by using the linear interpolation of soft posterior and one-hot ground truth as the training target of the student. The interpolation weights are globally fixed at 0.5 and 0.5 for all decoder steps. By following [21], we also conduct CT/S for supervised domain adaptation, where soft posteriors are used as the training target of the student if the teacher’s prediction is correct at the current decoder step; otherwise the one-hot ground truth is used as the target. Finally, AT/S domain adaptation is performed by adaptively adjusting the weights assigned to the soft and one-hot labels at each decoder step as in Eqs. (9) to (11). We explore using different power functions as $f_1$ and $f_2$ to compute the confidence scores by adjusting their exponents $\lambda_1$ and $\lambda_2$. For all the above supervised T/S learning methods, the 3400 hours close-talk and 3400 hours far-field parallel adaptation data are fed as the input to the teacher and student AEDs, respectively.
As shown in Table 1, IT/S with w = 0.5 achieves 12.95% WER on the HK speaker test set, which is 25.5%, 7.0% and 0.8% relative improvements over the close-talk, far-field and token-level T/S adapted AED models, respectively. With a 12.82% WER, CT/S relatively improves the close-talk, far-field and token-level T/S adapted AED models by 26.3%, 8.0% and 1.8%, respectively. Among the different exponent settings for AT/S, the best WER is 12.49%, which is 28.2%, 10.3% and 4.4% relative gains over close-talk, far-field and token-level T/S adapted AEDs. The minimum WER is reached when
and
. Compared to larger exponents, AT/S works better with exponents in $(0, 1)$, for which the confidence scores $c_l$ and $d_l$ are both concave functions of the correct token posterior and the sum of incorrect token posteriors, respectively. IT/S, CT/S and AT/S all outperform the unsupervised T/S learning, indicating that the one-hot ground truth can further improve T/S domain adaptation when it is properly used. AT/S achieves the largest gain among the supervised domain adaptation methods, showing the superiority of adaptively extracting useful knowledge from both the soft and one-hot labels depending on their confidence scores.
In this paper, we extend T/S learning to unsupervised domain adaptation of AED models for E2E ASR. T/S learning requires only unlabeled parallel source and target domain data as the input to the teacher and student AEDs, respectively. In T/S, the student AED conditions its decoder on the one-best token sequences generated by the teacher. The teacher’s soft posteriors and decoded one-hot tokens are used as the training target of the student AED for token-level and sequence-level T/S learning, respectively.
For supervised domain adaptation, we propose adaptive T/S learning in which the student always learns from a linear combination of the teacher’s soft posteriors and the one-hot ground truth. The combination weights are adaptively computed at each decoder step based on the confidence scores on both knowledge sources.
Domain adaptation is conducted on 3400 hours close-talk and 3400 hours far-field Microsoft Cortana data. Token-level T/S achieves 6.3% relative WER improvement over the baseline far-field AED model trained with CE criterion. By making use of the ground-truth labels, AT/S further improves the token-level T/S by 4.4% relative and achieves a total 10.3% relative gain over the far-field AED. AT/S also consistently outperforms IT/S and CT/S showing the advantage of learning from both the teacher and the ground truth as well as the adaptive adjustment of the combination weights.
[1] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in ICML. ACM, 2006.
[2] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in International Conference on Machine Learning, 2014, pp. 1764–1772.
[3] A. Graves, “Sequence transduction with recurrent neural networks,” arXiv preprint arXiv:1211.3711, 2012.
[4] J. K. Chorowski, D. Bahdanau, D. Serdyuk et al., “Attention-based models for speech recognition,” in NIPS, 2015, pp. 577–585.
[5] D. Bahdanau, J. Chorowski, D. Serdyuk et al., “End-to-end attention-based large vocabulary speech recognition,” in Proc. ICASSP. IEEE, 2016, pp. 4945–4949.
[6] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Proc. ICASSP. IEEE, 2016, pp. 4960–4964.
[7] S. Sun, B. Zhang, L. Xie et al., “An unsupervised deep domain adaptation approach for robust speech recognition,” Neurocomputing, vol. 257, pp. 79–87, 2017.
[8] Z. Meng, Z. Chen, V. Mazalov, J. Li, and Y. Gong, “Unsupervised adaptation with domain separation networks for robust speech recognition,” in Proc. ASRU, 2017.
[9] Z. Meng, J. Li, and Y. Gong, “Adversarial speaker adaptation,” in Proc. ICASSP, 2019.
[10] ——, “Cycle-consistent speech enhancement,” Interspeech, 2018.
[11] ——, “Adversarial feature-mapping for speech enhancement,” Interspeech, 2018.
[12] Z. Meng, J. Li, Y. Gong, and B.-H. Juang, “Adversarial teacher-student learning for unsupervised domain adaptation,” in Proc. ICASSP. IEEE, 2018, pp. 5949–5953.
[13] Y. Shinohara, “Adversarial multi-task learning of deep neural networks for robust speech recognition.” in INTERSPEECH, 2016, pp. 2369–2372.
[14] Z. Meng, J. Li, and Y. Gong, “Attentive adversarial learning for domain-invariant training,” in Proc. ICASSP, 2019.
[15] Z. Meng, J. Li, Z. Chen et al., “Speaker-invariant training via adversarial learning,” in Proc. ICASSP, 2018.
[16] G. Saon, G. Kurata, T. Sercu et al., “English conversational telephone speech recognition by humans and machines,” arXiv preprint arXiv:1703.02136, 2017.
[17] J. Li, R. Zhao, J.-T. Huang, and Y. Gong, “Learning small-size DNN with output-distribution-based criteria,” in Proc. INTERSPEECH, 2014, pp. 1910–1914.
[18] J. Li, M. L. Seltzer, X. Wang et al., “Large-scale domain adaptation via teacher-student learning,” in Proc. INTERSPEECH, 2017.
[19] S. Watanabe, T. Hori, J. L. Roux, and J. Hershey, “Student-teacher network learning with enhanced features,” in Proc. ICASSP, 2017.
[20] J. Li, R. Zhao, Z. Chen et al., “Developing far-field speaker system via teacher-student learning,” in Proc. ICASSP, 2018.
[21] Z. Meng, J. Li, Y. Zhao, and Y. Gong, “Conditional teacher-student learning,” in Proc. ICASSP, 2019.
[22] T. Asami, R. Masumura, Y. Yamaguchi, H. Masataki, and Y. Aono, “Domain adaptation of dnn acoustic models using knowledge distillation,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 5185–5189.
[23] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
[24] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, “The third CHiME speech separation and recognition challenge: Dataset, task and baselines,” in Proc. ASRU, 2015, pp. 504–511.
[25] Y. Kim and A. M. Rush, “Sequence-level knowledge distillation,” EMNLP, 2016.
[26] R. M. Munim, N. Inoue, and K. Shinoda, “Sequence-level knowledge distillation for model compression of attention-based sequence-to-sequence speech recognition,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6151–6155.
[27] R. Pang, T. Sainath, R. Prabhavalkar et al., “Compression of end-to-end models,” in Proc. Interspeech 2018, 2018, pp. 27–31.
[28] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder-decoder approaches,” arXiv preprint arXiv:1409.1259, 2014.
[29] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
[30] C.-C. Chiu, T. N. Sainath, Y. Wu et al., “State-of-the-art speech recognition with sequence-to-sequence models,” in Proc. ICASSP. IEEE, 2018, pp. 4774–4778.
[31] J. H. Wong and M. J. Gales, “Sequence student-teacher training of deep neural networks,” 2016.
[32] J. Li, G. Ye, A. Das et al., “Advancing acoustic-to-word ctc model,” in Proc. ICASSP. IEEE, 2018, pp. 5794–5798.
[33] Z. Meng, Y. Gaur, J. Li, and Y. Gong, “Character-aware attention-based end-to-end speech recognition,” in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2019.
[34] ——, “Speaker adaptation for attention-based end-to-end speech recognition,” Proc. Interspeech, 2019.
[35] Y. Gaur, J. Li, Z. Meng, and Y. Gong, “Acoustic-to-phrase end-to-end speech recognition,” in submitted to INTERSPEECH 2019. IEEE, 2019.
[36] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
[37] H. Erdogan, T. Hayashi, J. R. Hershey et al., “Multichannel speech recognition: Lstms all the way through,” in CHiME-4 workshop, 2016, pp. 1–4.
[38] Z. Meng, S. Watanabe, J. R. Hershey et al., “Deep long short-term memory adaptive beamforming networks for multichannel robust speech recognition,” in ICASSP. IEEE, 2017, pp. 271–275.
[39] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
[40] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, “Scheduled sampling for sequence prediction with recurrent neural networks,” in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 1171–1179.
[41] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[42] J. Chorowski and N. Jaitly, “Towards better decoding and language model integration in sequence to sequence models,” CoRR, vol. abs/1612.02695, 2016. [Online]. Available: http://arxiv.org/abs/1612.02695
[43] A. Paszke, S. Gross, S. Chintala et al., “Automatic differentiation in pytorch,” 2017.