Keyword detection has become an important front-end service for ASR-based assistant interfaces (e.g., Hey Google, Alexa, Hey Siri). As assistant technology spreads to more ubiquitous use cases (mobile, IoT), reducing resource consumption (memory and computation) while improving accuracy has been the key success criterion for keyword spotting techniques.
Following the successes in general ASR [2, 3], neural network based approaches have been extensively explored in the keyword spotting area, with the benefits of lower resource requirements and improved accuracy [4, 5, 6, 7, 8, 9, 10, 11]. Such works include DNN + temporal integration [4, 5, 11, 12] and HMM + DNN hybrid approaches [6, 7, 8, 9, 10]. Recently introduced end-to-end trainable DNN approaches [1, 13] further improved accuracy and lowered resource requirements using highly optimizable system designs.
In general, training such DNN-based systems has required frame-level labels generated by LVCSR systems [14, 1]. These approaches make an end-to-end optimizable keyword spotting system depend on labels generated by a non-end-to-end system trained for a different task. However, for keyword spotting, the exact position of the keyword is not as relevant as its presence. Therefore, such strict dependency on frame-level labels may limit the further optimization promised by the end-to-end approach. In [1], the top-level loss is derived by integrating frame-level losses, which are computed using frame-level labels from LVCSR. Integrating frame-level losses penalizes correct predictions that are slightly misaligned, which can limit detection accuracy, especially for difficult data (e.g., noisy or accented speech) where LVCSR labels may have higher-than-normal uncertainty. In such cases, the loss can be fully minimized only when both the predicted value and its position in time match those of the provided frame-level labels, even though an exact position match is not highly relevant for high accuracy.
Prior work on CTC training [15] or sequence-to-sequence training [16] does not require frame-level alignment information. However, those approaches need to train full-sized encoders, which require fully transcribed speech data. The work in [17] proposed a max pooling loss, which does not depend on phoneme-level alignment information, but its application is limited to decoder-level training.
Fig. 1. End-to-end topology trained to predict the keyword likelihood score. Bottleneck layers reduce parameters and computation. The intermediate softmax is used in encoder+decoder training only.
In this paper, we propose a new smoothed max pooling loss for training an end-to-end keyword spotting system. The new loss function reduces dependence on dense labels from LVCSR. Further, the new loss function jointly trains an encoder (detecting keyword parts) and a decoder (detecting the whole keyword). With it, one can train models to generate stable activations for a target pattern even when the exact location of the target is not specified. We describe the details of the proposed method in Section 2. We then present the experimental setup in Section 3 and the results in Section 4. We conclude with a discussion in Section 5.
The proposed model uses the same encoder/decoder structure as [1] (Fig. 1), but differs in that the encoder and decoder models are trained simultaneously using the smoothed max pooling loss. In [1], both encoder and decoder models are trained with cross entropy (CE) loss using frame-level labels. In the proposed approach, we define losses for the encoder and decoder using the smoothed max pooling loss, and optimize the combination of the two losses simultaneously. The proposed smoothed max pooling loss does not strictly depend on phoneme-level alignment, allowing better optimization than the baseline.
2.1. Baseline end-to-end keyword spotting model
Both the baseline and the proposed model have an encoder that takes spectral-domain features as input and generates (K+1) outputs corresponding to phoneme-like sound units (Fig. 1). The decoder model takes the encoder output as input and generates a binary output that predicts the existence of a keyword. The model is fed with acoustic input features at each frame (generated every 10 ms) and generates prediction labels at each frame in a streaming manner. In [1], the encoder model is trained first, and then the decoder model is trained while the encoder model weights are frozen.
In [1], the encoder model is trained to predict phoneme-level labels provided by LVCSR. Both the encoder and decoder models use the CE loss defined in Eqs. (1) and (2), which is expressed in terms of the d-dimensional spectral feature at each frame, the ith dimension of the network's softmax output, the network weight W, and the frame-level label at frame t.
In [1], the target label sequence consists of intervals of repeated labels, which we call runs. These label runs specify well-defined intervals in which a model should learn to generate strong activation in the labeled output dimension. While such model behavior can be trained end-to-end, the labels need to be provided by an LVCSR system, which is typically a non-end-to-end system [2]. The timing and accuracy of the labels from the LVCSR system can limit the accuracy of the trained model.
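To illustrate why a frame-level CE loss is sensitive to label timing, the following is a minimal sketch and not the implementation of [1]. It assumes per-frame softmax outputs held in a NumPy array and integer frame labels; the function name `frame_ce_loss` and the toy data are hypothetical.

```python
import numpy as np

def frame_ce_loss(probs, labels):
    """Sum of per-frame cross-entropy terms.

    probs:  (T, C) array of softmax outputs, one row per frame.
    labels: (T,) integer array of frame-level labels (runs of repeated labels).
    """
    num_frames = probs.shape[0]
    picked = probs[np.arange(num_frames), labels]  # probability of the labeled class
    return float(-np.log(picked + 1e-12).sum())

# Toy data (hypothetical): 6 frames, 2 classes, a run of class 1 on frames 2..3.
labels = np.array([0, 0, 1, 1, 0, 0])
aligned = np.full((6, 2), 0.05)
aligned[np.arange(6), labels] = 0.95       # confident and exactly on time
shifted = np.roll(aligned, 1, axis=0)      # same activation, one frame late

loss_aligned = frame_ce_loss(aligned, labels)
loss_shifted = frame_ce_loss(shifted, labels)
```

In this toy case the one-frame shift leaves the shape of the activation unchanged, yet `loss_shifted` is larger than `loss_aligned` because two frames now disagree with the label run.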
2.2. Smoothed Max Pooling Loss
Instead of an interval-based CE loss, we propose a temporal max pooling operation that avoids specifying the exact activation position (timing) through supervised labels. We also propose applying temporal smoothing to the logits of frames before the max pooling operation. The work in [17] also explores a max pooling loss, where one specifies a max pooling window in the time domain and computes the CE loss only with the logit of the frame with maximum activation. However, with such a simple max pooling loss, the learned activation tends to resemble a delta function, whose peak values tend to be unstable under small variations and temporal shifts of the audio. By introducing temporal smoothing on the logits before max pooling, the model learns temporally smooth activations and stable peak values. Eqs. (3) to (7) define the smoothed max pooling loss.
Here, s(t) is a smoothing filter that is applied by convolution over time, each max pooling window i is defined by its own time interval, and the frames not included in any of the max pooling windows form a separate set.
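The structure of such a loss can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the smoothing is applied to the logits of every output dimension, the positive term takes the negative log-probability at the frame of maximum smoothed activation in each window, and frames outside all windows are assumed to be trained toward a background class (index 0) with ordinary CE. All function and variable names are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def smooth_over_time(logits, kernel):
    """Convolve every output dimension with the smoothing filter along time."""
    cols = [np.convolve(logits[:, c], kernel, mode="same")
            for c in range(logits.shape[1])]
    return np.stack(cols, axis=1)

def smoothed_max_pooling_loss(logits, windows, kernel, background=0):
    """logits: (T, C) array with T >= len(kernel).
    windows: list of (target_dim, start, end) with end exclusive.
    """
    num_frames = logits.shape[0]
    log_p_smooth = log_softmax(smooth_over_time(logits, kernel))
    log_p_raw = log_softmax(logits)

    in_window = np.zeros(num_frames, dtype=bool)
    positive = 0.0
    for target, start, end in windows:
        in_window[start:end] = True
        # Only the frame with the maximum smoothed activation contributes.
        positive += -log_p_smooth[start:end, target].max()

    # Assumption: frames outside every window are pushed toward the background class.
    negative = -log_p_raw[~in_window, background].sum()
    return float(positive + negative)

# Toy data (hypothetical): 50 frames, 2 outputs, a short burst on output 1.
logits = np.zeros((50, 2))
logits[:, 0] = 4.0
logits[20:26, 0] = 0.0
logits[20:26, 1] = 4.0
kernel = np.array([0.25, 0.5, 0.25])
windows = [(1, 10, 40)]

loss_here = smoothed_max_pooling_loss(logits, windows, kernel)
loss_moved = smoothed_max_pooling_loss(np.roll(logits, 5, axis=0), windows, kernel)
loss_outside = smoothed_max_pooling_loss(np.roll(logits, 25, axis=0), windows, kernel)
```

In this toy example, moving the burst by five frames inside the window leaves the loss unchanged, whereas moving it outside the window increases the loss, which is the timing tolerance the text describes.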
2.2.1. Smoothed Max Pooling Loss for Decoder
In our proposed approach, the decoder submodel is trained to generate strong activation on output dimension 1 near the end of the keyword. Eqs. (3) to (9) define the loss for the decoder submodel, where the window offset and window size are tunable parameters and W_end is the end-point of the expected keyword interval. Due to the nature of max pooling, the max pooling loss values are not sensitive to the exact value of W_end as long as the window includes the actual end-point of the keyword. By defining the interval to be long enough, the model can learn the optimal position of the strongest activation in a semi-supervised manner. For the current work, we used word-level alignment from [2] to get W_end, but it can also be computed from the output of an existing detection model such as [1]. Fig. 2 (a) and (c) visualize the relationship between the keyword and the decoder pooling window.
Fig. 2. Relationship between max pooling windows and keyword endpoint. Row (a) shows an example of keyword audio length and endpoint W_end. Row (b) shows the encoder max pooling windows and expected activations. Row (c) shows the decoder max pooling window and expected activation. Row (d) shows the length of observable context for the encoder and decoder.
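As an illustration of how a decoder pooling window could be derived from the keyword end-point, here is a minimal sketch. The window definition (start = end-point + offset, end = start + size), the negative offset, and all names and numeric values are assumptions for illustration and are not taken from the paper's equations.

```python
def decoder_pooling_window(keyword_end, offset, win_size, num_frames):
    """Return [start, end) of the decoder max pooling window, clipped to the utterance."""
    start = keyword_end + offset
    end = start + win_size
    return max(0, start), min(num_frames, end)

# Hypothetical values: a 60-frame window that starts 40 frames before the end-point.
start, end = decoder_pooling_window(keyword_end=120, offset=-40, win_size=60,
                                    num_frames=300)

# An end-point label that is off by 15 frames still falls inside the same window,
# so the max pooling loss would be computed over the same frames.
contains_true_end = start <= 120 + 15 < end
```

A wide window trades label precision for robustness: the wider it is, the less the loss depends on where exactly the alignment places the end-point.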
2.2.2. Smoothed Max Pooling Loss for Encoder
Unlike [17], where only the decoder-level output is trained with max pooling, we propose training the encoder-level output with smoothed max pooling as well. In our method, the encoder model learns a sequence of sound parts that constitute a keyword in a semi-supervised manner. This is done by placing K max pooling windows sequentially over the expected keyword location and defining a max pooling loss at each window (Fig. 2(b)). The number of windows (n = K) and the window size are tuned such that K approximates the number of distinguishable sound parts (i.e., phonemes) and the windows together approximate the average length of the keyword. Eqs. (3) to (7) and (10) to (11) define this encoder loss, where the encoder window offset and window size are also tunable parameters. Fig. 2 (a) and (b) show the relationship between the expected keyword and the pooling windows. Both the encoder and the decoder models are trained jointly using the loss in (12), in which a tunable parameter controls the relative importance of each loss.
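The sequential window placement and the joint objective can be sketched as follows. This is a hedged illustration only: the mapping of window i to encoder output i (with output 0 reserved for background), the weighted-sum form of the joint loss, and all names and values are assumptions rather than the paper's definitions.

```python
def encoder_pooling_windows(keyword_end, offset, win_size, num_windows, stride=None):
    """Place num_windows windows one after another before the keyword end-point.

    Returns a list of (target_dim, start, end) with end exclusive.
    Assumption: window i (1-based) targets encoder output dimension i.
    """
    stride = win_size if stride is None else stride
    windows = []
    for i in range(num_windows):
        start = keyword_end + offset + i * stride
        windows.append((i + 1, start, start + win_size))
    return windows

def joint_loss(encoder_loss, decoder_loss, alpha):
    """Assumed form of the combined objective: a weighted sum of the two losses."""
    return alpha * encoder_loss + decoder_loss

# Hypothetical values: K = 4 windows of 20 frames covering the 80 frames
# that precede the keyword end-point at frame 120.
windows = encoder_pooling_windows(keyword_end=120, offset=-80, win_size=20,
                                  num_windows=4)
total = joint_loss(encoder_loss=2.0, decoder_loss=1.0, alpha=0.5)
```

Setting `alpha` to zero in this sketch removes the encoder term, which corresponds to training the whole network with the decoder loss only.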
We compare the model trained with the new smoothed max pooling loss on the encoder/decoder architecture with the baseline in [1]. Both the baseline and the proposed model have the same architecture. Only the training losses are different. Details of the setup are discussed below.
3.1. Front-end
We used the same front-end feature extraction as the baseline [1] in our experiments. At each frame, the front-end extracts a 40-d feature vector of log-mel filter-bank energies and stacks these vectors to generate the input feature vector. Refer to [1] for further details.
3.2. Model setup
We selected the E2E 318K architecture in [1] as the baseline and use the same structure for testing all other models. As shown in Fig. 1, the model has 7 SVDF layers and 3 linear bottleneck dense layers. For detailed architectural parameters, please refer to [1]. We call the baseline model Baseline CE CE, where the encoder and decoder submodels are trained with CE loss. We call the proposed model Max4 SMP SMP, where both the encoder and decoder submodels are trained with SMP (smoothed max pooling) loss.
We also performed an ablation study by testing other models that use different losses. Table 1 summarizes all the tested models. Models Max1–Max3 use SMP loss for the decoder, but use different losses for the encoder. Max1 CTC SMP used CTC loss to train the encoder. The standard CTC loss function from TensorFlow [18] was used. CTC loss does not need alignments, but it learns peaky activations whose peak values are not highly stable. Max2 NA SMP has no encoder loss, so that the entire network is trained by the decoder loss only. Max3 CE SMP used the baseline CE loss for the encoder. Models Max4–Max7 are tested to measure the importance of the smoothing operation. MP means max pooling without smoothing (i.e. s(t) = 1).
Table 1. Summary of the various models tested (columns: model, encoder loss, decoder loss).
For the decoder SMP (smoothed max pooling) loss, we used a truncated Gaussian as the smoothing filter, with a width parameter equivalent to 90 ms and a truncated length of 21 frames. A max pooling window of 60 frames (600 ms) with an offset equivalent to 400 ms is used. For the encoder SMP loss, we used a truncated Gaussian with its own width parameter and a truncated length of 9 frames. Encoder max pooling windows have a size of 20 frames with their own offset. These windows are placed sequentially at 40-frame intervals.
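A truncated Gaussian smoothing filter of the kind described above could be built as in the following sketch. Reading the 90 ms figure as a standard deviation of 9 frames (at 10 ms per frame) and normalising the filter to sum to one are both assumptions; the function name is hypothetical.

```python
import numpy as np

def truncated_gaussian(sigma, length):
    """Gaussian sampled on `length` frames centred at zero, normalised to sum to 1."""
    if length % 2 == 0:
        raise ValueError("length must be odd so that the filter is symmetric")
    half = length // 2
    t = np.arange(-half, half + 1)
    kernel = np.exp(-0.5 * (t / sigma) ** 2)
    return kernel / kernel.sum()

# Assumed decoder setting: sigma of 9 frames, truncated to 21 frames.
decoder_filter = truncated_gaussian(sigma=9.0, length=21)
```

With a unit-sum filter, smoothing leaves a constant activation unchanged, so the smoothed and unsmoothed variants stay on a comparable scale.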
3.3. Dataset
The training data consists of 2.1 million anonymized utterances with the keywords Ok Google and Hey Google. Data augmentation similar to [1] has been used for better robustness.
Evaluation is done on four data sets separate from the training data, representing diverse environmental conditions. The Clean non-accented set contains 170K non-accented English utterances of the keywords in quiet conditions. The Clean accented set has 138K English utterances of the keywords with Australian, British, and Indian accents in quiet conditions. The Query logs set contains 58K utterances from anonymized voice search queries. The In-vehicle set has 144K utterances with the keywords recorded inside cars while driving, which include a significant amount of noise from road, engine, and fans. All sets are augmented with 64K negative utterances, which are re-recorded TV noise.
To show the effectiveness of the proposed approach, we evaluated the false-reject (FR) and false-accept (FA) tradeoff across the various models described in Section 3. All models are converted to inference models using TensorFlow Lite's quantization [19].
Table 2 summarizes the FR rates of the models in Figs. 3 and 4 at a selected FA rate (0.1 FA per hour, measured on the 64K re-recorded TV noise set). Fig. 3 shows the ROC curves of various models (baseline, Max1–Max4) across different conditions. Fig. 4 shows the ROC curves of the Max4–Max7 models across different conditions.
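For readers unfamiliar with this operating point, the following sketch shows one way an FR rate at a fixed FA-per-hour budget can be computed from detection scores. It is not the evaluation code used in the paper; the function name, the strict "score > threshold" acceptance rule, and the toy scores are assumptions.

```python
import numpy as np

def fr_rate_at_fa_per_hour(pos_scores, neg_scores, neg_hours, target_fa_per_hour):
    """pos_scores: peak score of each keyword utterance.
    neg_scores: peak scores of candidate detections in the negative audio.
    A detection is accepted when its score is strictly above the threshold.
    """
    # Number of false accepts the budget allows over the whole negative set.
    allowed = int(np.floor(target_fa_per_hour * neg_hours + 1e-9))
    neg_sorted = np.sort(np.asarray(neg_scores, dtype=float))[::-1]
    if allowed < len(neg_sorted):
        threshold = neg_sorted[allowed]   # at most `allowed` negatives exceed it
    else:
        threshold = -np.inf               # budget is never exhausted
    rejected = np.asarray(pos_scores, dtype=float) <= threshold
    return float(rejected.mean()), float(threshold)

# Toy scores (hypothetical): 10 hours of negative audio, budget of 0.1 FA per hour.
fr, thr = fr_rate_at_fa_per_hour(
    pos_scores=[0.9, 0.8, 0.25, 0.95],
    neg_scores=[0.5, 0.3, 0.2],
    neg_hours=10.0,
    target_fa_per_hour=0.1,
)
```

Sweeping `target_fa_per_hour` over a range of values and plotting the resulting FR rates gives the kind of FR/FA tradeoff curve referred to in the text.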
Across model types and evaluation conditions, Max4 SMP SMP shows the best accuracy and ROC curve. The Max3 CE SMP model also performs better than the baseline, but not as well as Max4. The other variations, Max2 (decoder loss only) and Max1 (CTC encoder loss), performed worse than the baseline. A comparison among models with max pooling and different smoothing options (Fig. 4) shows that Max4 SMP SMP (smoothed max pooling on both encoder and decoder) performs the best and outperforms Max7 (no smoothing on either the encoder or the decoder max pooling loss). In particular, the proposed Max4 model reduces the FR rate to nearly half that of the baseline in the clean accented and noisy in-vehicle conditions, where it is more difficult to obtain training data with accurate alignments.
Table 2. FR rate of models with various loss types at 0.1 FA/h
We presented a smoothed max pooling loss for training keyword spotting models with improved optimizability. Experiments show that the proposed approach outperforms the baseline model trained with CE loss by 22%–54% relative across a variety of conditions. Further, we show that applying smoothing before max pooling is highly important for achieving accuracy better than the baseline. The proposed approach provides the further benefit of reducing dependence on LVCSR for phoneme-level alignments, which is desirable for embedded learning scenarios such as on-device learning [20][21].
Fig. 3. ROC curves of models with various loss types and conditions
Fig. 4. ROC curves of models with various smoothing options
[1] Raziel Alvarez and Hyun Jin Park, “End-to-end Streaming Keyword Spotting,” ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6336–6340, 2019.
[2] N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “Application of pretrained deep neural networks to large vocabulary speech recognition,” in Proceedings of Interspeech 2012, 2012.
[3] Yanzhang He, Tara N. Sainath, Rohit Prabhavalkar, Ian McGraw, Raziel Alvarez, Ding Zhao, David Rybach, Anjuli Kannan, Yonghui Wu, Ruoming Pang, Qiao Liang, Deepti Bhatia, Yuan Shangguan, Bo Li, Golan Pundak, Khe Chai Sim, Tom Bagby, Shuo-Yiin Chang, Kanishka Rao, and Alexander Gruenstein, “Streaming end-to-end speech recognition for mobile devices,” ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6381–6385, 2018.
[4] R. Prabhavalkar, R. Alvarez, C. Parada, P. Nakkiran, and T. Sainath, “Automatic gain control and multi-style training for robust small-footprint keyword spotting with deep neural networks,” in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 4704–4708.
[5] Siri Team, “Hey Siri: An On-device DNN-powered Voice Trigger for Apple’s Personal Assistant,” https://machinelearning.apple.com/2017/10/01/hey-siri.html, 2017, Accessed: 2018-10-06.
[6] S. Panchapagesan, M. Sun, A. Khare, S. Matsoukas, A. Mandal, B. Hoffmeister, and S. Vitaladevuni, “Multi-task learning and weighted cross-entropy for DNN-based keyword spotting,” in INTERSPEECH, 2016.
[7] M. Sun, D. Snyder, Y. Gao, V. Nagaraja, M. Rodehorst, S. Panchapagesan, N. Strom, S. Matsoukas, and S. Vitaladevuni, “Compressed time delay neural network for small-footprint keyword spotting,” in INTERSPEECH, 2017.
[8] K. Kumatani, S. Panchapagesan, M. Wu, M. Kim, N. Strom, G. Tiwari, and A. Mandal, “Direct modeling of raw audio with DNNs for wake word detection,” 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 252–257, 2017.
[9] J. Guo, K. Kumatani, M. Sun, M. Wu, A. Raju, N. Strom, and A. Mandal, “Time-delayed bottleneck highway networks using a DFT feature for keyword spotting,” in ICASSP, 2018, pp. 5489–5493, IEEE.
[10] M. Wu, S. Panchapagesan, M. Sun, J. Gu, R. Thomas, S.N.P. Vitaladevuni, B. Hoffmeister, and A. Mandal, “Monophone-based background modeling for two-stage on-device wake word detection,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5494–5498.
[11] G. Chen, C. Parada, and G. Heigold, “Small-footprint keyword spotting using deep neural networks,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 4087–4091.
[12] A. Gruenstein, R. Alvarez, C. Thornton, and M. Ghodrat, “A cascade architecture for keyword spotting on mobile devices,” in 31st Conference on Neural Information Processing Systems (NIPS 2017), 2017.
[13] Hanna Mazzawi, Xavi Gonzalvo, Aleks Kracun, Prashant Sridhar, Niranjan Subrahmanya, Ignacio Lopez Moreno, Hyun Jin Park, and Patrick Violette, “Improving keyword spotting and language identification via neural architecture search at scale,” in INTERSPEECH 2019, 2019.
[14] T. Sainath and C. Parada, “Convolutional neural networks for small-footprint keyword spotting.,” in Proceedings of Annual Conference of the International Speech Communication Association (Interspeech), 2015, pp. 1478–1482.
[15] Z. Wang, X. Li, and J. Zhou, “Small-footprint keyword spotting using deep neural network and connectionist temporal classifier,” 09 2017.
[16] Y. He, R. Prabhavalkar, K. Rao, W. Li, A. Bakhtin, and I. McGraw, “Streaming small-footprint keyword spotting using sequence-to-sequence models,” 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 474–481, 2017.
[17] Ming Sun, Anirudh Raju, George Tucker, Sankaran Panchapagesan, Gengshen Fu, Arindam Mandal, Spyridon Matsoukas, Nikko Strom, and Shiv Vitaladevuni, “Max-pooling loss training of long short-term memory networks for small-footprint keyword spotting,” 2016 IEEE Spoken Language Technology Workshop (SLT), pp. 474–480, 2016.
[18] Alex Graves, Santiago Fernández, Faustino J. Gomez, and Jürgen Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in ICML, 2006.
[19] R. Alvarez, R. Krishnamoorthi, S. Sivakumar, Y. Li, A. Chiao, P. Warden, S. Shekhar, S. Sirajuddin, and T. Davis, “Introducing the Model Optimization Toolkit for TensorFlow,” https://medium.com/tensorflow/introducing-the-model-optimization-toolkit-for-tensorflow-254aca1ba0a3, Accessed: 2019-02-17.
[20] Brendan McMahan and Daniel Ramage, “Federated learning: Collaborative machine learning without centralized training data,” 04 2017.
[21] David Leroy, Alice Coucke, Thibaut Lavril, Thibault Gisselbrecht, and Joseph Dureau, “Federated learning for keyword spotting,” ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6341–6345, 2018.