From Audio to Semantics: Approaches to end-to-end spoken language understanding
2018·arXiv
ABSTRACT

Conventional spoken language understanding systems consist of two main components: an automatic speech recognition module that converts audio to a transcript, and a natural language understanding module that transforms the resulting text (or top N hypotheses) into a set of domains, intents, and arguments. These modules are typically optimized independently. In this paper, we formulate audio to semantic understanding as a sequence-to-sequence problem [1]. We propose and compare various encoder-decoder based approaches that optimize both modules jointly, in an end-to-end manner. Evaluations on a real-world task show that 1) having an intermediate text representation is crucial for the quality of the predicted semantics, especially the intent arguments and 2) jointly optimizing the full system improves overall accuracy of prediction. Compared to independently trained models, our best jointly trained model achieves similar domain and intent prediction F1 scores, but improves argument word error rate by 18% relative.

Index Terms: spoken language understanding, sequence-to-sequence, end-to-end training, multi-task learning, speech recognition

1. Introduction

Understanding semantics from a user input or a query is central to any human-computer interface (HCI) that aims to interact naturally with users. Spoken dialogue systems that aim to solve this for specific tasks have been a focus of research for more than two decades [2]. With the widespread adoption of smart devices like Google Home [3], Amazon Alexa, Apple Siri and Microsoft Cortana, spoken language understanding (SLU) is moving to the forefront of HCI.

Typically, SLU involves multiple modules. An automatic speech recognition system (ASR) first transcribes the user query into a transcript. This is then fed to a module that does natural language understanding (NLU). NLU itself involves domain classification, intent detection, and slot filling [2]. In traditional NLU systems, first the high level domain of a transcript is identified. Subsequently, intent detection and slot filling are performed according to the predicted domain’s semantic template. Intent detection identifies the finer-grained intent class a given transcript belongs to. Slot filling, or argument prediction, is the task of extracting semantic components, like the argument values corresponding to the domain. Figure 1 shows example transcripts and their corresponding domain, intent, and arguments. Recent work [4, 5] has shown that jointly optimizing these three tasks improves the overall quality of the NLU component. For conciseness, we use the word semantics to refer to all three of domain, intent and arguments.

image

Fig. 1: Example transcripts and their corresponding domain, intent, and arguments. Only arguments that have corresponding values in a transcript are shown. For example, song name and station name are both arguments in the MEDIA domain but only one has a corresponding value in each of the MEDIA examples.

Even though user interactions in an SLU system start as a voice query, most NLU systems assume that the transcript of the request is available or obtained independently. The NLU module is typically optimized independently of ASR. While the accuracy of ASR systems has improved over the years [6, 7], errors in recognition still worsen NLU performance. This problem is exacerbated on smart devices, where interactions tend to be more conversational. However, not all ASR errors are equally bad for NLU. For most applications, the semantics consist of an action with relevant arguments; a large part of the transcript of the ASR module has no impact on the end result as long as intent classification and predicted arguments are accurate. For example, for a user query, “Set an alarm at two o’clock,” the intent, “alarm,” and its argument, “two o’clock,” are more important than filler words, like “an”. Joint optimization can focus on improving those aspects of transcription accuracy that are aligned with the end goal, whereas independent optimization fails at that objective. Furthermore, for some applications, there are intents that are more naturally predicted from audio than from the transcript. For example, when training an automated assistant, like Google Duplex [8] or an airline travel assistant, it would be useful to identify acoustic events like background noise, music and other non-verbal cues as special intents, and tailor the assistant’s response accordingly to improve the overall user experience. Hence, training various components of the SLU system jointly can be advantageous.

There have been some early attempts at using audio to perform NLU. Domain and intent are predicted directly from audio in [9], and this approach is shown to perform competitively, but worse than predicting from the transcript. Alternatively, using multiple ASR hypotheses [10], the word confusion network [11], or the recognition lattice [12] has been proposed to account for ASR errors, but independent optimization of ASR and NLU can still lead to sub-optimal performance. In [13], an ASR correcting module is trained jointly with the NLU component. To account for ASR errors, multiple ASR hypotheses are generated during training as additional input sequences, which are then error-corrected by the slot-filling model. While the slot-filling module is trained to account for the errors, the ASR module is still trained independently of NLU. Similar to the work in [9], an end-to-end system is proposed in [14] that does intent classification directly from speech, with an intermediate ASR task. But unlike the current work, it uses a connectionist temporal classification (CTC) [15] acoustic model, and only performs intent prediction. In the current work, we show that NLU, i.e., domain, intent, and argument prediction, can be done jointly with ASR starting directly from audio, with a quality of performance that matches or surpasses an independently trained counterpart.

The systems presented in this study are motivated by the encoder-decoder based sequence-to-sequence (Seq2Seq) [16, 17, 1] approach that has been shown to perform well for machine translation [18] and speech recognition tasks [19]. Encoder-decoder based approaches provide an attractive framework for implementing SLU systems, since the attention mechanism allows for jointly learning an alignment while predicting a target sequence that has a many-to-one relationship with its input [20]. Such techniques have already been used in NLU [21], but using ASR transcripts, not audio, as input to the system.

In this work, we present and compare various end-to-end approaches to SLU for jointly predicting semantics from audio. The presented techniques simplify the overall architecture of SLU systems. Using a large training set comparable to what is typically used for building large-vocabulary ASR systems, we show that not only can predicting semantics from audio be competitive, it can in some conditions outperform the conventional two-stage approach. To the best of our knowledge, this is the first study that shows all three of domain, intent, and arguments can be predicted from audio with competitive results.

The rest of the paper is organized as follows. Section 2 presents various models and architectures explored in this work. The experimental setup and results are described in Section 3 and Section 4, respectively. We conclude in Section 5.

2. Models

Our work is based on the encoder-decoder framework augmented by attention. We start by reviewing this framework in Section 2.2. There are multiple ways to model an end-to-end SLU system. One can either predict semantics directly from audio, ignoring the transcript, or have separate modules for predicting the transcript and semantics that are optimized jointly. These different approaches and the corresponding formulations are described in Sections 2.3 to 2.6. Figure 2 shows a schematic representation of these architectures.

2.1. Notation

To review the general encoder-decoder framework, we denote the input and output sequences by $A = \{a_1, \ldots, a_K\}$ and $B = \{b_1, \ldots, b_L\}$, where K and L denote their lengths. In this work, since we start from audio, the input sequence to the model is a sequence of acoustic features (we describe the details of acoustic feature computation in Section 3.3), while the output sequence, depending on the model architecture, may be the transcript, the corresponding semantics, or both. While the semantics of an utterance is best represented as structured data, we use a simple deterministic scheme for serializing it by first including the domain and intent, followed by the argument labels and their values (see Table 1). More details are described in Section 3.2. For the rest of the paper, we denote the input acoustic features by $X = \{x_1, \ldots, x_T\}$, where T stands for the total number of time frames. The transcript is represented as a sequence of graphemes. It is denoted as $W = \{w_1, \ldots, w_N\}$, where N stands for the number of graphemes in the transcript. The semantics sequence is represented by $S = \{s_1, \ldots, s_M\}$, where M stands for the number of tokens. The tokens come from a dictionary consisting of the domain, intent, argument labels, and graphemes to represent the argument values.

image

a. Direct model

image

b. Joint model

image

c. Multitask model

image

d. Multistage model

Fig. 2: Different model architectures investigated in this paper. X stands for acoustic features, W for transcripts and S for semantics (domain, intent and arguments). Dotted lines represent both the conditioning of output label on its history and the attention module, which is treated as a part of the decoder.

2.2. Encoder-decoder framework

Given the training pair (A, B) and model parameters θ, a sequence-to-sequence model computes the conditional probability P(B|A; θ). This can be done by estimating the terms of the probability using the chain rule:

$$P(B \mid A; \theta) = \prod_{i=1}^{L} P(b_i \mid b_1, \ldots, b_{i-1}, A; \theta) \tag{1}$$

Table 1: Example transcripts and their corresponding serialized semantics.

image

The parameters of the model are learned by maximizing the conditional probabilities for the training data:

$$\theta^{*} = \arg\max_{\theta} \sum_{(A, B)} \log P(B \mid A; \theta) \tag{2}$$

In the encoder-decoder framework [16, 1], the model is parameterized as a neural network, most commonly a recurrent neural network, consisting of two main parts: an encoder that receives the input sequence and encodes it into a higher level representation, and a decoder that generates the output from this representation after first being fed a special start-of-sequence symbol. Decoding terminates when the decoder emits the special end-of-sequence symbol. The modeling power of the encoder-decoder framework has been improved by the addition of an attention mechanism [17]. This mechanism was introduced to overcome the bottleneck of having to encode the entire variable length input sequence in a single vector. At each output step, the decoder’s last hidden state is used to generate an attention vector over the entire encoded input sequence, which is used to summarize and propagate the needed information from the encoder to the decoder at every output step. In this work, we use multi-headed attention [22] that allows the decoder to focus on multiple parts of the input when generating each output. The effectiveness of this type of attention for ASR was explored and verified in [23].
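To make the attention step concrete, the following is a minimal sketch of one decoder step of additive attention with several heads, written in plain Python. It is an illustration under stated assumptions, not the implementation used in this work: the function names, the parameter shapes (`W_enc`, `W_dec`, `v`), and the choice to let every head attend over the full encoder state and concatenate the per-head contexts are all hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def additive_attention(encoder_states, decoder_state, W_enc, W_dec, v):
    """One additive attention head: returns (weights over time, context vector)."""
    query = matvec(W_dec, decoder_state)
    scores = []
    for h in encoder_states:
        key = matvec(W_enc, h)
        # score_t = v . tanh(W_enc h_t + W_dec s)
        scores.append(sum(vi * math.tanh(k + q) for vi, k, q in zip(v, key, query)))
    # Softmax over time steps (shifted by the max for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(encoder_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, encoder_states)) for d in range(dim)]
    return weights, context


def multi_head_attention(encoder_states, decoder_state, heads):
    """Run each head with its own (assumed) parameters and concatenate the contexts."""
    context = []
    for W_enc, W_dec, v in heads:
        _, head_context = additive_attention(encoder_states, decoder_state, W_enc, W_dec, v)
        context.extend(head_context)
    return context


# Toy example: 3 encoded frames of dimension 2, and 2 heads with made-up parameters.
identity = [[1.0, 0.0], [0.0, 1.0]]
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
heads = [(identity, identity, [1.0, 1.0]), (identity, identity, [0.5, -0.5])]
context = multi_head_attention(states, [0.5, -0.5], heads)
print(len(context))  # 4
```

Each head produces its own distribution over the encoded frames, which is what lets the decoder focus on more than one part of the input at a single output step.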

2.3. Direct model

In the direct model the semantics of an utterance are directly predicted from the audio. The model does not learn to fully transcribe the input audio; it learns to only transcribe parts of the transcript that appear as argument values. Conceptually, this is the simplest formulation for end-to-end semantics prediction. But it also makes the task challenging, since the model has to implicitly learn to ignore the parts of the transcript that are not part of an argument, and the corresponding audio, while also inferring the domain and intent in the process.

Following the notation introduced in Section 2.2, the model directly computes  P(S|X; θ), as in Equation 1. The encoder takes the acoustic features, X, as input and the decoder generates the semantic sequence, S.

2.4. Joint model

This model still consists of an encoder and a decoder, similar to the direct model, but the decoder generates the transcript followed by domain, intent, and arguments. The output of this model is thus the concatenation of transcript and its corresponding semantics: [W : S] where [:] denotes concatenation of the first and the second sequence.

This formulation conditions intent and argument prediction on the transcript:

$$P(S, W \mid X; \theta) = P(W \mid X; \theta)\, P(S \mid W, X; \theta) \tag{3}$$

This model retains the simplicity of the direct model, while simultaneously making learning easier by introducing an intermediate transcript representation corresponding to the input audio.

2.5. Multitask model

Multitask learning [24] (MTL) is a widely used technique when learning related tasks, typically with limited data. Related tasks act as inductive bias, improving generalization of the main task by choosing parameters that are optimal for all tasks. Although predicting the text transcript is not necessary for domain, intent and argument prediction, it is a natural secondary task that can potentially offer a strong inductive bias while learning. In MTL, we factorize P(S, W|X; θ) as:

$$P(S, W \mid X; \theta) = P(S \mid X; \theta)\, P(W \mid X; \theta) \tag{4}$$

In the case of neural nets, multitask learning is typically done by sharing hidden representations between tasks [25]. In this work, we do this by sharing the encoder and having separate decoders for predicting transcripts and semantics. We then learn parameters that optimize both tasks:

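$$\theta^{*} = \arg\max_{\theta} \sum_{(X, W, S)} \left[ \log P(W \mid X; \theta_e, \theta_d^W) + \log P(S \mid X; \theta_e, \theta_d^S) \right] \tag{5}$$

where $\theta = (\theta_e, \theta_d^W, \theta_d^S)$, and $\theta_e$, $\theta_d^W$, $\theta_d^S$ are the parameters of the shared encoder, the decoder that predicts the transcript, and the decoder that predicts semantics, respectively. The shared encoder learns representations that enable both transcript and semantics prediction.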
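As a rough illustration of optimizing both tasks at once, the sketch below computes a per-example multitask loss as the sum of the negative log-likelihoods assigned by the two decoders. The helper names and the equal weighting of the two terms are assumptions made for this example; the per-step output distributions stand in for the softmax outputs of the transcript and semantics decoders, which would both read from the shared encoder.

```python
import math


def sequence_nll(step_probs, targets):
    """Negative log-likelihood of a target sequence.

    step_probs[i] is the decoder's output distribution at step i and
    targets[i] is the index of the correct label at that step.
    """
    return -sum(math.log(probs[t]) for probs, t in zip(step_probs, targets))


def multitask_loss(transcript_probs, transcript, semantics_probs, semantics):
    """Sum of the transcript and semantics losses (equal weights are an assumption)."""
    return sequence_nll(transcript_probs, transcript) + sequence_nll(semantics_probs, semantics)


# Toy example with a 2-symbol vocabulary for each decoder.
probs_w = [[0.5, 0.5], [0.25, 0.75]]   # transcript decoder, 2 steps
probs_s = [[0.1, 0.9]]                 # semantics decoder, 1 step
loss = multitask_loss(probs_w, [0, 1], probs_s, [1])
print(loss > 0.0)  # True
```

Minimizing this sum over the training set is equivalent to maximizing the summed log-probabilities of the transcript and the semantics, and gradients from both terms reach the shared encoder.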

2.6. Multistage model

The multistage (MS) model, when trained under the maximum likelihood criterion, is most similar to the conventional approach of training the ASR and NLU components independently. In MS modeling, semantics are assumed to be conditionally independent of acoustics given the transcript:

$$P(S, W \mid X; \theta) = P(S \mid W; \theta_S)\, P(W \mid X; \theta_W) \tag{6}$$

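Given this formulation, θ can be learned as:

$$\theta^{*} = \arg\max_{\theta} \sum_{(X, W, S)} \left[ \log P(W \mid X; \theta_W) + \log P(S \mid W; \theta_S) \right] \tag{7}$$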

Here, $\theta_W$ and $\theta_S$ are, respectively, the parameters of the first stage, which predicts the transcript, and the second stage, which predicts semantics. For each training example, we assume that the triplet (X, W, S) is available. As a result, the two terms in Eq. 6 can be independently optimized, thereby reducing the model to a conventional 2-stage SLU system. In practice, however, it is possible to weakly tie the two stages together during training by using the predicted W at each time-step and allowing the gradients to pass from the second stage to the first stage through that label index. In Section 4, we present results using alternative strategies to pick W from the first stage to propagate to the second stage, like the argmax of the softmax layer or sampling from the multinomial distribution induced by the softmax layer. By weakly tying the two stages, we allow the first stage to be optimized jointly with the second stage, based on the criterion that is relevant for both stages.

Table 2: Distribution of domains considered in this study in the training and test data.

image
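The two strategies for picking W from the first stage can be sketched as follows. This is only an illustration of the selection step, with hypothetical function names and toy logits; it does not show how gradients are passed between the stages.

```python
import math
import random


def softmax(logits):
    """Convert logits to a probability distribution."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_argmax(probs):
    """Index of the most probable label."""
    return max(range(len(probs)), key=probs.__getitem__)


def pick_sample(probs, rng):
    """Draw one label index from the multinomial distribution given by probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def pick_transcript(step_logits, strategy, rng=None):
    """Choose one label per decoder step to pass on to the second stage."""
    labels = []
    for logits in step_logits:
        probs = softmax(logits)
        if strategy == "argmax":
            labels.append(pick_argmax(probs))
        else:
            labels.append(pick_sample(probs, rng))
    return labels


# Toy first-stage outputs: 3 decoder steps over a 4-symbol vocabulary.
step_logits = [[2.0, 0.1, 0.1, 0.1], [0.1, 0.1, 3.0, 0.1], [0.5, 0.4, 0.3, 2.5]]
print(pick_transcript(step_logits, "argmax"))  # [0, 2, 3]
sampled = pick_transcript(step_logits, "sample", random.Random(0))
print(len(sampled))  # 3
```

The argmax strategy always forwards the first stage's single best label, whereas sampling occasionally forwards lower-probability labels, so the second stage sees a wider range of first-stage outputs during training.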

One of the advantages of the multistage approach is that the parameters for the 2 tasks are decoupled. Therefore, we can easily use different corpora to train each stage. Typically, the amount of data available to train a speech recognizer far exceeds the amount available to train an NLU system. In such cases, we can use the available ASR training data to tune the first stage and finally train the entire system using whatever data is available to train jointly. Furthermore, a stronger coupling between the 2 stages can be made when optimizing an alternative loss criterion like the minimum Bayes risk (MBR) [26, 27]. We leave these aspects of multistage modeling to future work, as the focus of the current study is to understand the feasibility of predicting directly from audio and training jointly.

3. Experimental setup

3.1. Data

Our training data consists of 24M anonymized English utterances transcribed by humans. Similarly, our test set consists of 16K hand-transcribed utterances. Both training and test sets represent a slice of traffic from Google Home that we are interested in. The labeling for domain, intent, and arguments is generated by passing the ground-truth transcription through context free grammars (CFG). The CFGs are used to parse and transform ground-truth transcripts to domain, intent, and arguments. We only consider non-conversational (one-shot) queries in this work. In total, there are 5 domains: MEDIA, MEDIA CONTROL, PRODUCTIVITY, DELIGHT, and NONE. As the name suggests, any utterance that cannot be classified into the first four domains is labeled NONE. We consider approximately 20 intents in this study, such as SET ALARM, SELF NOTE, etc., and two arguments: DATETIME and SUBJECT. The distribution of domains in the train and test sets is shown in Table 2.

3.2. Serializing/De-serializing Semantics

We use a simple scheme for serializing semantics: The domain is specified first using a special tag ‘<DOMAIN>’ followed by its name. If the domain is further divided into intents, we use the tag ‘<INTENT>’ followed by the intent’s name. Any optional arguments are specified similarly using the name of the argument and its corresponding value. Table 1 shows a few example transcripts and their corresponding serialized semantics.

At inference time, the predicted semantics sequence, S, is deserialized in a similar fashion to extract the domain, intent, and argument label and values. This is done using a simple parser that tokenizes the sequence by the domain tag, intent tag and argument name and treats the sequence in between them as the corresponding values. This parser is agnostic to the order of these special tags, i.e., the domain tag can come ahead of the intent tag. In the case of the joint model where the output sequence is the concatenation of the transcript and semantics, the first observed special tag or argument name marks the start of the semantic sequence.
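A minimal sketch of such a serializer and parser is given below. The exact tag spelling (for example `<SET_ALARM>` with an underscore), the helper names, and the handling of the joint model's output are assumptions made for illustration; the actual format is the one shown in Table 1.

```python
import re

# Assumed argument inventory and tag pattern, for illustration only.
ARGUMENT_NAMES = ("DATETIME", "SUBJECT")
TAG = re.compile(r"<([A-Z_]+)>")


def serialize(domain, intent=None, arguments=None):
    """Serialize semantics: domain first, then intent, then argument names and values."""
    out = "<DOMAIN><%s>" % domain
    if intent is not None:
        out += "<INTENT><%s>" % intent
    for name, value in (arguments or {}).items():
        out += "<%s>%s" % (name, value)
    return out


def deserialize(sequence):
    """Parse a predicted sequence into (transcript, semantics).

    Text before the first special tag is treated as the transcript, which
    covers the joint model's [W : S] output; it is empty for other models.
    """
    first = TAG.search(sequence)
    if first is None:
        return sequence.strip(), {}
    transcript = sequence[:first.start()].strip()
    # re.split with one capture group yields ['', tag1, text1, tag2, text2, ...].
    pieces = TAG.split(sequence[first.start():])
    tags, texts = pieces[1::2], pieces[2::2]
    semantics = {"domain": None, "intent": None, "arguments": {}}
    i = 0
    while i < len(tags):
        tag = tags[i]
        if tag in ("DOMAIN", "INTENT") and i + 1 < len(tags):
            # The tag that follows <DOMAIN> or <INTENT> is the name itself.
            semantics[tag.lower()] = tags[i + 1]
            i += 2
        elif tag in ARGUMENT_NAMES:
            semantics["arguments"][tag] = texts[i].strip()
            i += 1
        else:
            i += 1
    return transcript, semantics


serialized = serialize("PRODUCTIVITY", "SET_ALARM", {"DATETIME": "two o'clock"})
print(serialized)
# <DOMAIN><PRODUCTIVITY><INTENT><SET_ALARM><DATETIME>two o'clock
print(deserialize("set an alarm at two o'clock" + serialized))
```

Because the parser keys on the tags rather than on their positions, it accepts the domain and intent tags in either order.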

The vocabulary that we use includes the domain and intent tags, and the domain, intent and argument names (i.e., all symbols enclosed in “<” and “>” in Table 1), as well as English graphemes for representing transcript and argument values. The graphemes in this study are limited to lowercase English letters and digits, punctuation and a few other special symbols such as underscore, brackets, start-of-sentence, and end-of-sentence. The total size of the vocabulary is 110. Note that the special tags used for representing semantics are each a single output, e.g., “<DOMAIN>” is one output and not eight graphemes “<, D, ..., N, >”.

3.3. Models

All experiments use the same acoustic features: 80-dimensional log-Mel filterbanks, computed with a 25 ms window, shifted every 10 ms. Similar to [28], features from 3 contiguous frames are stacked, resulting in a 240-dimensional vector. These stacked features are downsampled by a factor of 3, generating inputs at a 30 ms frame rate that the encoder operates on.
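The stacking and downsampling step can be sketched in a few lines of plain Python. The function name is hypothetical, the feature values are dummies, and two details are assumptions because the text does not specify them: each stack is formed from a frame and the two frames that follow it, and trailing frames that cannot fill a complete stack are dropped.

```python
def stack_and_downsample(frames, stack=3, stride=3):
    """Concatenate `stack` contiguous frames and keep every `stride`-th stacked frame.

    `frames` is a list of equal-length feature vectors, one per 10 ms step.
    """
    stacked = []
    for start in range(0, len(frames) - stack + 1, stride):
        vector = []
        for frame in frames[start:start + stack]:
            vector.extend(frame)
        stacked.append(vector)
    return stacked


# One second of audio: 100 frames of 80-dimensional features (dummy values).
frames = [[float(t)] * 80 for t in range(100)]
inputs = stack_and_downsample(frames)
print(len(inputs), len(inputs[0]))  # 33 240
```

With a 10 ms frame shift, keeping every third stacked frame gives one 240-dimensional input every 30 ms, which reduces the sequence length seen by the encoder by a factor of 3.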

Table 3: Model architectures used in the experiments. In each of the Enc/Dec columns, the first number indicates the number of layers and the second number shows the number of cells per layer. The cell type in all the models is Long Short Term Memory (LSTM). The last column shows the total number of parameters (in million).

image

Table 3 summarizes the architecture of the various models used in our experiments. We maintain a similar number of parameters (within 15% difference) across models to allow for a fair comparison. All encoders and decoders use Long Short Term Memory (LSTM) [29] cells. The first encoder in all models is unidirectional, while the second encoder (in multistage models) uses bidirectional LSTMs [30]. Prior work [5] has shown that using bidirectional cells for encoding a transcript for the task of classifying its domain and intent achieves better performance compared to the unidirectional version. The first layer in all decoders is an embedding layer of size 128. The second encoder in the multistage model, which takes the transcript as input, also uses an embedding layer of the same size. All decoders use 4-headed additive attention [31, 17, 22].

Table 4: Domain and intent F1 scores, and argument WER for the predicted semantics.

image

Our Baseline is the multistage model in which the two stages, which perform ASR and NLU, are trained independently but on the same training data. We consider 2 variants of the multistage model that weakly couple the 2 stages. Multistage (ArgMax) passes the argmax of the softmax layer of the first stage decoder, which predicts transcripts, to the second stage. Multistage (SampledSoftmax), on the other hand, passes on an unbiased sample from the multinomial distribution represented by the output of the softmax layer [32].

All neural networks are trained from scratch with the cross-entropy criterion in the TensorFlow framework [33]. We use beam search during inference with a beam size of 8. The models are trained on Tensor Processing Units [34] with the Adam optimizer [35] and synchronous gradient descent.

3.4. Evaluation Metrics

We use the typical ASR and NLU metrics for evaluation. For models that generate the transcript, we measure and report word error rate (WER). For semantics, we measure multi-class F1 scores [36] for domain and intent. NLU systems that use in-out-begin (IOB) format for tagging arguments (see [5] for an example of IOB format) report F1 scores for argument tags (e.g., [36] in the case of named-entities), but it is not clear how to measure this metric when the input transcript and the output arguments do not match, or when the input is audio. For example, if the ground truth semantics contains “<DATETIME>five p.m.” but the hypothesized semantics is “<DATETIME>high p.m.”, it would be useful to have an error metric that captures the misrecognition of “five” as “high”. For that reason, we choose to report WER for the arguments, instead of the F1 scores. In our computation, we count both over-triggers and misses as 100% WER. For example, if the ground truth semantics contains a DATETIME argument, but the recognized semantics does not, that instance has a 100% WER for DATETIME. We compute per-argument WER and report the weighted average, where each argument’s WER is weighted according to its number of occurrences.
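One possible reading of this argument metric is sketched below. The helper names are hypothetical, and two details are assumptions not stated in the text: WER is computed per argument instance and averaged within each argument, and an occurrence is counted whenever an argument appears in either the reference or the hypothesis.

```python
def word_errors(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        cur = [i]
        for j, h in enumerate(hyp_words, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def instance_wer(ref_value, hyp_value):
    """WER for one argument instance; a miss or an over-trigger counts as 100%."""
    if ref_value is None or hyp_value is None:
        return 1.0
    ref_words = ref_value.split()
    return word_errors(ref_words, hyp_value.split()) / max(len(ref_words), 1)


def weighted_argument_wer(examples):
    """`examples` is a list of (reference_args, hypothesis_args) dict pairs."""
    totals, counts = {}, {}
    for ref_args, hyp_args in examples:
        for name in set(ref_args) | set(hyp_args):
            wer = instance_wer(ref_args.get(name), hyp_args.get(name))
            totals[name] = totals.get(name, 0.0) + wer
            counts[name] = counts.get(name, 0) + 1
    n = sum(counts.values())
    if n == 0:
        return {}, 0.0
    per_argument = {name: totals[name] / counts[name] for name in totals}
    # Weight each argument's WER by its number of occurrences.
    overall = sum(per_argument[name] * counts[name] for name in counts) / n
    return per_argument, overall


examples = [
    ({"DATETIME": "five p.m."}, {"DATETIME": "high p.m."}),  # 1 of 2 words wrong
    ({"DATETIME": "two o'clock"}, {}),                       # miss
    ({"SUBJECT": "buy milk"}, {"SUBJECT": "buy milk"}),      # exact match
]
per_argument, overall = weighted_argument_wer(examples)
print(per_argument["DATETIME"], per_argument["SUBJECT"], overall)  # 0.75 0.0 0.5
```

In the toy example, DATETIME occurs twice with instance WERs of 50% and 100%, SUBJECT occurs once with 0%, and the occurrence-weighted average is 50%.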

4. Results

Table 4 compares domain, intent, and argument prediction performance of the models presented in the previous section. As can be seen, all models perform relatively similarly when it comes to classifying the domains. The Joint model works the best, with an F1 score of 96.8%. The Direct model, which has the lowest F1 score, is only worse by 0.6% absolute. Performance on intent prediction is slightly worse, on average, compared to domain prediction. The Multitask and Joint models achieve the best F1 scores of 95.8% and 95.7%, respectively. Both these models use the encoded acoustic features as input to the decoder, and unlike the Direct model, also predict the transcripts. This shows that having access to acoustic features and having an intermediate text representation are both important when predicting intent.

Comparing the Baseline model with the multistage models that weakly couple the 2 stages, Multistage (ArgMax) and Multistage (SampledSoftmax), we can see that they all work very similarly when it comes to domain and intent prediction, and are generally worse than the Joint and Multitask models. This further shows the importance of complementing transcripts with acoustic features when predicting intent.

The differences in argument WER are more pronounced among the different models. The Direct model performs the worst, getting a WER of 18.2. This shows that including transcription loss while training end-to-end models can help improve argument prediction. Contrary to domain and intent F1 scores, Multistage (ArgMax) and Multistage (SampledSoftmax) work better than the Joint and Multitask models. Nevertheless, all jointly optimized models work better than the independently trained baseline. Notably, the Multistage (SampledSoftmax) model improves upon the baseline multistage model by 18% relative.

Since the domain, intent and argument labeling for training and test data was obtained using CFG-parsers, we did a second experiment that used the predicted transcript from the various models, pipelined with the same CFG-parsers. The CFG-parsers are used to derive domain, intent, and arguments from the predicted transcript. Results are shown in Table 5. The table also shows the overall WER obtained by the various models. Compared to the results in Table 4, we can see that domain F1 scores are similar, but intent F1 scores are better. Interestingly, the argument WER significantly improved. For example, for the Baseline model, WER improved from 15.0 to 11.9. While this is not entirely surprising, since this strategy of predicting semantics matches what is used for generating ground truth labels for training data, it is interesting to see that models that are optimized jointly still work better in terms of intent F1 scores and argument WER. For example, the Multitask model gets an intent F1 score of 97.2, which is better than the baseline by 1.3 points. Similarly, Multistage (SampledSoftmax) and Joint models get an argument WER of 11.3, which is 0.6% absolute better than the baseline. The results show that joint training can also help improve performance of the ASR component of the model when using the original CFG-parser for intent prediction.
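Since the comparisons above mix absolute and relative differences, a small worked example may help. The helper name is hypothetical; the WER values are the ones quoted in the paragraph above.

```python
def relative_improvement(baseline, improved):
    """Fraction by which `improved` reduces `baseline` (lower is better for WER)."""
    return (baseline - improved) / baseline


# Baseline argument WER with CFG-parsed transcripts vs. the jointly trained models.
print(round(100 * relative_improvement(11.9, 11.3), 1))  # 5.0
# Baseline argument WER: predicted semantics (Table 4) vs. CFG-parsed transcripts (Table 5).
print(round(100 * relative_improvement(15.0, 11.9), 1))  # 20.7
```

The 0.6% absolute gain of the jointly trained models over the baseline therefore corresponds to roughly a 5% relative reduction in argument WER.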

Table 5: Transcription WER, domain and intent F1 scores, and argument WER when NLU is performed on the model’s top recognized transcript using the CFG-parser that was used for generating truth semantic labels during training.

image

5. Conclusions

In this work, we have proposed and evaluated multiple end-to-end approaches to SLU that optimize the ASR and NLU components of the system jointly. We show that joint optimization results in better performance not just when we do end-to-end domain, intent, and argument prediction, but also when using the transcripts generated by a jointly trained end-to-end model and a conventional CFG-parser for NLU. Our results highlight several important aspects of joint optimization. We show that having an intermediate text representation is important when learning SLU systems end-to-end. As expected, our results also show that joint optimization helps the model focus on errors that matter more for SLU, as evidenced by the lower argument WERs obtained by models that couple ASR and NLU. It was also observed that direct prediction of semantics from audio, ignoring the ground truth transcript, does not perform as well.

There are several interesting avenues to improve performance going forward. As noted before, the amount of training data that is available to train ASR is usually several times larger than what is available to train NLU systems. It would be interesting to understand how a jointly optimized model can make use of ASR data to improve performance. For optimization, the current work uses the cross-entropy loss. Future work will consider more task-specific losses, like MBR, that optimize intent and argument prediction accuracy directly. It is also important to understand how to incorporate new grammars with limited training data into an end-to-end system. The CFG-parsing based approach that decouples itself from ASR can easily incorporate additional grammars. But end-to-end optimization relies on data to learn new grammars, making the introduction of new domains more challenging.

Framing spoken language understanding as a sequence-to-sequence problem that is optimized end-to-end significantly reduces the overall complexity. It is also easy to scale such models to more complex tasks, e.g., tasks that involve multiple intents within a single user input, or tasks for which it is not easy to create a CFG-based parser. The ability to run inference without the need for additional resources like a lexicon, language models and parsers also makes them ideal for deploying on devices with limited compute and memory footprint.

6. Acknowledgements

The authors would like to thank Edgar Gonzàlez Pellicer, Alex Kouzemtchenko, Ben Swanson, Ashish Venugopal, and Kai Zhao for help in obtaining labels used as truth for semantics, and Kanishka Rao for discussions around the joint model. We thank Khe Chai Sim and Amarnag Subramanya for helpful feedback on earlier drafts of the paper.

7. References

[1] Ilya Sutskever, Oriol Vinyals, and Quoc V Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, 2014, pp. 3104–3112.

[2] Gokhan Tur and Renato De Mori, Spoken language understanding: Systems for extracting semantic information from speech, John Wiley & Sons, 2011.

[3] Bo Li, Tara Sainath, Arun Narayanan, Joe Caroselli, Michiel Bacchiani, Ananya Misra, Izhak Shafran, Hasim Sak, Golan Pundak, Kean Chin, et al., “Acoustic modeling for google home,” INTERSPEECH-2017, pp. 399–403, 2017.

[4] Young-Bum Kim, Sungjin Lee, and Karl Stratos, “Onenet: Joint domain, intent, slot prediction for spoken language understanding,” in Automatic Speech Recognition and Understanding Workshop (ASRU), 2017 IEEE. IEEE, 2017, pp. 547–553.

[5] Dilek Hakkani-Tür, Gökhan Tür, Asli Celikyilmaz, Yun-Nung Chen, Jianfeng Gao, Li Deng, and Ye-Yi Wang, “Multi-domain joint semantic frame parsing using bi-directional rnn-lstm,” in Interspeech, 2016, pp. 715–719.

[6] Andreas Stolcke and Jasha Droppo, “Comparing human and machine errors in conversational speech transcription,” in Proc. INTERSPEECH, 2017, pp. 137–141.

[7] George Saon, Gakuto Kurata, Tom Sercu, Kartik Audhkhasi, Samuel Thomas, Dimitrios Dimitriadis, Xiaodong Cui, Bhuvana Ramabhadran, Michael Picheny, Lynn-Li Lim, Bergul Roomi, and Phil Hall, “English conversational telephone speech recognition by humans and machines,” in Proc. INTERSPEECH, 2017, pp. 132–136.

[8] “Google duplex: An AI system for accomplishing real-world tasks over the phone,” https://ai.googleblog.com/2018/05/duplex-ai-system-for-natural-conversation.html, Accessed: 2018-06-29.

[9] Dmitriy Serdyuk, Yongqiang Wang, Christian Fuegen, Anuj Kumar, Baiyang Liu, and Yoshua Bengio, “Towards end-to-end spoken language understanding,” arXiv preprint arXiv:1802.08395, 2018.

[10] Fabrizio Morbini, Kartik Audhkhasi, Ron Artstein, Maarten Van Segbroeck, Kenji Sagae, Panayiotis Georgiou, David R Traum, and Shri Narayanan, “A reranking approach for recognition and classification of speech input in conversational dialogue systems,” in Spoken Language Technology Workshop (SLT), 2012 IEEE. IEEE, 2012, pp. 49–54.

[11] Dilek Hakkani-Tür, Frédéric Béchet, Giuseppe Riccardi, and Gokhan Tur, “Beyond asr 1-best: Using word confusion networks in spoken language understanding,” Computer Speech & Language, vol. 20, no. 4, pp. 495–514, 2006.

[12] Faisal Ladhak, Ankur Gandhe, Markus Dreyer, Lambert Mathias, Ariya Rastrow, and Björn Hoffmeister, “Latticernn: Recurrent neural networks over lattices,” in INTERSPEECH, 2016, pp. 695–699.

[13] Raphael Schumann and Pongtep Angkititrakul, “Incorporating asr errors with attention-based, jointly trained rnn for intent detection and slot filling,” in Acoustics, Speech and Signal Processing (ICASSP), 2018 IEEE International Conference on. IEEE, 2018.

[14] Yuan-Ping Chen, Ryan Price, and Srinivas Bangalore, “Spoken language understanding without speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2018 IEEE International Conference on. IEEE, 2018.

[15] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd international conference on Machine learning. ACM, 2006, pp. 369–376.

[16] Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, “Learning phrase representations using rnn encoder–decoder for statistical machine translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, Oct. 2014, pp. 1724–1734, Association for Computational Linguistics.

[17] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.

[18] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al., “Google’s neural machine translation system: Bridging the gap between human and machine translation,” arXiv preprint arXiv:1609.08144, 2016.

[19] C.C. Chiu, T.N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R.J. Weiss, K. Rao, K. Gonina, and N. Jaitly, “State-of-the-art speech recognition with sequence-to-sequence models,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017.

[20] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 4960–4964.

[21] Bing Liu and Ian Lane, “Attention-based recurrent neural network models for joint intent detection and slot filling,” in Proc. Interspeech, 2016, pp. 685–689.

[22] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.

[23] Chung-Cheng Chiu, Tara Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Katya Gonina, Navdeep Jaitly, Bo Li, Jan Chorowski, and Michiel Bacchiani, “State-of-the-art speech recognition with sequence-to-sequence models,” 2018.

[24] Rich Caruana, “Multitask learning,” Machine learning, vol. 28, no. 1, pp. 41–75, 1997.

[25] Minh-Thang Luong, Quoc V Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser, “Multi-task sequence to sequence learning,” arXiv preprint arXiv:1511.06114, 2015.

[26] Matt Shannon, “Optimizing expected word error rate via sampling for speech recognition,” 2017.

[27] Rohit Prabhavalkar, Tara N Sainath, Yonghui Wu, Patrick Nguyen, Zhifeng Chen, Chung-Cheng Chiu, and Anjuli Kannan, “Minimum word error rate training for attention-based sequence-to-sequence models,” arXiv preprint arXiv:1712.01818, 2017.

[28] Hasim Sak, Andrew W. Senior, Kanishka Rao, and Françoise Beaufays, “Fast and accurate recurrent neural network acoustic models for speech recognition,” in Proceedings of Interspeech, 2015.

[29] Jürgen Schmidhuber and Sepp Hochreiter, “Long short-term memory,” Neural Comput, vol. 9, no. 8, pp. 1735–1780, 1997.

[30] Mike Schuster and Kuldip K Paliwal, “Bidirectional recurrent neural networks,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.

[31] Chung-Cheng Chiu, Tara N Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J Weiss, Kanishka Rao, Katya Gonina, et al., “State-of-the-art speech recognition with sequence-to-sequence models,” arXiv preprint arXiv:1712.01769, 2017.

[32] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer, “Scheduled sampling for sequence prediction with recurrent neural networks,” in Advances in Neural Information Processing Systems, 2015, pp. 1171–1179.

[33] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al., “Tensorflow: a system for large-scale machine learning,” in OSDI, 2016, vol. 16, pp. 265–283.

[34] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al., “In-datacenter performance analysis of a tensor processing unit,” in Computer Architecture (ISCA), 2017 ACM/IEEE 44th Annual International Symposium on. IEEE, 2017, pp. 1–12.

[35] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

[36] Erik F. Tjong Kim Sang and Fien De Meulder, “Introduction to the conll-2003 shared task: Language-independent named entity recognition,” in Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, Stroudsburg, PA, USA, 2003, CONLL ’03, pp. 142–147, Association for Computational Linguistics.

