A dialogue act (DA) is defined as the function of a speaker’s utterance during a conversation [24], for example, question, answer, request, suggestion, etc. The last two decades have seen many developments in automatic classification of DAs in both spoken [1, 32] and written [17, 36] conversations. With the increased use of instant messaging and group chat applications, written conversations have become highly prevalent in the modern social and business world. Accurate identification of DAs, especially in business group chats, has many applications such as conversation summarization, question answering, and workflow automation (e.g. reservation systems, scheduling assistants). It is also a critical component in building end-to-end conversational systems [35, 37, 38]. However, DA classification in written conversations comes with many interesting challenges. Group chats may contain multiple parties conversing simultaneously, which leads to entanglement of utterances: a given utterance could be responding to an utterance that is many turns above, or could be starting a brand new conversational thread. Unlike spoken conversations, written conversations do not have any prosodic cues, which have been shown to be useful for DA modeling [11, 31]. Due to the informal nature of group chats, they tend to contain domain-specific jargon, abbreviations, and emoticons, which further add to the modeling challenges.
Several classical machine learning techniques have been applied to DA classification [1, 26, 33]. More recently, with the advances in neural networks, deep learning architectures such as Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and their hierarchical variants have also been used for this problem [4, 16, 21, 27]. Typically, in these models, the DA of a given utterance is predicted based on three factors: (1) textual content of the utterance, (2) user turns, and (3) contextual information. User turns are usually captured as simple binary features, i.e. whether the current utterance is from the same user as the prior utterance. Context is obtained either from surrounding utterances within a pre-defined neighborhood window (e.g. the prior two utterances), or from the entire dialog history, where influence reduces with distance.
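As a minimal sketch of the two simplest context signals described above (a binary same-user feature and a fixed window of prior utterances), assume a conversation is given as a list of speaker names. The function name `turn_and_window_features` and the default window size are illustrative choices and are not taken from the cited systems.

```python
def turn_and_window_features(speakers, window=2):
    """Return (same_user_as_previous, context_indices) for each utterance."""
    features = []
    for k, speaker in enumerate(speakers):
        # Binary user-turn feature: is this utterance from the same user as the prior one?
        same_as_prev = k > 0 and speakers[k - 1] == speaker
        # Fixed neighborhood window: indices of up to `window` preceding utterances.
        context = list(range(max(0, k - window), k))
        features.append((same_as_prev, context))
    return features


features = turn_and_window_features(["A", "B", "B", "A"])
```

In this toy conversation the last utterance by A only sees the two utterances by B in its window, which illustrates how a fixed window can miss the same user's own earlier post.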
In group chats that have many entangled conversational threads, utterances from a fixed context window might not contain pertinent or sufficient information for DA classification. Likewise, representing context as a flat sequence of all prior utterances would not capture user information, i.e. which utterance was posted by which user. This creates a need for a more systematic way to incorporate contextual information for DA classification. To address this issue, we introduce DAG-LSTM, based on Tree-LSTMs [34], which are a generalization of LSTMs that support richer network topologies by allowing each LSTM unit to incorporate information from multiple parent units. However, multiple parent units in an LSTM can lead to state explosions because the number of additive terms increases exponentially with the length of the conversation. To avoid this, we modify the memory cell operations to choose an elementwise maximum over multiple vectors, thus effectively choosing a path through one of the child units. We exploit this model architecture to integrate more relevant contextual information for DA classification, and we show that the proposed approach performs better than regular LSTMs, CNNs, and their hierarchical variants.
The rest of the paper is organized as follows. In the next section, we discuss some historical and relevant work in DA modeling. In Section 3, we formulate DA classification as a DAG-LSTM and describe all involved components. This is followed by our experimental work and results. We wrap up the paper with our conclusions and directions for future work.
Dialogue acts have been studied in linguistics from as early as the 1960s [3, 28]. They have become part of computational linguistics [5] in the last two decades especially with the availability of annotated corpora such as the Switchboard corpus [15] and the Meeting Recorder Dialogue Act (MRDA) corpus [30]. The Switchboard corpus contains utterances from over 1,155 one-on-one telephonic conversations annotated into 42 different DAs. The MRDA corpus has 75 multi-party meetings labeled into over 50 different DAs. Researchers have used many different machine learning algorithms for DA classification such as Hidden Markov Models (HMM) [33], Support Vector Machines (SVM) [26], Maxent classifiers [1], and Dynamic Bayesian Networks (DBN) [10].
Kalchbrenner and Blunsom [16] were among the first to apply deep learning approaches to DA classification, using Recurrent Convolutional Neural Networks. Barahona et al. [4] used a combination of CNNs for sentence representation and LSTMs for context representation. Both Liu et al. [21] and Ribeiro et al. [27] have studied various combinations of CNNs, LSTMs, and BiLSTMs for sentence and contextual representation.
Though the majority of research in DA classification has focused on spoken conversations, some recent works have started building DA datasets for written conversations. Kim et al. [18] have published a corpus with 33 live online discussions annotated with 14 dialogue acts. Likewise, Forsyth and Martell [12] published the NPS chat corpus, which contains over 10,000 annotated posts that were gathered from various online chat services. More recently, Asher et al. [2] published the STAC corpus, which includes strategic chat conversations from an online version of the game The Settlers of Catan. Another relevant corpus is DailyDialog [20], which contains one-on-one conversations annotated for emotions, topics, and DAs. For a thorough overview of various dialog-system related corpora, please refer to the review paper by Serban et al. [29].
3.1 Problem Statement
Let $U$ be a group chat consisting of a sequence of utterances $u_1, \dots, u_K$, where an utterance $u_k$ is written by one of the participants $p_k \in P$, and $P$ denotes the set of chat participants. Given this sequence, the DA classification problem is to assign each utterance $u_k$ a label $y_k \in Y$, where $Y$ denotes the pre-defined dialogue acts.
3.2 Formulation
We formulate the above stated problem as a sequence modeling task solved specifically using a variant of Tree-LSTMs as described below. Let utterance $u_k$ contain a sequence of words $w^k_{1:T_k}$ ($w^k_{1:T_k}$ is a shorthand for $w^k_1, \dots, w^k_{T_k}$). We first map each word $w^k_t$ to a fixed-size word vector $\omega^k_t$. Then, an utterance model is used to compute a vector representation $v_k$ for the entire utterance $u_k$, given its word vector sequence $\omega^k_{1:T_k}$. Subsequently, a conversation model is used to compute another vector representation $\phi_k$ given the current and all of the previous utterance vectors $v_{1:k}$, which contextualizes $u_k$ and summarizes the state of the conversation so far. Finally, a classifier layer is used to map $\phi_k$ to $y_k$.
Below, we describe each of the components in further detail.
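Representing Utterances. We use a bidirectional LSTM to represent each utterance [13, 14]. Let $\mathrm{lstm}(\omega_{1:t})$, applied to the word vectors $\omega_1, \dots, \omega_t$ of an utterance (omitting the utterance index $k$ for brevity), be recursively defined as follows:
$$(h_t, c_t) = \mathrm{lstm}(\omega_{1:t}) = \mathrm{step}_{\mathrm{lstm}}\big(\omega_t, \mathrm{lstm}(\omega_{1:t-1})\big)$$
where the step function $(h_t, c_t) = \mathrm{step}_{\mathrm{lstm}}\big(\omega_t, (h_{t-1}, c_{t-1})\big)$ is defined such that
\begin{align*}
i_t &= \sigma(W_{ix}\omega_t + W_{ih}h_{t-1} + b_i) \\
f_t &= \sigma(W_{fx}\omega_t + W_{fh}h_{t-1} + b_f) \\
o_t &= \sigma(W_{ox}\omega_t + W_{oh}h_{t-1} + b_o) \\
g_t &= \tanh(W_{gx}\omega_t + W_{gh}h_{t-1} + b_g) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{align*}
where $W_{\cdot}$ and $b_{\cdot}$ are weight matrices and bias vectors, respectively, $i_t$, $f_t$, and $o_t$ are input, forget, and output gates, respectively, and $\odot$ denotes elementwise product.
When the recurrence is defined in terms of the past as above, we get a forward directed LSTM, denoted $\overrightarrow{\mathrm{lstm}}$. Alternatively, we can define it in terms of the future to get a backward directed LSTM:
$$(\overleftarrow{h}_t, \overleftarrow{c}_t) = \overleftarrow{\mathrm{lstm}}(\omega_{t:T}) = \mathrm{step}_{\overleftarrow{\mathrm{lstm}}}\big(\omega_t, \overleftarrow{\mathrm{lstm}}(\omega_{t+1:T})\big)$$
Concatenating $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$, we get a contextualized representation of word $w_t$ inside the utterance:
$$\overleftrightarrow{h}_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$$
Finally, contextualized representations are (affinely) transformed into a feature space, which is then pooled across all the words in the utterance:
$$v_k = \max_{1 \le t \le T_k} \big(W_u \overleftrightarrow{h}_t + b_u\big)$$
where max denotes the elementwise maximum across multiple vectors and $W_u$ and $b_u$ are the weight matrix and bias vector, respectively. At the end of this operation, we have a single fixed size vector $v_k$ that represents an entire utterance $u_k$.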
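The utterance model described above (a forward and a backward LSTM over the words, an affine map, then elementwise max-pooling) can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' implementation: the helper names, the use of one weight matrix per gate over the concatenation of input and previous hidden state (equivalent to separate input and recurrent matrices), and the small uniform initializer are all assumptions.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def affine(W, x, b):
    """Compute W x + b for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step without peepholes; each gate sees the concatenation [x; h_prev]."""
    z = x + h_prev
    i = [sigmoid(a) for a in affine(p["Wi"], z, p["bi"])]
    f = [sigmoid(a) for a in affine(p["Wf"], z, p["bf"])]
    o = [sigmoid(a) for a in affine(p["Wo"], z, p["bo"])]
    g = [math.tanh(a) for a in affine(p["Wg"], z, p["bg"])]
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c


def run_lstm(xs, p, n_hid, reverse=False):
    """Return the hidden state at every position, scanning forward or backward."""
    order = range(len(xs) - 1, -1, -1) if reverse else range(len(xs))
    h, c = [0.0] * n_hid, [0.0] * n_hid
    states = [None] * len(xs)
    for t in order:
        h, c = lstm_step(xs[t], h, c, p)
        states[t] = h
    return states


def encode_utterance(word_vectors, fwd, bwd, W_u, b_u, n_hid):
    """BiLSTM over words, affine map, then elementwise max-pooling across words."""
    forward = run_lstm(word_vectors, fwd, n_hid)
    backward = run_lstm(word_vectors, bwd, n_hid, reverse=True)
    features = [affine(W_u, hf + hb, b_u) for hf, hb in zip(forward, backward)]
    return [max(column) for column in zip(*features)]


def random_lstm_params(n_in, n_hid, rng):
    """Hypothetical initializer: small uniform weights, zero biases."""
    p = {}
    for gate in "ifog":
        p["W" + gate] = [[rng.uniform(-0.1, 0.1) for _ in range(n_in + n_hid)]
                         for _ in range(n_hid)]
        p["b" + gate] = [0.0] * n_hid
    return p


rng = random.Random(0)
n_in, n_hid, n_feat = 4, 3, 5
fwd = random_lstm_params(n_in, n_hid, rng)
bwd = random_lstm_params(n_in, n_hid, rng)
W_u = [[rng.uniform(-0.1, 0.1) for _ in range(2 * n_hid)] for _ in range(n_feat)]
b_u = [0.0] * n_feat
words = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(6)]
v = encode_utterance(words, fwd, bwd, W_u, b_u, n_hid)  # one fixed-size vector
```

Because of the max-pooling, the output size depends only on the number of rows of `W_u`, not on the number of words in the utterance.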
Figure 1: Overall architecture over an example utterance sequence. Arrow color-style combination encodes shared connections within the model. Note the skip connections between consecutive utterances from the same participant, which are added in the form of additional children.
Representing sequences of utterances. Given a sequence of utterance vectors $v_{1:K}$, a simple method to represent it would be to use another LSTM model and feed the contextualized (given the history of past utterances) utterance vector to a final classifier layer:
$$(\phi_k, c_k) = \mathrm{lstm}(v_{1:k}), \qquad \hat{y}_k = \mathrm{softmax}(W_y \phi_k + b_y)$$
where $W_y$ and $b_y$ are a weight matrix and bias vector, and $\hat{y}_k$ denotes the predicted probability distribution over the dialogue act set $Y$.
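The final classifier layer can be illustrated with a small softmax sketch. The weights, the two-dimensional input vector, and the label subset below are hypothetical values chosen only to make the example concrete.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(phi, W_y, b_y, labels):
    """Map a contextualized utterance vector to a distribution over dialogue acts."""
    scores = [sum(w * x for w, x in zip(row, phi)) + b for row, b in zip(W_y, b_y)]
    probs = softmax(scores)
    best = max(range(len(labels)), key=lambda j: probs[j])
    return labels[best], dict(zip(labels, probs))


labels = ["Offer", "Counteroffer", "Refusal", "Other"]  # subset of acts, for illustration
W_y = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]  # hypothetical weights
b_y = [0.0, 0.0, 0.0, 0.0]
label, probs = classify([2.0, 0.5], W_y, b_y, labels)
```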
In this approach, a conversation would be represented as a flat sequence of utterances with no information about which utterance belongs to which participant. To address this, we add skip connections between consecutive posts from the same participant. This means that each utterance has two antecedents: (1) the previous utterance and (2) the previous utterance from the same participant. Doing so achieves two things: the model can build up a user history and link each utterance to a user’s particular history within the conversation, and utterances from the same user become closer in the computation graph.
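The two antecedents of each utterance can be computed in a single pass over the speaker sequence. The sketch below assumes speakers are given as a list of names; the edge labels `"prev"` and `"same"` are hypothetical. When two consecutive utterances come from the same participant, both edges point at the same node; keeping both is an assumption of this sketch, since the text does not say how that case is handled.

```python
def build_antecedents(speakers):
    """For each utterance, return its list of (edge_type, antecedent_index) pairs."""
    last_by_speaker = {}
    antecedents = []
    for k, speaker in enumerate(speakers):
        edges = []
        if k > 0:
            edges.append(("prev", k - 1))  # previous utterance in the chat
        if speaker in last_by_speaker:
            edges.append(("same", last_by_speaker[speaker]))  # skip connection
        antecedents.append(edges)
        last_by_speaker[speaker] = k
    return antecedents


graph = build_antecedents(["A", "B", "A", "C", "B"])
```

Here the third utterance (by A) links both to the second utterance and back to A's first post, while C's first post has only the previous-utterance edge.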
To this end, we employ Tree-LSTM equations which were previously used for similar computation graphs where each node in a graph has more than one child [34].
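Let $\mathrm{tlstm}(\{x_{\eta'}\}_{\eta' \in Sp(\eta)})$ denote a Tree-LSTM where $\eta$ is a node in a given tree or a graph, $Sp(\eta)$ denotes the index set for the subtree (subgraph) spanned by $\eta$, and $\{x_{\eta'}\}_{\eta' \in Sp(\eta)}$ denotes the nodes spanned by $\eta$. Then, tlstm is recursively defined in terms of the children of $\eta$, denoted $ch(\eta)$:
$$(h_\eta, c_\eta) = \mathrm{tlstm}(\{x_{\eta'}\}_{\eta' \in Sp(\eta)}) = \mathrm{step}_{\mathrm{tlstm}}\Big(x_\eta, \bigcup_{\eta' \in ch(\eta)} \mathrm{tlstm}(\{x_{\eta''}\}_{\eta'' \in Sp(\eta')})\Big)$$
where the step function is defined such that:
\begin{align*}
i_\eta &= \sigma\Big(W_{ix} x_\eta + \sum_{\eta' \in ch(\eta)} W_{ih}^{e(\eta',\eta)} h_{\eta'} + b_i\Big) \\
f_{\eta\eta'} &= \sigma\Big(W_{fx} x_\eta + \sum_{\eta'' \in ch(\eta)} W_{fh}^{e(\eta',\eta)\,e(\eta'',\eta)} h_{\eta''} + b_f\Big) \\
o_\eta &= \sigma\Big(W_{ox} x_\eta + \sum_{\eta' \in ch(\eta)} W_{oh}^{e(\eta',\eta)} h_{\eta'} + b_o\Big) \\
g_\eta &= \tanh\Big(W_{gx} x_\eta + \sum_{\eta' \in ch(\eta)} W_{gh}^{e(\eta',\eta)} h_{\eta'} + b_g\Big) \\
c_\eta &= i_\eta \odot g_\eta + \sum_{\eta' \in ch(\eta)} f_{\eta\eta'} \odot c_{\eta'} \tag{25} \\
h_\eta &= o_\eta \odot \tanh(c_\eta)
\end{align*}
where $e(\eta', \eta) \in E$ denotes the edge type (or label) that connects $\eta'$ to $\eta$. In general, $E$ can be an arbitrary fixed size set. In our work it is of size two: edges that connect the past utterance to the current utterance, and edges that connect the past utterance from the same participant to the current utterance. Since weights are parametrized by the edge types $e(\eta', \eta)$, the contribution of the past utterance vs. the past utterance from the same participant is computed differently.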
Note that we can apply Tree-LSTM equations even though our computation graphs are not trees but directed acyclic graphs (DAGs): each node feeds into not one parent but two (the next utterance and the next utterance from the same participant).
A key observation related to this fact is as follows: Let us consider the sink node (last utterance in a conversation) memory cell $c_K$. Since each node cell $c_k$ contributes to not one but two other cells $c_{k'}$ and $c_{k''}$ additively, recursively unfolding Eq. 25 for $c_K$ gives exponentially many additive terms of $c_k$ in the length of the shortest path from $k$ to $K$. This causes very quick state explosions in the length of a conversation, which we experimentally confirm.
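To make the growth concrete, the sketch below counts the distinct paths from the first utterance to every later utterance in the skip-connected graph; each path corresponds to one additive term when the sum-based cell update is unfolded. The function name is illustrative, and treating coincident antecedents (the previous utterance is also the previous utterance from the same participant) as a single edge is a conservative assumption of this sketch.

```python
def count_paths_from_first(speakers):
    """paths[k] = number of distinct paths from utterance 0 to utterance k."""
    last_by_speaker = {}
    paths = []
    for k, speaker in enumerate(speakers):
        if k == 0:
            paths.append(1)
        else:
            # Antecedents: previous utterance and previous utterance by the same speaker.
            children = {k - 1}
            if speaker in last_by_speaker:
                children.add(last_by_speaker[speaker])
            paths.append(sum(paths[c] for c in children))
        last_by_speaker[speaker] = k
    return paths


# Two participants strictly alternating for ten utterances.
paths = count_paths_from_first(["A", "B"] * 5)
# paths == [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
```

For two strictly alternating participants the counts follow the Fibonacci sequence, so the number of additive terms grows exponentially with the conversation length.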
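To combat this, we make a very simple modification to Eq. 25 as follows:
$$c_\eta = i_\eta \odot g_\eta + \max_{\eta' \in ch(\eta)} f_{\eta\eta'} \odot c_{\eta'}$$
where $i_\eta$, $g_\eta$, and $f_{\eta\eta'}$ are the input gate, the candidate update, and the per-child forget gate of node $\eta$, $ch(\eta)$ is its set of children, and max denotes the elementwise maximum over multiple vectors, which effectively picks (in an elementwise fashion) a path through either one of the children. Thus, cell growth can be at worst linear in the conversation length.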
Since the modified equations are more appropriate for DAGs than Tree-LSTMs, which suffer from state explosions, we call the modified model DAG-LSTM. Note that both Tree-LSTM and DAG-LSTM reduce to classical LSTMs (without the peephole connections) when each node has exactly one child and one parent.
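A single node update with edge-typed weights can be sketched as below, with a `combine` switch between the additive Tree-LSTM memory and the elementwise-max DAG-LSTM memory. This is a sketch under stated assumptions, not the authors' code: the exact gate parametrization (an N-ary Tree-LSTM style form in which the forget gate for a child is indexed by a pair of edge types), the parameter dictionary layout, and the helper `zero_params` are all hypothetical.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def vadd(*vectors):
    return [sum(terms) for terms in zip(*vectors)]


def dag_lstm_step(x, children, p, combine="max"):
    """children: list of (edge_type, h, c) triples. Returns (h, c) for this node."""
    def gate(name, act, target_edge=None):
        terms = [matvec(p["W" + name + "x"], x), p["b" + name]]
        for edge, h_child, _ in children:
            key = edge if target_edge is None else (target_edge, edge)
            terms.append(matvec(p["W" + name + "h"][key], h_child))
        return [act(a) for a in vadd(*terms)]

    i = gate("i", sigmoid)
    o = gate("o", sigmoid)
    g = gate("g", math.tanh)
    carried = []
    for edge, _, c_child in children:
        f = gate("f", sigmoid, target_edge=edge)  # one forget gate per child
        carried.append([fj * cj for fj, cj in zip(f, c_child)])
    if not carried:
        memory = [0.0] * len(i)
    elif combine == "max":
        memory = [max(terms) for terms in zip(*carried)]  # DAG-LSTM
    else:
        memory = [sum(terms) for terms in zip(*carried)]  # Tree-LSTM
    c = [ij * gj + mj for ij, gj, mj in zip(i, g, memory)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c


def zero_params(n_in, n_hid, edge_types=("prev", "same")):
    """Hypothetical helper: all-zero parameters, handy for checking the wiring."""
    def zx():
        return [[0.0] * n_in for _ in range(n_hid)]

    def zh():
        return [[0.0] * n_hid for _ in range(n_hid)]

    p = {}
    for name in "iog":
        p["W" + name + "x"] = zx()
        p["W" + name + "h"] = {e: zh() for e in edge_types}
        p["b" + name] = [0.0] * n_hid
    p["Wfx"] = zx()
    p["Wfh"] = {(a, b): zh() for a in edge_types for b in edge_types}
    p["bf"] = [0.0] * n_hid
    return p


p = zero_params(n_in=3, n_hid=2)
kids = [("prev", [0.0, 0.0], [2.0, -1.0]), ("same", [0.0, 0.0], [0.0, 4.0])]
h_max, c_max = dag_lstm_step([1.0, 1.0, 1.0], kids, p, combine="max")
h_sum, c_sum = dag_lstm_step([1.0, 1.0, 1.0], kids, p, combine="sum")
```

With all-zero parameters every gate equals 0.5 and the candidate is zero, so the new cell is just the combined, half-scaled child cells; the two modes then differ only in whether the children are summed or maximized elementwise. With a single child the two modes coincide, matching the reduction to a classical LSTM.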
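On top of the DAG-LSTM, the classification layer works the same as before:
$$\hat{y}_k = \mathrm{softmax}(W_y \phi_k + b_y)$$
where $\phi_k$ is now the DAG-LSTM hidden state for utterance $u_k$.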
Related architectures. To our knowledge there are two DAG-based variants of the LSTM architecture [9, 39], both of which operate differently from ours. In [39], nodes with multiple children require a binarization operation specific to a task, and the contribution of each child is computed using the same weights. In [9], the architecture operates more similarly to ours; however, the past cell state of a parent node is defined as a simple sum of all children states, and subsequently, traditional LSTM updates are used. Our approach is the most faithful to the original Tree-LSTM updates, which have been studied before in many applications [7, 8, 22, 34, 40]. Furthermore, neither of the DAG-based approaches addresses the inherent problem of state explosion as described in Section 3.2.
We compare the performance of our proposed model with four baseline architectures that were employed in prior works [4, 21, 27]. The first baseline model uses CNNs for both utterance and context representation. The second baseline model uses BiLSTMs for utterance representation and LSTMs for context representation. The third model employs CNNs for utterance representation and LSTMs for context representation. The last baseline model uses BiLSTMs for utterance representation and has no context representation. Finally, our model uses BiLSTMs for utterance representation and DAG-LSTMs for context representation. We explicitly chose not to use BiLSTMs for context representation because such architectures are not viable for live systems. We evaluated all five models on the STAC corpus.
Table 1: Frequency of dialog acts
Data. The STAC corpus [2] contains conversations from an online version of the game The Settlers of Catan, where trade negotiations were carried out in a chat interface. The data contains over 11,000 utterances from 41 games annotated for various tasks such as anaphoric relations, discourse units, and dialog acts. For our experimental work, we only used the dialog act annotations. The corpus had six different DAs, but one of those acts, named Preference, had very low prevalence (only 8 utterances). Therefore, we excluded it from our experimental work. Table 1 shows utterance counts for all six DAs.
We split the data randomly into three groups: train (29 games, 8250 utterances), dev (4 games, 851 utterances), and test (8 games, 2329 utterances). The utterances were tokenized using the Stanford PTBTokenizer [23] and the tokens are represented using GloVe embeddings [25].
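A game-level random split with the sizes reported above could look like the following sketch. The integer game identifiers and the seed are hypothetical; the text does not specify how the random split was drawn.

```python
import random


def split_games(game_ids, n_train=29, n_dev=4, n_test=8, seed=0):
    """Randomly partition whole games so that no game spans two splits."""
    game_ids = list(game_ids)
    if n_train + n_dev + n_test != len(game_ids):
        raise ValueError("split sizes must add up to the number of games")
    random.Random(seed).shuffle(game_ids)
    return (game_ids[:n_train],
            game_ids[n_train:n_train + n_dev],
            game_ids[n_train + n_dev:])


train, dev, test = split_games(range(41))
```

Splitting by game rather than by utterance keeps every conversation, and hence every skip connection, inside a single split.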
Setting. We use the Adam optimizer in the stochastic gradient descent setting to train all models [19]. We use a patience value of 15 epochs, i.e. training is stopped after not observing an improvement for 15 epochs on the validation data, and train for a maximum of 300 epochs. We pick the best iteration based on the validation macro-F1 score. All five models have been hyperparameter-tuned on validation set macro-F1 using simple random search. We run a total of 100 experiments to evaluate random hyperparameter candidates based on the following distributions (whenever applicable to a particular architecture):
Learning rate ∼ 10^Uniform
Dropout rate ∼ Uniform(0, 0.5)
Word dropout rate ∼ Uniform(0, 0.3)
Word vector update mode ∼ Uniform{fixed, fine-tune}
#Units in utterance layer ∼ Uniform{50, 75, 100, 200}
#Units in conversation layer ∼ Uniform{50, 75, 100, 200}
#Filters in CNNs ∼ Uniform{50, 75, 100, 200}
Window size for CNNs ∼ Uniform{2, 3, 4}
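The random search and the patience rule can be sketched as follows. The exponent range of the learning rate distribution is not recoverable from the text, so it is a required argument here, and the value `(-4.0, -2.0)` in the usage lines is purely hypothetical. The dictionary keys and function names are likewise illustrative.

```python
import random


def sample_config(rng, lr_exponent_range):
    """Draw one hyperparameter candidate from the distributions listed above."""
    low, high = lr_exponent_range  # assumption: must be supplied by the caller
    return {
        "learning_rate": 10 ** rng.uniform(low, high),
        "dropout": rng.uniform(0.0, 0.5),
        "word_dropout": rng.uniform(0.0, 0.3),
        "word_vector_mode": rng.choice(["fixed", "fine-tune"]),
        "utterance_units": rng.choice([50, 75, 100, 200]),
        "conversation_units": rng.choice([50, 75, 100, 200]),
        "cnn_filters": rng.choice([50, 75, 100, 200]),
        "cnn_window": rng.choice([2, 3, 4]),
    }


def best_epoch_with_patience(val_scores, patience=15, max_epochs=300):
    """Scan per-epoch validation scores; stop after `patience` epochs without improvement."""
    best, best_epoch = float("-inf"), -1
    for epoch, score in enumerate(val_scores[:max_epochs]):
        if score > best:
            best, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best


rng = random.Random(0)
candidates = [sample_config(rng, (-4.0, -2.0)) for _ in range(100)]  # hypothetical range
stopped = best_epoch_with_patience([0.1, 0.5] + [0.4] * 20 + [0.9])
```

In the last line the late score of 0.9 is never reached, because training has already stopped 15 epochs after the best epoch so far.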
Table 2 summarizes the results in terms of F1 scores for individual classes, overall accuracy, and macro-F1 score. The BiLSTM + DAG-LSTM architecture achieves the best F1 score for four classes. The overall accuracy of 87.69% is 0.86 percentage points higher than that of the second best model (BiLSTM + LSTM). Likewise, the macro-F1 score of 75.78% is over 1 percentage point higher than that of the next best model.
The owners of the STAC corpus have presented results [6] from using CRFs for this problem on a preliminary version of the dataset which contained utterances from only 10 games. Their models are reported to have achieved 83% accuracy and 73% macro-F1 score. Though these numbers are not directly comparable with the results in Table 2, we present them here for complete context.
Table 2: F1 scores of various classes for different models.
Figure 2: Confusion matrices for (left) BiLSTM + LSTM and (right) BiLSTM + DAG-LSTM. Rows denote gold labels whereas columns denote predicted labels by the model.
Confusion matrices for BiLSTM + LSTM and BiLSTM + DAG-LSTM are shown in Figure 2. We see that BiLSTM + DAG-LSTM is less often confused when classifying Offer utterances; in particular, it misclassifies fewer of them as Counteroffer. This suggests that the additional context information provided by skip connections is helpful, since the utterances for Offer and Counteroffer are typically similar. We also see fewer Refusal utterances misclassified as Other.
To verify these effects, we present some example conversations with predicted outputs from BiLSTM + LSTM and BiLSTM + DAG-LSTM in Table 3, to showcase errors made by the different models. In these examples, we again observe that the architecture with DAG-LSTM has less confusion between Offer and Counteroffer. We also observe some utterances that both architectures failed to classify correctly: the second utterance in the second conversation and the second to last post in the last conversation.
In this paper, we introduced a new architecture (DAG-LSTM) that provides a systematic way to incorporate contextual information for DA classification. We evaluated the model on the STAC corpus and compared it with hierarchical LSTM and CNN models. Our experimental work shows that the DAG-LSTM achieves higher accuracy and macro-F1 scores than state-of-the-art baseline models (0.86 percentage points in accuracy and over 1 percentage point in macro-F1 above the next best model). In particular, the results suggest that information about the prior utterance made by a speaker is very useful in DA classification. This approach facilitates learning relevant context by skipping potentially irrelevant utterances from other speakers in a chat room. We can extend this idea to other types of context, such as the prior utterance from the same team when group membership is available, or the prior utterance from the same conversational thread. We propose to experiment with these different types of context in the future.
In this paper, we mainly focused on multi-party written conversations. Despite growing interest in dialog modeling, there are very few datasets with DA annotations in the written domain, especially in a group-chat setting, excluding transcribed versions of spoken conversations. To the best of our knowledge, there are only two such datasets: the STAC corpus and the NPS chat corpus. Unfortunately, we could not use the NPS chat corpus because of licensing issues. We expect more such datasets will be made available for research in the future because of the widespread usage of group chat applications.
Though our paper has focused only on DA classification, we expect the architecture presented here could be used to address many other aspects of dialog modeling such as emotion classification, sentiment analysis, and thread disentanglement. Likewise, we expect the methodology presented here could easily be extended to address DA classification in spoken conversations.
Table 3: Errors made by different models.
The authors would like to thank Heidi Johnson, the Community, Collaboration and Compliance Product team, and the Office of the CTO at Bloomberg for their support in the development of this paper.
[1] Jeremy Ang, Yang Liu, and Elizabeth Shriberg. 2005. Automatic dialog act segmentation and classification in multiparty meetings. In IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005., Vol. 1. IEEE, I–1061.
[2] Nicholas Asher, Julie Hunter, Mathieu Morey, Farah Benamara, and Stergos D. Afantenos. 2016. Discourse Structure and Dialogue Acts in Multiparty Dialogue: the STAC Corpus. In LREC.
[3] John Langshaw Austin. 1975. How to do things with words. Oxford university press.
[4] Lina M Rojas Barahona, Milica Gasic, Nikola Mrkšić, Pei-Hao Su, Stefan Ultes, Tsung-Hsien Wen, and Steve Young. 2016. Exploiting sentence and context representations in deep neural models for spoken language understanding. arXiv preprint arXiv:1610.04120 (2016).
[5] Harry Bunt, Jan Alexandersson, Jae-Woong Choe, Alex Chengyu Fang, Koiti Hasida, Volha Petukhova, Andrei Popescu-Belis, and David R Traum. 2012. ISO 24617-2: A semantically-based standard for dialogue annotation. In LREC. 430–437.
[6] Anais Cadilhac, Nicholas Asher, Farah Benamara, and Alex Lascarides. 2013. Grounding strategic conversation: Using negotiation dialogues to predict trades in a win-lose game. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. 357–368.
[7] Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, and Hui Jiang. 2016. Enhancing and combining sequential and tree lstm for natural language inference. arXiv preprint arXiv:1609.06038 (2016).
[8] Xinyun Chen, Chang Liu, and Dawn Song. 2018. Tree-to-tree neural networks for program translation. In Advances in Neural Information Processing Systems. 2547–2557.
[9] Xinchi Chen, Zhan Shi, Xipeng Qiu, and Xuanjing Huang. 2017. DAG-based Long Short-Term Memory for Neural Word Segmentation. arXiv preprint arXiv:1707.00248 (2017).
[10] Alfred Dielmann and Steve Renals. 2008. Recognition of dialogue acts in multi-party meetings using a switching DBN. IEEE transactions on audio, speech, and language processing 16, 7 (2008), 1303–1314.
[11] Raul Fernandez and Rosalind W Picard. 2002. Dialog act classification from prosodic features using support vector machines. In Speech Prosody 2002, International Conference.
[12] Eric N Forsyth and Craig H Martell. 2007. Lexical and discourse analysis of online chat dialog. In International Conference on Semantic Computing (ICSC 2007). IEEE, 19–26.
[13] Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18, 5-6 (2005), 602–610.
[14] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
[15] Daniel Jurafsky, Rebecca Bates, Noah Coccaro, Rachel Martin, Marie Meteer, Klaus Ries, Elizabeth Shriberg, Andreas Stolcke, Paul Taylor, and Carol Van Ess-Dykema. 1997. Automatic detection of discourse structure for speech recognition and understanding. In 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings. IEEE, 88–95.
[16] Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent Convolutional Neural Networks for Discourse Compositionality. In Proceedings of the Workshop on Continuous Vector Space Models and their Compositionality. 119–126.
[17] Su Nam Kim, Lawrence Cavedon, and Timothy Baldwin. 2010. Classifying dialogue acts in one-on-one live chats. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 862–871.
[18] Su Nam Kim, Lawrence Cavedon, and Timothy Baldwin. 2012. Classifying dialogue acts in multi-party live chats. In Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation. 463–472.
[19] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[20] Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 986–995.
[21] Yang Liu, Kun Han, Zhao Tan, and Yun Lei. 2017. Using context information for dialog act classification in DNN framework. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2170–2178.
[22] Jean Maillard, Stephen Clark, and Dani Yogatama. 2017. Jointly learning sentence embeddings and syntax with unsupervised tree-lstms. arXiv preprint arXiv:1705.09189 (2017).
[23] Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations. 55–60.
[24] Michael McTear, Zoraida Callejas, and David Griol. 2016. The conversational interface: Talking to smart devices. Springer.
[25] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 1532–1543.
[26] Eugénio Ribeiro, Ricardo Ribeiro, and David Martins de Matos. 2015. The influence of context on dialogue act recognition. arXiv preprint arXiv:1506.00839 (2015).
[27] Eugénio Ribeiro, Ricardo Ribeiro, and David Martins de Matos. 2018. Deep Dialog Act Recognition using Multiple Token, Segment, and Context Information Representations. arXiv preprint arXiv:1807.08587 (2018).
[28] John R Searle. 1969. Speech acts: An essay in the philosophy of language. Vol. 626. Cambridge university press.
[29] Iulian Vlad Serban, Ryan Lowe, Peter Henderson, Laurent Charlin, and Joelle Pineau. 2018. A survey of available corpora for building data-driven dialogue systems: The journal version. Dialogue & Discourse 9, 1 (2018), 1–49.
[30] Elizabeth Shriberg, Raj Dhillon, Sonali Bhagat, Jeremy Ang, and Hannah Carvey. 2004. The ICSI meeting recorder dialog act (MRDA) corpus. In Proceedings of the 5th SIGdial Workshop on Discourse and Dialogue at HLT-NAACL 2004.
[31] Vivek Kumar Rangarajan Sridhar, Srinivas Bangalore, and Shrikanth Narayanan. 2009. Combining lexical, syntactic and prosodic cues for improved online dialog act tagging. Computer Speech & Language 23, 4 (2009), 407–422.
[32] Andreas Stolcke, Klaus Ries, Noah Coccaro, Elizabeth Shriberg, Rebecca Bates, Daniel Jurafsky, Paul Taylor, Rachel Martin, Carol Van Ess-Dykema, and Marie Meteer. 2000. Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational linguistics 26, 3 (2000), 339–373.
[33] Dinoj Surendran and Gina-Anne Levow. 2006. Dialog act tagging with support vector machines and hidden Markov models. In Ninth International Conference on Spoken Language Processing.
[34] Kai Sheng Tai, Richard Socher, and Christopher D Manning. 2015. Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Vol. 1. 1556–1566.
[35] Tsung-Hsien Wen, Milica Gasic, Nikola Mrkšić, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Semantically Conditioned LSTM-based Natural Language Generation for Spoken Dialogue Systems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 1711–1721.
[36] Tianhao Wu, Faisal M Khan, Todd A Fisher, Lori A Shuler, and William M Pottenger. [n. d.]. Posting act tagging using transformation-based learning. In Foundations of data mining and knowledge discovery. Springer, 319–331.
[37] Steve Young, Milica Gašić, Blaise Thomson, and Jason D Williams. 2013. POMDP-based statistical spoken dialog systems: A review. Proc. IEEE 101, 5 (2013), 1160–1179.
[38] Tiancheng Zhao and Maxine Eskenazi. 2016. Towards End-to-End Learning for Dialog State Tracking and Management using Deep Reinforcement Learning. In 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue. 1.
[39] Xiaodan Zhu, Parinaz Sobhani, and Hongyu Guo. 2016. Dag-structured long short-term memory for semantic compositionality. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 917–926.
[40] Xiaodan Zhu, Parinaz Sobhani, and Hongyu Guo. 2015. Long short-term memory over recursive structures. In International Conference on Machine Learning. 1604–1612.