Matching two pieces of text is a common pattern in many natural language processing tasks. For example, in the textual entailment task, given a pair of premise and hypothesis sentences, the task is to classify them into one of three labels {entailment, contradiction, neutral} (Bowman et al. 2015). In paraphrase detection, a pair of sentences needs to be classified according to whether they are paraphrases of each other (Dolan and Brockett 2005). In the semantic relatedness task, a pair of sentences needs to be scored based on how closely related they are semantically. Other problems like question answering can also be reduced to textual matching by scoring each question-answer pair and picking the answer with the highest score (Wang, Hamza, and Florian 2017). In this paper we assume the texts are a pair of sentences $S_1$ and $S_2$.

There has been a large amount of work on building machine learning models to solve each of these specific problems or text matching in general. In recent years, neural network models have been able to achieve impressive performance on several benchmark datasets related to these problems. The neural models can be divided into roughly two categories. In the sentence encoder based models, each sentence is encoded into a fixed length distributed representation using a sequence encoder like a BiLSTM (Hochreiter and Schmidhuber 1997) acting on the embeddings of the words in the sentence. The two sentence representations are then composed into a single representation by using heuristic matching features like element-wise difference and element-wise product (Mou et al. 2016), which is then passed through a classification layer. These so-called Siamese architectures are simple, but do not take into account the dependencies between words in the two sentences.

The other category of neural models incorporates dependencies between words in the two sentences, typically by using an attention mechanism (Rocktäschel et al. 2016). The contextual representation of each word in $S_1$, obtained from the intermediate states of a BiLSTM for example, is composed with the representations of words in $S_2$ using attention and then compared. This produces a series of representations for words in $S_1$ dependent on the words in $S_2$, which can then be encoded further before being used for classification. Many of the best results in text matching are achieved by architectures that use some form of inter-sentence attention, e.g. the ESIM model for textual entailment (Chen et al. 2017), the BiMPM model for paraphrase detection (Wang, Hamza, and Florian 2017) and DIIN for both (Gong, Luo, and Zhang 2018). While more expressive, these models are quite complex with a large number of parameters. The question remains whether such complex models are absolutely necessary to achieve good performance in text matching problems. In fact, recent work in language modeling has shown that properly regularized vanilla LSTM networks can achieve results that are comparable to the state-of-the-art (Melis, Dyer, and Blunsom 2018) without the need for more complex architectures.
In this paper, we take a middle path and propose a simple Siamese architecture for text matching problems. Each sentence is encoded using a BiLSTM and the representations are composed by computing the element-wise absolute difference and product. Optionally, we also concatenate the original sentence encodings before passing the vector to the classification layer. As mentioned above, inter-sentence dependence information is crucial for good performance in text matching. To avoid the use of complex attention mechanisms, we augment the embeddings of the words to incorporate inter-sentence information. For each word $t$ in the first sentence $S_1$, we add a matching feature that indicates whether $t$ appears in the second sentence $S_2$ to the embedding of $t$. Similarly, for each word $t$ in $S_2$, we add a matching feature that indicates whether $t$ appears in $S_1$. Such matching features have been successfully used in neural models for information retrieval (Guo et al. 2016).

Figure 1: Schematic diagram of REGMAPR. The original sentence encodings are used for textual entailment and paraphrase detection and the exponential function is used for semantic relatedness.
While the matching feature provides important syntactic information to the model, it is too restrictive. If two words in $S_1$ and $S_2$ are not exactly the same but are semantically related, there is a good chance that this influences whether $S_1$ and $S_2$ are related through an entailment, paraphrase or semantic relationship. In fact, inter-sentence attention mechanisms try to capture some form of semantic dependency by using the contextual representations derived from a BiLSTM. We take a different approach. We use an external database of paraphrases or semantically related words (Pavlick et al. 2015) to capture dependence. For each word $t$ in $S_1$, we add a paraphrase feature to its embedding that indicates whether a paraphrase of $t$ appears in $S_2$. Similarly, for each word $t$ in $S_2$, we add a paraphrase feature that indicates whether a paraphrase of $t$ appears in $S_1$. The matching and the paraphrase features add only two dimensions to the embeddings of each word but capture important syntactic and semantic interaction between the words of the two sentences.
The importance of regularization in obtaining good generalization performance is a well established fact in deep learning. Several types of regularization specific to recurrent neural networks have been shown to improve performance of LSTM based models e.g. variational dropout (Gal and Ghahramani 2016) and DropConnect (Merity, Keskar, and Socher 2018). We use three types of regularization to train our models in order to achieve good generalization performance.
The base Siamese architecture augmented with the matching and paraphrase features that capture inter-sentence word interaction and regularization define our model – REGMAPR. We evaluate its performance on six benchmark datasets on textual entailment, paraphrase detection and semantic relatedness. Despite its simplicity, REGMAPR improves upon several existing models which either use complex inter-sentence attention mechanisms or a large number of handcrafted features across all the datasets. It achieves a new state-of-the-art on the SICK dataset for semantic relatedness and on the SNLI dataset for textual entailment among models that do not use inter-sentence attention.
We describe our model by starting from a basic Siamese architecture and augmenting it with additional features. The input to the model is a pair of sentences $S_1$ and $S_2$, with each word mapped to its corresponding distributed representation or word embedding. In this paper we use GloVe embeddings (Pennington, Socher, and Manning 2014). We denote the set of words of $S_i$ by $W_i$ for $i \in \{1, 2\}$.
BASE
The basic model uses a standard Siamese architecture. Each sentence is encoded into a single vector using a BiLSTM. As the encoder, we use a max-pooling of the intermediate states of the BiLSTM operating on the sentence. Our choice is inspired by the success of such an encoder in learning general sentence representations (Conneau et al. 2017). In our experiments, we tried other sentence encoders but a max-pooled BiLSTM consistently gave the best results. The encodings $h_1$ and $h_2$ of the two sentences are composed by concatenating the element-wise absolute difference and element-wise product with the original vectors to form the following feature vector for textual entailment and paraphrase detection.

\[ [\,|h_1 - h_2|;\ h_1 \odot h_2;\ h_1;\ h_2\,] \tag{1} \]

where ; denotes concatenation. Such matching features have been used successfully for textual entailment in the past (Mou et al. 2016). For semantic relatedness, we only use the absolute difference and product, as follows.

\[ [\,|h_1 - h_2|;\ h_1 \odot h_2\,] \tag{2} \]

This feature vector is passed through a fully connected layer, followed by ReLU activation, followed by a classification or scoring layer. For semantic relatedness, we produce a single number which is then passed through an exponential function and clamped at 1 to constrain it to the range [0, 1] (Mueller and Thyagarajan 2016).
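The composition step can be illustrated with a minimal pure-Python sketch. The function names (`max_pool`, `compose`, `relatedness_score`), the toy state vectors and the order in which the four blocks are concatenated are assumptions made for illustration; the BiLSTM itself and the fully connected layers are omitted.

```python
import math

def max_pool(states):
    """Element-wise max over a list of equal-length state vectors."""
    return [max(dim) for dim in zip(*states)]

def compose(h1, h2, include_encodings=True):
    """Concatenate |h1 - h2| and h1 * h2, optionally followed by h1 and h2.

    include_encodings=True mirrors the entailment/paraphrase feature vector,
    include_encodings=False mirrors the semantic relatedness one.
    """
    diff = [abs(a - b) for a, b in zip(h1, h2)]
    prod = [a * b for a, b in zip(h1, h2)]
    features = diff + prod
    if include_encodings:
        features += list(h1) + list(h2)
    return features

def relatedness_score(x):
    """Exponential of the raw output, clamped so the score never exceeds 1."""
    return min(math.exp(x), 1.0)

# Toy "BiLSTM states" for two sentences: 2 time steps, 3 dimensions each.
h1 = max_pool([[0.1, -0.4, 0.9], [0.5, 0.2, -0.3]])
h2 = max_pool([[0.5, 0.0, 0.1], [-0.2, 0.6, 0.9]])
entailment_features = compose(h1, h2)           # 4 blocks of 3 dimensions
relatedness_features = compose(h1, h2, False)   # 2 blocks of 3 dimensions
```

With encodings of dimension d, the first feature vector has 4d entries and the second 2d, which is the only difference between the two task heads in this sketch.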
BASE+REG
Although regularization is not strictly a part of the architecture, we emphasize its importance. Based on the work of (Merity, Keskar, and Socher 2018), we apply three types of regularization.
1. Locked Dropout after the word embedding layer. In this case, a single dropout mask, where each dimension is dropped with a fixed probability, is selected for a sentence and applied to all the words in the sentence. Also known as variational dropout, its effectiveness has been demonstrated in sequence processing using recurrent neural networks (Gal and Ghahramani 2016).
2. Dropout after the ReLU activation. This is the classical dropout proposed in (Srivastava et al. 2014).
3. Recurrent Dropout on the recurrent weights in the BiLSTM encoder. First proposed in (Wan et al. 2013), this regularization helps reduce overfitting of the hidden-to-hidden weight matrices in the LSTM.

Note that the three types of regularization have been chosen carefully to prevent overfitting in each of the main components of our model. As shown later in the paper, they help significantly in getting good generalization performance.
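To make the first item concrete, the following is a minimal sketch of locked dropout on a sentence represented as a list of word vectors. The function name `locked_dropout` and the rescaling of surviving dimensions by 1/(1 - p) (the usual "inverted dropout" convention) are assumptions, not details stated in the text.

```python
import random

def locked_dropout(embeddings, p, rng=random):
    """Apply one dropout mask, sampled once per sentence, to every word vector."""
    if not embeddings or p <= 0.0:
        return [list(vec) for vec in embeddings]
    dim = len(embeddings[0])
    keep = 1.0 - p
    # A single mask for the whole sentence; kept dimensions are scaled by 1/keep
    # (assumed convention) so the expected value of each dimension is unchanged.
    mask = [(1.0 / keep) if rng.random() < keep else 0.0 for _ in range(dim)]
    return [[x * m for x, m in zip(vec, mask)] for vec in embeddings]

# Three "words", each an 8-dimensional vector of ones.
sentence = [[1.0] * 8 for _ in range(3)]
dropped = locked_dropout(sentence, p=0.5, rng=random.Random(0))
```

Because the mask is shared, the same dimensions are zeroed for every word of the sentence, in contrast to classical dropout, which would sample a fresh mask per word.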
BASE+REG+MA
In this model, in addition to the regularization, we augment the word embeddings with a matching feature, denoted by MA henceforth. That is, for each word $t$ in $S_1$, we augment the embedding of $t$ as

\[ \mathrm{emb}(t) = [\mathrm{GloVe}(t);\ \mathbb{1}(t \in W_2)] \tag{3} \]

where $\mathbb{1}$ is the indicator function and $W_2$ is the set of words of $S_2$; words in $S_2$ are treated symmetrically. Note that this is a binary feature and provides basic syntactic information to the sentence encoder that the same word is present in both the sentences.
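As an illustration, the MA feature can be computed with a set lookup. This is a minimal sketch that assumes whitespace tokenization and exact, case-sensitive string matching; the function name `matching_feature` is hypothetical.

```python
def matching_feature(sentence, other):
    """Return 1.0 for each token of `sentence` that also occurs in `other`."""
    other_tokens = set(other)
    return [1.0 if tok in other_tokens else 0.0 for tok in sentence]

s1 = "a man is playing a guitar".split()
s2 = "a woman is playing the violin".split()
ma_s1 = matching_feature(s1, s2)  # one flag per token of s1
ma_s2 = matching_feature(s2, s1)  # one flag per token of s2
```

The feature is computed per token position, so a repeated word such as "a" receives the same flag at each of its occurrences.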
BASE+REG+PR
In this model, we augment the word embedding with information about the presence of semantically related words in $S_1$ and $S_2$. We use the paraphrase database (PPDB) (Pavlick et al. 2015) as the source dictionary of semantically related words. For each word $t$ in PPDB, we compute the following set

\[ P(t) = \{t' : t' \text{ is a paraphrase of } t \text{ in PPDB}\} \tag{4} \]

We augment the embedding of each word $t$ in $S_1$ using $P(t)$ as follows, where $W_2$ is the set of words of $S_2$; words in $S_2$ are treated symmetrically.

\[ \mathrm{emb}(t) = [\mathrm{GloVe}(t);\ \mathbb{1}(P(t) \cap W_2 \neq \emptyset)] \tag{5} \]

We call this the paraphrase feature or PR henceforth. This again is a binary feature and is easy to compute once P(t) has been precomputed. Depending on the criteria used for selecting the paraphrase database, this feature can be slightly noisy and yet provides valuable semantic information which is not easily obtainable from the surface forms or word embeddings directly. To the best of our knowledge, this is the first use of PPDB in neural models for text matching.
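A minimal sketch of the PR feature follows. The dictionary `toy_ppdb` is a hand-made stand-in for the precomputed sets P(t), not real PPDB content, and the function name `paraphrase_feature` is hypothetical.

```python
def paraphrase_feature(sentence, other, paraphrases):
    """Return 1.0 for each token whose paraphrase set P(t) intersects `other`."""
    other_tokens = set(other)
    return [
        1.0 if paraphrases.get(tok, set()) & other_tokens else 0.0
        for tok in sentence
    ]

# Toy stand-in for P(t); real entries would come from PPDB.
toy_ppdb = {
    "man": {"guy", "person"},
    "guy": {"man"},
    "guitar": {"instrument"},
    "instrument": {"guitar"},
}
s1 = "a man plays the guitar".split()
s2 = "a guy plays an instrument".split()
pr_s1 = paraphrase_feature(s1, s2, toy_ppdb)
pr_s2 = paraphrase_feature(s2, s1, toy_ppdb)
```

Words without an entry in the map simply get a zero flag, so out-of-vocabulary tokens need no special handling in this sketch.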
Table 1: Datasets and the sizes of the respective train, dev and test sets.
BASE+REG+MA+PR
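The full REGMAPR model combines regularization with the matching and paraphrase features. For a word $t$ in $S_1$, with $W_2$ the set of words of $S_2$, the word embedding then becomes

\[ \mathrm{emb}(t) = [\mathrm{GloVe}(t);\ \mathbb{1}(t \in W_2);\ \mathbb{1}(P(t) \cap W_2 \neq \emptyset)] \tag{6} \]

and symmetrically for words in $S_2$.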
Note that the full REGMAPR model increases the word embedding by only two dimensions and uses a Siamese architecture for matching the representations of the two sentences. We completely avoid inter-sentence attention mechanisms and instead encode the inter-sentence interaction using the very simple MA and PR features. Crucially, the MA and PR features provide important syntactic and semantic clues that the BiLSTM can exploit. A schematic diagram of the model is shown in Fig. 1. REGMAPR is a general architecture which can be applied to any text matching problem. As we show in the next sections, despite its simplicity, it is highly effective in achieving results comparable to or better than more complex models.
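The two-dimensional increase in the embedding size can be seen in a short sketch that builds the augmented input rows for one sentence. The 2-dimensional `toy_vectors` stand in for 300-dimensional GloVe vectors, `toy_paraphrases` stands in for PPDB, and the use of a zero vector for unknown words is an assumption; all names are hypothetical.

```python
def augment_embeddings(sentence, other, vectors, paraphrases):
    """Append the MA and PR flags to the word vector of each token."""
    other_tokens = set(other)
    dim = len(next(iter(vectors.values())))
    rows = []
    for tok in sentence:
        base = vectors.get(tok, [0.0] * dim)  # assumption: zeros for unknown words
        ma = 1.0 if tok in other_tokens else 0.0
        pr = 1.0 if paraphrases.get(tok, set()) & other_tokens else 0.0
        rows.append(list(base) + [ma, pr])
    return rows

# Toy stand-ins for GloVe and PPDB.
toy_vectors = {"kids": [0.1, 0.3], "play": [0.7, -0.2], "children": [0.2, 0.4]}
toy_paraphrases = {"kids": {"children"}, "children": {"kids"}}
rows = augment_embeddings(
    ["kids", "play"], ["children", "play"], toy_vectors, toy_paraphrases
)
```

Each row has the original dimension plus two, and these rows are what the sentence encoder would consume in place of the plain word vectors.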
Datasets We evaluate our models on six diverse datasets related to three tasks: textual entailment, paraphrase detection and semantic relatedness. For textual entailment, we use the SNLI dataset (Bowman et al. 2015) and the SICK-E dataset (Marelli et al. 2014). Each sentence pair in SNLI and SICK-E has a label from the set {entailment, contradiction, neutral}. For paraphrase detection we use the MSRP (Dolan and Brockett 2005) and QUORA (Iyer et al. 2017) datasets. Each sentence pair in these two datasets has a binary label indicating whether they are paraphrases of each other. For semantic relatedness we use the SICK (Marelli et al. 2014) and STS Benchmark (STSB) (Cer et al. 2017) datasets, in which each pair of sentences has a semantic relatedness score. The sizes of the train, dev and test sets for each of these datasets are shown in Table 1. To compute the PR feature, we use the lexical subset of the PPDB paraphrase dataset (specifically the ppdb xxl set). There are about 3.7 million pairs of words, each with an associated score, which we ignore. There are about 99.6K unique words, with more than 50% having fewer than 11 paraphrases. A histogram of the frequency of words according to the number of paraphrases is shown in Fig. 2.

Figure 2: Histogram of words based on the number of paraphrases in PPDB. Both axes are log scaled. The gap between the first two bars has been truncated for ease of visualization.
Training For all experiments, we set the LSTM hidden dimension and the dimension of the fully connected layer to 600. We use 300 dimensional GloVe (Pennington, Socher, and Manning 2014) word embeddings. The word embeddings are not updated during training. For the textual entailment and paraphrase detection tasks, a cross-entropy loss function is used, while for the semantic relatedness task the mean squared error (MSE) between the predicted and ground-truth score is used as the loss function. We optimize the weights of the network using Adam (Kingma and Ba 2015) with a learning rate of 1e-3, which is decayed by a factor of 0.5 when the validation performance drops. For each word in PPDB, we construct a one-to-many map representing its paraphrases. To create the PR feature for a word in a sentence, we look up this map and check whether any of its paraphrases occurs in the other sentence.
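The one-to-many paraphrase map and its lookup can be sketched as follows. The input format, a list of `(word, paraphrase, score)` triples, is an assumption about how the PPDB pairs have been parsed, and the sketch keeps the direction of each pair as given rather than assuming the relation is symmetric.

```python
from collections import defaultdict

def build_paraphrase_map(pairs):
    """Group (word, paraphrase, score) triples into a word -> set-of-paraphrases map."""
    table = defaultdict(set)
    for word, paraphrase, _score in pairs:  # the score is ignored, as in the text
        table[word].add(paraphrase)
    return dict(table)

def has_paraphrase_in(word, other_sentence, table):
    """True if any paraphrase of `word` occurs in `other_sentence`."""
    return not table.get(word, set()).isdisjoint(other_sentence)

# Toy triples; real entries would be read from the PPDB lexical file.
toy_pairs = [
    ("car", "automobile", 4.2),
    ("car", "vehicle", 3.1),
    ("buy", "purchase", 5.0),
]
table = build_paraphrase_map(toy_pairs)
found = has_paraphrase_in("car", ["an", "automobile"], table)
```

Building the map once makes each PR lookup a set intersection test against the tokens of the other sentence.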
Hyperparameter search for regularization is done over ranges of the locked dropout, dropout and recurrent dropout probabilities. For the SICK dataset, the relatedness scores are linearly scaled from [1, 5] to [0, 1]. For the STSB dataset, the scores are scaled from [0, 5] to [0, 1].
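The linear rescaling of the relatedness scores amounts to a one-line transformation. The sketch below uses hypothetical helper names (`scale_score`, `unscale_score`); the inverse mapping is included only to show how a prediction in [0, 1] could be mapped back to the original range.

```python
def scale_score(score, low, high):
    """Linearly map a score from [low, high] to [0, 1]."""
    return (score - low) / (high - low)

def unscale_score(scaled, low, high):
    """Inverse of scale_score: map a value in [0, 1] back to [low, high]."""
    return low + scaled * (high - low)

sick_scaled = scale_score(3.0, 1.0, 5.0)   # SICK scores lie in [1, 5]
stsb_scaled = scale_score(2.5, 0.0, 5.0)   # STSB scores lie in [0, 5]
```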
We report results for each of the models defined in Section 2 i.e. BASE, BASE+REG, BASE+REG+MA, BASE+REG+PR and BASE+REG+MA+PR for all the six datasets. In all the tables, the best among these is highlighted by bold fonts and the overall best by an underline.
Table 2: Accuracy results on SNLI test set. Previous results are for models without inter-sentence attention.
Table 3: Accuracy results on SICK-E test set.
Textual Entailment
The results of REGMAPR on SNLI are shown in Table 2. Regularization helps push the performance of the base model by 0.7%. Both word matching and paraphrase matching help further, but with a smaller boost of 0.2%. The combination of MA and PR improves model performance by a much larger 0.9% compared to the regularized model only. We compare our results with existing models that do not use inter-sentence attention. REGMAPR sets a new state-of-the-art accuracy of 86.8% for this class of models. Although REGMAPR is not strictly a sentence encoding based model, we do not use sophisticated attention mechanisms like those in ESIM (Chen et al. 2017).
For the SICK-E dataset, REGMAPR achieves 87.0% accuracy on the test set, on par with more complex models that use task specific inter-sentence attention mechanisms (Yin and Schütze 2017). Interestingly, the base model itself achieves 86.4% accuracy. This points to the fact that simple models with an appropriate number of parameters can sometimes achieve similar or better performance than more complex models, even without regularization. The gain from using regularization and the MA and PR features is modest, maxing out at 0.6%. One possible reason for this is the relatively small size of the dataset and the class skew in the training set (56% of the sentence pairs have neutral labels), as compared to the almost uniform class distribution in the SNLI training set.
Table 4: Accuracy and F1 results on MSRP test set.
Table 5: Accuracy results on the QUORA dev and test set.
Paraphrase Detection The results for the MSRP dataset are shown in Table 4. The trend here is also quite clear, with performance increasing as regularization, MA and PR features are added. REGMAPR achieves an accuracy of 79.1%, on par with the results obtained by (Filice, Martino, and Moschitti 2015), where the authors use, among other things, a combination of more than five handcrafted features (including the MA feature) in a non-neural model. Our model is surpassed only by (Ji and Eisenstein 2013), where the authors use a combination of a term frequency based model and 10 fine-grained features.
The results on the QUORA dataset are shown in Table 5. The full REGMAPR model achieves a test accuracy of 88.64%. This is better than the BiMPM model of (Wang, Hamza, and Florian 2017) and pt-DECATTchar model of (Tomar et al. 2017), both of which use attention mechanisms and, in the latter case, heavy data augmentation. For both the paraphrase datasets, the combination of MA and PR features works the best, reflecting their complementary strengths.
We emphasize the good performance of REGMAPR on two of the largest datasets (SNLI and QUORA) considered in this paper. More complex models like DIIN (Gong, Luo, and Zhang 2018) do eventually perform better on both but at the cost of significantly increased model complexity and training time.
Semantic Relatedness For the SICK dataset, REGMAPR sets a new state-of-the-art performance of 0.8864 Pearson correlation, without using any of the additional features or post-processing used by (Mueller and Thyagarajan 2016). In fact, their base Siamese model built using an LSTM achieves a Pearson correlation which is about 0.07 less than that of our base model, which achieves 0.8842 Pearson correlation. This points to the significance of using a good sentence encoder. The improvements obtained from the MA and PR features are significantly smaller than the improvements seen in textual entailment and paraphrase detection. This can be partially explained by the fact that capturing semantic relatedness is a more complex problem.

Table 6: Pearson correlation (r), Spearman correlation ($\rho$) and MSE w.r.t. the ground-truth scores on SICK test set.
Table 7: Dev and test Pearson correlation (r) w.r.t. the ground-truth scores on STSB test set.
The results on the STS Benchmark are shown in Table 7. Here again, the full REGMAPR model performs better than previous neural models like (Shao 2017) by almost 1.1% in Pearson correlation on the test set. Compared to REGMAPR, better performing models like ECNU (Tian et al. 2017) use a large number of handcrafted features (66 to be precise) and (Yang et al. 2018) use transfer learning after training on much larger datasets.
In this section, we analyze the results presented in the previous section by comparing the contribution of each of the main components of REGMAPR. In Fig. 3, we summarize the gains in performance due to each component of REGMAPR over the base model for all the six datasets. Some trends can be ascertained from the plot. In all the cases, the MA feature helps equally or more than the PR feature. The combination of MA and PR consistently produces the highest gains, which justifies our choice of modeling inter-sentence word interaction using both of these features.

Figure 3: Gains in test set performance by using REG, MA and PR over the BASE model. We use Pearson correlation for SICK and STSB and accuracy percentage for others.
To further illustrate the effect of the two features, we investigate the correlation of the presence of these features with the class labels or scores. For each of the datasets, we partition the training set into two classes: positive (P) and negative (N). For MSRP and QUORA, pairs of sentences that are paraphrases are positive and pairs that are not paraphrases are negative. For SNLI and SICK-E, pairs of sentences that have the entailment label are positive and those that have the contradiction label are negative. For SICK and STSB, we compute the average semantic relatedness score in the training set and mark all pairs that have a score greater than or equal to the average as positive and the remaining as negative. Next, over all the pairs in P, we compute the proportion of words that have the MA feature on as follows.
\[ p^{P}_{\mathrm{MA}} = \frac{\sum_{(S_1, S_2) \in P} \sum_{i \in \{1, 2\}} |\{t \in S_i : \text{MA is on for } t\}|}{\sum_{(S_1, S_2) \in P} (|S_1| + |S_2|)} \tag{7} \]

Similarly, we compute the proportions $p^{P}_{\mathrm{PR}}$ and $p^{P}_{\mathrm{MA+PR}}$, and the corresponding proportions over the pairs in N. For MA+PR, we count the words that have both the features on. These proportions may be interpreted as estimates of how likely a word occurring in a sentence pair with a positive (or negative) label is to have the MA, PR or both features on. Finally, to estimate the predictive power of the features, we condense the proportions for the positive and negative classes into one ratio per feature; the ratios for PR and MA+PR are defined in the same way as the ratio for MA. A higher positive value of this ratio means that it is more likely that the corresponding feature is present in words occurring in sentences in P as compared to N and hence can have higher predictive power.

Figure 4: Estimate of predictive power of MA and PR (legend: X=PR, X=MA, X=MA+PR).
Figure 5: Gains in test set performance by using REGularization over BASE, BASE+MA, BASE+PR and BASE+MA+PR.
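The proportion in Eq. (7) can be illustrated with a short sketch that counts, over a set of sentence pairs, the fraction of word occurrences with a given feature on. Counting the words of both sentences of each pair is an assumption about how the proportion is normalized, and the helper names are hypothetical; the final ratio between the positive and negative classes is not reproduced here.

```python
def matching_feature(sentence, other):
    """MA flags: 1.0 for each token of `sentence` that also occurs in `other`."""
    other_tokens = set(other)
    return [1.0 if tok in other_tokens else 0.0 for tok in sentence]

def feature_proportion(pairs, feature):
    """Fraction of word occurrences, over both sentences of every pair, with the feature on."""
    on = 0
    total = 0
    for s1, s2 in pairs:
        for sentence, other in ((s1, s2), (s2, s1)):
            flags = feature(sentence, other)
            on += sum(1 for flag in flags if flag)
            total += len(sentence)
    return on / total if total else 0.0

# Toy positive and negative partitions of a training set.
positive_pairs = [("a b c".split(), "a b d".split())]
negative_pairs = [("a b c".split(), "d e f".split())]
p_ma_pos = feature_proportion(positive_pairs, matching_feature)
p_ma_neg = feature_proportion(negative_pairs, matching_feature)
```

Any feature function with the same `(sentence, other)` signature, such as a PR lookup, can be passed in place of `matching_feature` to obtain the corresponding proportion.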
We plot the values of the ratios for PR, MA and MA+PR for each of the six datasets in Fig. 4. Except for the SICK-E dataset, there is a consistent increase as we move from the PR ratio, to the MA ratio and finally to the MA+PR ratio. This partially explains the relative gains shown in Fig. 3. Since we used the PPDB database without any filtering, there may be noisy paraphrases resulting in values of the PR ratio that are close to zero for some datasets.
Finally, to evaluate the effect of regularization (REG), we plot the gains obtained by including it in the presence or absence of the MA and PR features in Fig. 5. It is clear that proper regularization helps a great deal in improving generalization and hence test set performance. The trends across the features here are less clear, although regularization tends to help most when only the PR feature is used.
Text matching is a general problem in NLP and the three specific tasks considered in this paper have a long history of their own. For textual entailment, most of the state-of-the-art architectures for the SNLI dataset use inter-sentence attention mechanisms, canonical examples being ESIM (Chen et al. 2017) and BiMPM (Wang, Hamza, and Florian 2017). The attention in these models is typically computed using the intermediate states of a BiLSTM. On the other hand, we encode the word interaction features in the word embeddings and are not parameter free. Purely Siamese architectures have been used in the past for textual entailment (Mou et al. 2016), but they have lower accuracy. For paraphrase detection, many of the best models for the MSRP dataset use a combination of handcrafted features in a non-neural setting. These include KL-Divergence based features in (Ji and Eisenstein 2013) and syntactic and semantic matching features in (Filice, Martino, and Moschitti 2015). The latter work uses at least five features, including the MA feature. REGMAPR uses fewer and simpler features and incorporates them in a neural model to achieve comparable performance. On the larger QUORA dataset, most existing models use inter-sentence attention mechanisms. Previous work on semantic relatedness has centered largely on combining neural models with handcrafted features. The state-of-the-art model for STSB (Tian et al. 2017) combines a total of 66 features with a neural model. Our model achieves comparable performance with far less complexity. The use of the exponential function for semantic relatedness scoring was first introduced by (Mueller and Thyagarajan 2016), who apply it directly to the L1 distance between the encodings of the two sentences. Our model uses a more complex matching function followed by fully connected layers before applying the exponential function.
In a related work, the authors in (Tymoshenko and Moschitti 2015) devise a number of syntactic and semantic matching features for the answer passage reranking task from information retrieval in a non-neural setting. The syntactic features include the MA feature and the semantic matching features are derived from external resources like YAGO, DBPedia and Wikipedia. In our model, we use PPDB as the only external resource of semantic matching information and let the neural network learn the features that are most suitable for a particular task. Finally, there has been some work on exploring general architectures for text matching problems. The work of (Wang and Jiang 2017a) explores various techniques for estimating word interaction through an attention mechanism. Similar approaches were explored in (He and Lin 2016), (Parikh et al. 2016) and (Wang and Jiang 2017b). Among more recent models, (Gong, Luo, and Zhang 2018) and (Kim et al. 2018) have achieved state-of-the-art performance on the SNLI and QUORA datasets using multi-layered and highly complex attention mechanisms.
The use of dropout regularization for better generalization is a well established principle in deep learning. Our work adopts a subset of the suite of regularizations successfully used in language modeling by (Merity, Keskar, and Socher 2018). Of particular significance is the use of recurrent dropout or DropConnect (Wan et al. 2013) for the weights of the BiLSTM. To the best of our knowledge, ours is the first work to explore its use in text matching problems.
In this paper, we propose REGMAPR – a neural model for text matching that incorporates simple word interaction features in a Siamese architecture and train it with three different types of regularization. Our model performs comparably or better than many existing models that use complex inter-sentence attention mechanisms or many handcrafted features on six diverse datasets. In future work, we plan to explore further ways to infuse inter-sentence semantic information in the word embeddings and matching heuristics over multi-layered sentence representations.
[Bowman et al. 2015] Bowman, S. R.; Angeli, G.; Potts, C.; and Manning, C. D. 2015. A large annotated corpus for learning natural language inference. In EMNLP.
[Cer et al. 2017] Cer, D. M.; Diab, M. T.; Agirre, E.; Lopez-Gazpio, I.; and Specia, L. 2017. Semeval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In SemEval@ACL.
[Chen et al. 2017] Chen, Q.; Zhu, X.; Ling, Z.; Wei, S.; and Jiang, H. 2017. Enhancing and combining sequential and tree LSTM for natural language inference. In ACL.
[Chen, Ling, and Zhu 2018] Chen, Q.; Ling, Z.-H.; and Zhu, X.-D. 2018. Enhancing sentence embedding with generalized pooling. In COLING.
[Conneau et al. 2017] Conneau, A.; Kiela, D.; Schwenk, H.; Barrault, L.; and Bordes, A. 2017. Supervised learning of universal sentence representations from natural language inference data. In EMNLP.
[Dolan and Brockett 2005] Dolan, W. B., and Brockett, C. 2005. Automatically constructing a corpus of sentential paraphrases. In IWP@IJCNLP.
[Filice, Martino, and Moschitti 2015] Filice, S.; Martino, G. D. S.; and Moschitti, A. 2015. Structural representations for learning relations between pairs of texts. In ACL.
[Gal and Ghahramani 2016] Gal, Y., and Ghahramani, Z. 2016. A theoretically grounded application of dropout in recurrent neural networks. In NIPS.
[Gong, Luo, and Zhang 2018] Gong, Y.; Luo, H.; and Zhang, J. 2018. Natural language inference over interaction space. In ICLR.
[Guo et al. 2016] Guo, J.; Fan, Y.; Ai, Q.; and Croft, W. B. 2016. A deep relevance matching model for ad-hoc retrieval. In CIKM.
[He and Lin 2016] He, H., and Lin, J. J. 2016. Pairwise word interaction modeling with deep neural networks for semantic similarity measurement. In HLT-NAACL.
[He, Gimpel, and Lin 2015] He, H.; Gimpel, K.; and Lin, J. J. 2015. Multi-perspective sentence similarity modeling with convolutional neural networks. In EMNLP.
[Hochreiter and Schmidhuber 1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
[Im and Cho 2017] Im, J., and Cho, S. 2017. Distance-based self-attention network for natural language inference. CoRR abs/1712.02047.
[Iyer et al. 2017] Iyer, S.; Dandekar, N.; and Csernai, K. 2017. First quora dataset release: Question pairs. https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs.
[Ji and Eisenstein 2013] Ji, Y., and Eisenstein, J. 2013. Discriminative improvements to distributional sentence similarity. In EMNLP.
[Kim et al. 2018] Kim, S.; Hong, J.-H.; Kang, I.; and Kwak, N. 2018. Semantic sentence matching with densely-connected recurrent and co-attentive information. CoRR abs/1805.11360.
[Kingma and Ba 2015] Kingma, D. P., and Ba, J. 2015. Adam: A method for stochastic optimization. In ICLR.
[Lai and Hockenmaier 2014] Lai, A., and Hockenmaier, J. 2014. Illinois-lh: A denotational and distributional approach to semantics. In SemEval@COLING.
[Madnani, Tetreault, and Chodorow 2012] Madnani, N.; Tetreault, J. R.; and Chodorow, M. 2012. Re-examining machine translation metrics for paraphrase identification. In HLT-NAACL.
[Marelli et al. 2014] Marelli, M.; Menini, S.; Baroni, M.; Bentivogli, L.; Bernardi, R.; and Zamparelli, R. 2014. A sick cure for the evaluation of compositional distributional semantic models. In LREC.
[Melis, Dyer, and Blunsom 2018] Melis, G.; Dyer, C.; and Blunsom, P. 2018. On the state of the art of evaluation in neural language models. In ICLR.
[Merity, Keskar, and Socher 2018] Merity, S.; Keskar, N. S.; and Socher, R. 2018. Regularizing and optimizing LSTM language models. In ICLR.
[Mou et al. 2016] Mou, L.; Men, R.; Li, G.; Xu, Y.; Zhang, L.; Yan, R.; and Jin, Z. 2016. Natural language inference by tree-based convolution and heuristic matching. In ACL.
[Mueller and Thyagarajan 2016] Mueller, J., and Thyagarajan, A. 2016. Siamese recurrent architectures for learning sentence similarity. In AAAI.
[Nie and Bansal 2017] Nie, Y., and Bansal, M. 2017. Shortcut-stacked sentence encoders for multi-domain inference. In RepEval@EMNLP.
[Parikh et al. 2016] Parikh, A. P.; Täckström, O.; Das, D.; and Uszkoreit, J. 2016. A decomposable attention model for natural language inference. In EMNLP.
[Pavlick et al. 2015] Pavlick, E.; Rastogi, P.; Ganitkevitch, J.; Durme, B. V.; and Callison-Burch, C. 2015. Ppdb 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. In ACL.
[Pennington, Socher, and Manning 2014] Pennington, J.; Socher, R.; and Manning, C. D. 2014. Glove: Global vectors for word representation. In EMNLP.
[Rocktäschel et al. 2016] Rocktäschel, T.; Grefenstette, E.; Hermann, K. M.; Kocisky, T.; and Blunsom, P. 2016. Reasoning about entailment with neural attention. In ICLR.
[Shao 2017] Shao, Y. 2017. Hcti at semeval-2017 task 1: Use convolutional neural network to evaluate semantic textual similarity. In SemEval@ACL.
[Shen et al. 2018] Shen, T.; Zhou, T.; Long, G.; Jiang, J.; Wang, S.; and Zhang, C. 2018. Reinforced self-attention network: a hybrid of hard and soft attention for sequence modeling. CoRR abs/1801.10296.
[Shen, Yang, and Deng 2017] Shen, G.; Yang, Y.; and Deng, Z.-H. 2017. Inter-weighted alignment network for sentence pair modeling. In EMNLP.
[Srivastava et al. 2014] Srivastava, N.; Hinton, G. E.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15:1929–1958.
[Tan et al. 2018] Tan, C.; Wei, F.; Wang, W.; Lv, W.; and Zhou, M. 2018. Multiway attention networks for modeling sentence pairs. In IJCAI.
[Tay, Tuan, and Hui 2018] Tay, Y.; Tuan, L. A.; and Hui, S. C. 2018. A compare-propagate architecture with alignment factorization for natural language inference.
[Tian et al. 2017] Tian, J.; Zhou, Z.; Lan, M.; and Wu, Y. 2017. Ecnu at semeval-2017 task 1: Leverage kernel-based traditional nlp features and neural networks to build a universal model for multilingual and cross-lingual semantic textual similarity. In SemEval@ACL.
[Tomar et al. 2017] Tomar, G. S.; Duque, T.; Täckström, O.; Uszkoreit, J.; and Das, D. 2017. Neural paraphrase identification of questions with noisy pretraining. In SWCN@EMNLP.
[Tymoshenko and Moschitti 2015] Tymoshenko, K., and Moschitti, A. 2015. Assessing the impact of syntactic and semantic structures for answer passages reranking. In CIKM.
[Wan et al. 2013] Wan, L.; Zeiler, M. D.; Zhang, S.; LeCun, Y.; and Fergus, R. 2013. Regularization of neural networks using dropconnect. In ICML.
[Wang and Jiang 2017a] Wang, S., and Jiang, J. 2017a. A compare-aggregate model for matching text sequences. In ICLR.
[Wang and Jiang 2017b] Wang, S., and Jiang, J. 2017b. Machine comprehension using match-lstm and answer pointer. In ICLR.
[Wang, Hamza, and Florian 2017] Wang, Z.; Hamza, W.; and Florian, R. 2017. Bilateral multi-perspective matching for natural language sentences. In IJCAI.
[Wu et al. 2017] Wu, H.; Huang, H.; Jian, P.; Guo, Y.; and Su, C. 2017. Bit at semeval-2017 task 1: Using semantic information space to evaluate semantic textual similarity. In SemEval@ACL.
[Yang et al. 2018] Yang, Y.; Yuan, S.; Cer, D.; yi Kong, S.; Constant, N.; Pilar, P.; Ge, H.; Sung, Y.-H.; Strope, B.; and Kurzweil, R. 2018. Learning semantic textual similarity from conversations. CoRR abs/1804.07754.
[Yin and Schütze 2017] Yin, W., and Schütze, H. 2017. Task-specific attentive pooling of phrase alignments contributes to sentence matching. In EACL.
[Yin et al. 2016] Yin, W.; Schütze, H.; Xiang, B.; and Zhou, B. 2016. ABCNN: Attention-based convolutional neural network for modeling sentence pairs. TACL 4:259–272.
[Zhou, Liu, and Pan 2016] Zhou, Y.; Liu, C.; and Pan, Y. 2016. Modelling sentence pairs with tree-structured attentive encoder. In COLING.