Depth-Adaptive Graph Recurrent Network for Text Classification

2020·Arxiv

Abstract


The Sentence-State LSTM (S-LSTM) (Zhang et al., 2018) is a powerful and highly efficient graph recurrent network, which views words as nodes and performs layer-wise recurrent steps between them simultaneously. Despite its success in text representation, the S-LSTM still suffers from two drawbacks. First, given a sentence, certain words are usually more ambiguous than others, so more computation steps should be taken for these difficult words and fewer for easy ones. However, the S-LSTM takes a fixed number of computation steps for all words, irrespective of their hardness. Second, it lacks sequential information (e.g., word order) that is inherently important for natural language. In this paper, we address these issues and propose a depth-adaptive mechanism for the S-LSTM, which allows the model to learn how many computational steps to conduct for different words as required. In addition, we integrate an extra RNN layer to inject sequential information, which also serves as an input feature for the decision of adaptive depths. Results on the classic text classification task (24 datasets of various sizes and domains) show that our model brings significant improvements over the conventional S-LSTM and other high-performance models (e.g., the Transformer), while achieving a good accuracy-speed trade-off.

1 Introduction

Recent advances in graph recurrent networks (GRNs) have shown impressive performance in many tasks, including sequence modeling (Zhang et al., 2018), sentence ordering (Yin et al., 2019), machine translation (Beck et al., 2018; Guo et al., 2019b), and spoken language understanding (Liu et al., 2019). Among these neural networks, the representative S-LSTM has drawn great attention for its high efficiency and strong representation capabilities. More specifically, it views a sentence as a graph of word nodes, and performs layer-wise recurrent steps between words simultaneously, rather than reading a sequence of words incrementally (as an RNN does). Besides the local state for each individual word, the S-LSTM preserves a shared global state for the overall sentence. Both local and global states are enriched incrementally by exchanging information with each other. The process of recurrent state transition in the S-LSTM is shown in Figure 1.

Figure 1: Process of recurrent state transition in the S-LSTM. Given an input sentence with n words, in each layer, each word takes information from its predecessor, its successor, the global node g and itself to update its hidden state (solid lines). Meanwhile, the global node g takes all local states and its own state from the previous layer as context vectors to update the global state (dashed line). Both update operations take place simultaneously, and layer-wise parameters are shared.

In spite of its successes, the S-LSTM still has several limitations. For example, given a sentence, certain words are usually more ambiguous than others. Consider the example ‘The film was awesome ...’: whether ‘awesome’ means thrilling or excellent is unclear, so more context should be taken into account and more layers of abstraction are necessary to refine feature representations. One possible solution is to simply train very deep networks over all word positions, irrespective of their hardness, which is exactly what the conventional S-LSTM does. However, in terms of both computational efficiency and ease of learning, it is preferable to allow the model itself to ‘ponder’ and ‘determine’ how many steps of computation to take at each position (Graves, 2016; Dehghani et al., 2019). In this paper, we focus on addressing the above issue in the S-LSTM, and propose a depth-adaptive mechanism that enables the model to adapt depths as required. Specifically, at each word position, the executed depth is first determined by a specific classifier with corresponding input features, and the model then iteratively refines the representation until reaching that executed depth. We also investigate different strategies to obtain the depth distribution, and further endow the model with a depth-specific view through a novel depth embedding.

Additionally, the parallel nature of the S-LSTM makes it inherently weak at modeling sequential information (e.g., word order), which has been shown to be a highly useful complement to non-recurrent models (Chen et al., 2018; Wang et al., 2019). We investigate different ways to integrate the RNN’s inductive bias into our model. Empirically, our experiments indicate that this inductive bias matters greatly for text representations. Meanwhile, the informative representations emitted by the RNN serve as input features to calculate the executed depth in our depth-adaptive mechanism.

To evaluate the effectiveness and efficiency of our proposed model, we conduct extensive experiments on the text classification task with 24 datasets of various sizes and domains. Results on all datasets show that our model significantly outperforms the conventional S-LSTM and other strong baselines (e.g., stacked Bi-LSTMs, the Transformer) while achieving a good accuracy-speed trade-off. Additionally, our model achieves state-of-the-art performance on 16 of the 24 datasets. Our main contributions are as follows:

• We are the first to investigate a depth-adaptive mechanism for graph recurrent networks, and significantly boost the performance of the representative S-LSTM model.

• We empirically verify the effectiveness and necessity of the recurrent inductive bias for the S-LSTM.

• Our model consistently outperforms strong baseline models and achieves state-of-the-art performance on 16 of the 24 datasets.

• We conduct a thorough analysis to offer more insights and elucidate the properties of our model. In particular, our depth-adaptive model achieves a good accuracy-speed trade-off when compared with full-depth models.

2 Background

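Formally, in the l-th layer of the S-LSTM, hidden states and cell states can be denoted by:

$$H^l = \{h_0^l, h_1^l, \ldots, h_{n+1}^l, g^l\}, \quad C^l = \{c_0^l, c_1^l, \ldots, c_{n+1}^l, c_g^l\} \tag{1}$$

where $h_i^l$ ($i \in [0, n+1]$) is the hidden state for the i-th word, and $g^l$ is the hidden state for the entire sentence. The cell states $c_i^l$ and $c_g^l$ are defined similarly. Note that n is the number of words in a sentence, and the 0-th and (n+1)-th words are padding signals.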

As shown in Figure 1, the state transition from layer $l-1$ to layer $l$ consists of two parts: (1) the word-level transition from $h_i^{l-1}$ to $h_i^l$; (2) the sentence-level transition from $g^{l-1}$ to $g^l$. The former process is computed as follows:

Figure 2: Overview of our proposed model (left part), whose executed depth varies across word positions. The dashed nodes indicate that their states are directly copied from lower layers without computation. In addition, we introduce a Bi-RNN (orange squares in the right part) at the bottom layer for two purposes: (1) providing sequential information for upper modules, and (2) serving as input features for the calculation of executed depths.

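where $\xi_i^l$ is the concatenation of hidden states in a window, and $l_i^l$, $r_i^l$, $f_i^l$ and $s_i^l$ are forget gates for the left ($c_{i-1}^{l-1}$), right ($c_{i+1}^{l-1}$), corresponding ($c_i^{l-1}$) and sentence-level ($c_g^{l-1}$) cell states, respectively. $i_i^l$ and $o_i^l$ are the input and output gates. The values of all gates are normalised such that they sum to 1. $W_*$, $U_*$, $V_*$ and $b_*$ are model parameters.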
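For reference, the word-level transition described above can be written out following the S-LSTM formulation of Zhang et al. (2018). The block below is a sketch under assumptions rather than a verbatim equation: the gate names ($l$, $r$, $f$, $s$, $i$, $o$), the candidate state $u_i^l$, the input $x_i$, and the parameter names $W_k$, $U_k$, $V_k$, $b_k$ are assumed notation, and the softmax is taken over the four forget gates and the input gate as in Zhang et al. (2018). Note that the superscript $l$ is the layer index, while the base symbol $l$ names the left forget gate.

```latex
\begin{aligned}
\xi_i^{l} &= \left[ h_{i-1}^{l-1};\, h_i^{l-1};\, h_{i+1}^{l-1} \right] \\
\hat{k}_i^{l} &= \sigma\!\left( W_k \xi_i^{l} + U_k x_i + V_k g^{l-1} + b_k \right),
  \quad k \in \{l, r, f, s, i\} \\
o_i^{l} &= \sigma\!\left( W_o \xi_i^{l} + U_o x_i + V_o g^{l-1} + b_o \right) \\
u_i^{l} &= \tanh\!\left( W_u \xi_i^{l} + U_u x_i + V_u g^{l-1} + b_u \right) \\
l_i^{l}, r_i^{l}, f_i^{l}, s_i^{l}, i_i^{l} &=
  \operatorname{softmax}\!\left( \hat{l}_i^{l}, \hat{r}_i^{l}, \hat{f}_i^{l}, \hat{s}_i^{l}, \hat{i}_i^{l} \right) \\
c_i^{l} &= l_i^{l} \odot c_{i-1}^{l-1} + f_i^{l} \odot c_i^{l-1} + r_i^{l} \odot c_{i+1}^{l-1}
  + s_i^{l} \odot c_g^{l-1} + i_i^{l} \odot u_i^{l} \\
h_i^{l} &= o_i^{l} \odot \tanh\!\left( c_i^{l} \right)
\end{aligned}
```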

Then the sentence-level state transition is computed as follows:

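where $f_0^l, \ldots, f_{n+1}^l$ and $f_g^l$ are normalised gates for controlling $c_0^{l-1}, \ldots, c_{n+1}^{l-1}$ and $c_g^{l-1}$, respectively, $o^l$ is the output gate, and $W_*$, $U_*$ and $b_*$ are model parameters.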
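The sentence-level transition can likewise be sketched following Zhang et al. (2018). As before, this is an assumed reconstruction: the averaged state $\bar{h}$ and the parameter subscripts ($g$, $f$, $o$) are notation introduced here for illustration.

```latex
\begin{aligned}
\bar{h} &= \operatorname{avg}\!\left( h_0^{l-1}, h_1^{l-1}, \ldots, h_{n+1}^{l-1} \right) \\
\hat{f}_g^{l} &= \sigma\!\left( W_g g^{l-1} + U_g \bar{h} + b_g \right) \\
\hat{f}_i^{l} &= \sigma\!\left( W_f g^{l-1} + U_f h_i^{l-1} + b_f \right) \\
o^{l} &= \sigma\!\left( W_o g^{l-1} + U_o \bar{h} + b_o \right) \\
f_0^{l}, \ldots, f_{n+1}^{l}, f_g^{l} &=
  \operatorname{softmax}\!\left( \hat{f}_0^{l}, \ldots, \hat{f}_{n+1}^{l}, \hat{f}_g^{l} \right) \\
c_g^{l} &= f_g^{l} \odot c_g^{l-1} + \sum_{i=0}^{n+1} f_i^{l} \odot c_i^{l-1} \\
g^{l} &= o^{l} \odot \tanh\!\left( c_g^{l} \right)
\end{aligned}
```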

3 Model

As shown in the overview in Figure 2, our model conducts a dynamic number of steps at different positions, which makes it sparser than the conventional S-LSTM shown in Figure 1. We describe our model in more detail in the following sections.

3.1 Token Representation

Given an input sentence with n words, we first obtain word embeddings from a lookup table initialized with GloVe. Then we train character-level word embeddings from scratch with a Convolutional Neural Network (CNN) (Santos and Zadrozny, 2014). The GloVe and character-level embeddings are concatenated to form the final token representations.
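To illustrate the token representation, the sketch below concatenates a pretrained word vector with a character-level vector obtained by a width-3 convolution followed by max pooling. It is a minimal pure-Python illustration under assumptions, not the authors' implementation: the function names `char_cnn_embedding` and `token_representation`, the zero padding, and the toy dimensions are all hypothetical.

```python
def char_cnn_embedding(char_vecs, filters, width=3):
    """Max-over-time pooling of a 1-D convolution over a word's character vectors.

    char_vecs: list of character embeddings (each a list of floats).
    filters:   list of filters; each filter is `width` weight vectors.
    """
    dim = len(char_vecs[0])
    # Assumption: zero-pad both sides so that short words still yield windows.
    pad = [[0.0] * dim] * (width - 1)
    padded = pad + list(char_vecs) + pad
    pooled = []
    for filt in filters:
        best = float("-inf")
        for start in range(len(padded) - width + 1):
            window = padded[start:start + width]
            score = sum(
                w * c
                for w_row, c_row in zip(filt, window)
                for w, c in zip(w_row, c_row)
            )
            best = max(best, score)  # max pooling over window positions
        pooled.append(best)
    return pooled


def token_representation(word_vec, char_vecs, filters):
    """Concatenate the word-level and character-level embeddings."""
    return list(word_vec) + char_cnn_embedding(char_vecs, filters)


# Toy usage: a 2-d word vector, three 1-d character vectors, one filter.
toy_token = token_representation(
    [0.5, 0.5], [[1.0], [2.0], [3.0]], [[[1.0], [1.0], [1.0]]]
)
```

In the paper's setting the word vector would be 300-dimensional and the character-level vector 50-dimensional, as stated in Section 4.2.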

3.2 Sequential Module

As mentioned above, the conventional S-LSTM treats all positions identically and fails to utilize the order of an input sequence. We simply build a one-layer Bi-LSTM upon the word embedding layer to inject sequential information (right part in Figure 2). The forward and backward LSTMs have separate parameter sets. The output hidden states also serve as input features for the following depth-adaptive mechanism.

3.3 Depth-Adaptive Mechanism

In this section, we describe how to dynamically calculate the depth for each word, and use it to control the state transition process in our model. Specifically, for the i-th word ($i \in [1, n]$) in a sentence, its hidden state $x_i$ is fed to a fully connected feed-forward network (Vaswani et al., 2017) to calculate the logits $z_i$ of the depth distribution:

$$z_i = \max(0, x_i W^{(1)} + b_1) W^{(2)} + b_2 \tag{6}$$

where $W^{(1)}$ is a matrix that maps $x_i$ into an inner vector, $W^{(2)}$ is a matrix that maps the inner vector into an L-dimensional vector, and L denotes a predefined maximum number of layers. Then the probability $p_i^j$ of the j-th depth is computed by softmax:

$$p_i^j = \frac{\exp(z_i^j)}{\sum_{k=1}^{L} \exp(z_i^k)} \tag{7}$$
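The depth classifier described above can be sketched in a few lines. This is a minimal illustration under assumptions: the ReLU non-linearity follows the feed-forward network of Vaswani et al. (2017), and the function names, the random toy weights, and the small dimensions are hypothetical.

```python
import math
import random


def feed_forward_logits(x, W1, b1, W2, b2):
    """Two-layer feed-forward network: ReLU(W1 x + b1), then W2 (.) + b2."""
    inner = [
        max(0.0, sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(W1, b1)
    ]
    return [sum(w * v for w, v in zip(row, inner)) + b for row, b in zip(W2, b2)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy usage with random weights (assumed sizes; L = 9 as in Section 4.2).
rng = random.Random(0)
hidden_dim, inner_dim, max_layers = 4, 3, 9
W1 = [[rng.uniform(-1, 1) for _ in range(hidden_dim)] for _ in range(inner_dim)]
b1 = [0.0] * inner_dim
W2 = [[rng.uniform(-1, 1) for _ in range(inner_dim)] for _ in range(max_layers)]
b2 = [0.0] * max_layers
x_i = [rng.uniform(-1, 1) for _ in range(hidden_dim)]
depth_probs = softmax(feed_forward_logits(x_i, W1, b1, W2, b2))
```

The result `depth_probs` holds one probability per candidate depth and sums to one.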

In particular, we consider three ways to select the depth from this probability distribution.

Hard Selection: The most direct way is to choose the depth with the highest probability in the depth distribution given by Eq. (7), i.e., $d_i = \arg\max_j p_i^j$, where $p_i^j$ is the probability of the j-th depth for the i-th word.

Soft Selection: A smoother version of selection is to sum up all depths weighted by their corresponding probabilities. We floor the value considering the discrete nature of the depth distribution, i.e., $d_i = \lfloor \sum_j j \cdot p_i^j \rfloor$.
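The two deterministic strategies above can be sketched as follows. This is an illustrative sketch: the function names are hypothetical, and depths are assumed to be numbered from 1 to L.

```python
import math


def hard_depth(probs):
    """Hard selection: the depth with the highest probability (1-indexed)."""
    return max(range(len(probs)), key=lambda j: probs[j]) + 1


def soft_depth(probs):
    """Soft selection: floor of the probability-weighted sum of depths."""
    expected = sum((j + 1) * p for j, p in enumerate(probs))
    return math.floor(expected)


# Toy usage: a bimodal distribution over three depths.
probs = [0.5, 0.0, 0.5]
chosen_hard = hard_depth(probs)  # ties are resolved in favour of the shallower depth
chosen_soft = soft_depth(probs)
```

The toy distribution shows how the two strategies can disagree: hard selection picks one of the modes, while soft selection lands between them.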

Gumbel-Max Selection: To better simulate the discrete distribution and obtain a more robust depth selection, we use Gumbel-Max (Gumbel, 1954; Maddison et al., 2014), which provides an efficient and robust way to sample from a categorical distribution. Specifically, we add an independent Gumbel perturbation $\eta^j$ to each logit $z_i^j$ given by Eq. (6):

$$\eta^j = -\log(-\log u^j) \tag{10}$$

where $\eta^j$ is computed from a uniform random variable $u^j \sim U(0, 1)$. With $\tau$ denoting the temperature, samples from the perturbed distribution become one-hot as $\tau \to 0$, and become uniform as $\tau \to \infty$. After that, the exact depth is calculated by modifying Eq. (7) to:

$$p_i^j = \frac{\exp((z_i^j + \eta^j)/\tau)}{\sum_{k=1}^{L} \exp((z_i^k + \eta^k)/\tau)} \tag{11}$$

Empirically, we set $\tau$ to a tiny value, so the depth distribution calculated by Eq. (11) is in one-hot form. Note that Gumbel perturbations are merely used to select depths, and they do not affect the loss function for training.
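A minimal sketch of Gumbel-Max depth selection is given below. It assumes the standard Gumbel noise $-\log(-\log u)$ with $u$ drawn uniformly from (0, 1) and a temperature-scaled softmax; the function names and the clipping constant are hypothetical choices made for illustration.

```python
import math
import random


def sample_gumbel(rng):
    """Draw standard Gumbel noise from a uniform sample."""
    u = rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep both logarithms finite
    return -math.log(-math.log(u))


def gumbel_max_depth(logits, tau, rng):
    """Perturb logits with Gumbel noise, apply a softmax at temperature tau,
    and return the selected depth (1-indexed) with the perturbed distribution."""
    perturbed = [(z + sample_gumbel(rng)) / tau for z in logits]
    m = max(perturbed)
    exps = [math.exp(v - m) for v in perturbed]
    total = sum(exps)
    probs = [e / total for e in exps]
    depth = max(range(len(probs)), key=lambda j: probs[j]) + 1
    return depth, probs


# Toy usage: three candidate depths and a small temperature.
rng = random.Random(0)
depth, perturbed_probs = gumbel_max_depth([0.0, 1.0, 2.0], tau=0.01, rng=rng)
```

Because the arg max of the perturbed logits is distributed according to the softmax of the unperturbed logits, repeated calls sample depths in proportion to the original depth distribution.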

After acquiring the depth for each individual word, additional effort is needed to connect the depth with the corresponding steps of computation. Since our model has no access to explicit supervision for depth, we must inject some depth-specific information so that the model can learn this relevance. To this end, we maintain a trainable depth embedding whose parameters are shared with a weight matrix of the feed-forward network in Eq. (6). We also add a sinusoidal depth embedding to it, with a motivation similar to that of the positional encoding in the Transformer (Vaswani et al., 2017):

$$DE_{(d, 2j)} = \sin\left(d / 10000^{2j / d_{emb}}\right), \quad DE_{(d, 2j+1)} = \cos\left(d / 10000^{2j / d_{emb}}\right) \tag{12}$$

where d is the depth, $d_{emb}$ is the dimension of the depth embedding, and j is the index of the dimension. The final token representation described by Eq. (4) is thus refined by:
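The depth embedding can be sketched as below. The sinusoidal part follows the Transformer-style formula; combining it with the trainable embedding and the token representation by element-wise addition is an assumption made for illustration, as are the function names and the requirement that all three vectors share one dimension.

```python
import math


def sinusoidal_depth_embedding(depth, dim):
    """Transformer-style sinusoidal embedding of an integer depth."""
    emb = []
    for k in range(dim):
        j = k // 2
        angle = depth / (10000 ** (2 * j / dim))
        emb.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return emb


def add_depth_embedding(x_i, depth, trainable_table):
    """Add the trainable and sinusoidal depth embeddings to a token vector.

    trainable_table[d - 1] is the (hypothetical) trainable embedding of depth d.
    """
    sin_emb = sinusoidal_depth_embedding(depth, len(x_i))
    learned = trainable_table[depth - 1]
    return [x + s + t for x, s, t in zip(x_i, sin_emb, learned)]


# Toy usage: a 4-d token vector assigned depth 1, with an all-zero trainable table.
refined = add_depth_embedding([1.0, 1.0, 1.0, 1.0], 1, [[0.0] * 4])
```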

Then our model performs dynamic state transitions between words simultaneously. More specifically, once a word reaches its own executed depth $d_i$, it stops the state transition and simply copies its state to the next layer until all words stop or the predefined maximum layer L is reached. Formally, for the i-th word, its hidden state is updated as follows:

where $l$ refers to the index of the current layer, and $d_{max}$ is the maximum depth in the current sentence. Specifically, the initial hidden state is obtained by a linear transformation of the inner vector in Eq. (6), and the state transition function is the one given by Eq. (2). As the global state g is expected to encode the entire sentence, it conducts $d_{max}$ steps by default, following Eq. (3).

Table 1: Dataset statistics. ‘CV’ means that there was no standard train/test split and thus 10-fold CV was used. Amazon-16 contains 16 subsets of the same size, named Apparel, Baby, Books, Camera, DVD, Electronics, Health, IMDB, Kitchen, Magazines, MR, Music, Software, Sports, Toys and Video.
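The copy-on-halt update of Section 3.3 can be sketched as follows. The sketch treats the per-word transition as an opaque function `step_fn` (a stand-in for the S-LSTM word-level transition) and uses scalar states for brevity; these simplifications, as well as the function names, are assumptions for illustration only.

```python
def adaptive_state_transition(states, depths, step_fn, max_layers):
    """Run layer-wise updates; a word is updated while layer <= its depth,
    otherwise its state is copied unchanged to the next layer."""
    d_max = min(max(depths), max_layers)  # stop once every word has halted
    current = list(states)
    for layer in range(1, d_max + 1):
        previous = current
        # All words are updated simultaneously from the previous layer.
        current = [
            step_fn(previous, i) if layer <= depths[i] else previous[i]
            for i in range(len(previous))
        ]
    return current, d_max


def count_step(previous, i):
    """Toy transition that simply counts how many updates a word received."""
    return previous[i] + 1


# Toy usage: three words with executed depths 1, 3 and 2.
final_states, used_layers = adaptive_state_transition(
    [0, 0, 0], [1, 3, 2], count_step, max_layers=9
)
```

With the counting transition, each final state equals the number of updates that word received, which makes the skipped computation easy to inspect.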

3.4 Task-specific Settings

After the dynamic steps of computation among all nodes, we build task-specific models for the classification task. The output hidden states of the final layer are first reduced by max and mean pooling. We then concatenate these two reduced vectors with the global state to form the final feature vector v. After the ReLU activation, v is fed to a softmax classification layer. Formally, the above-mentioned procedures are computed as follows:

Here the output is the probability distribution over the label set, and the weight matrix and bias of the softmax layer are trainable parameters. Afterwards, the most probable label is chosen from the probability distribution given by Eq. (15), computed as:

For training, given the gold label of the i-th sample and the size |S| of the label set, the loss function is computed as the cross entropy:
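The classification head and the loss described in this section can be sketched as below. This is a minimal pure-Python illustration under assumptions: the function names, the order of concatenation (max-pooled, mean-pooled, global), and the toy weights are hypothetical, and the loss is shown for a single sample.

```python
import math


def pool_features(hidden_states, global_state):
    """Concatenate max-pooled, mean-pooled and global states, then apply ReLU."""
    dim = len(global_state)
    max_pool = [max(h[k] for h in hidden_states) for k in range(dim)]
    mean_pool = [sum(h[k] for h in hidden_states) / len(hidden_states) for k in range(dim)]
    v = max_pool + mean_pool + list(global_state)
    return [max(0.0, x) for x in v]


def classify(v, W, b):
    """Softmax classification layer; returns the distribution and the arg-max label."""
    logits = [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    label = max(range(len(probs)), key=lambda k: probs[k])
    return probs, label


def cross_entropy(probs, gold):
    """Negative log-likelihood of the gold label for one sample."""
    return -math.log(probs[gold])


# Toy usage: two words with 2-d states, a 2-d global state, two labels.
v = pool_features([[1.0, -2.0], [3.0, 0.0]], [0.5, -1.0])
W = [[1.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0] * 6]
b = [0.0, 0.0]
probs, label = classify(v, W, b)
loss = cross_entropy(probs, gold=0)
```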

4 Experiments

4.1 Task and Datasets

Text classification is a classic NLP task, which aims to assign a predefined category to free-text documents (Zhang et al., 2015), and is generally evaluated by accuracy. The number of categories may be two or more, corresponding to binary and fine-grained classification, respectively. We conduct extensive experiments on 24 popular datasets collected from diverse domains (e.g., sentiment, question), ranging from modestly sized to large-scale. The statistics of these datasets are listed in Table 1.

4.2 Implementation Details

We apply dropout (Srivastava et al., 2014) to word embeddings and hidden states with rates of 0.3 and 0.2, respectively. Models are optimized by the Adam optimizer (Kingma and Ba, 2014) with gradient clipping of 5 (Pascanu et al., 2013). The initial learning rate is set to 0.001 and decays as the number of training steps increases. For datasets without a standard train/test split, we adopt 10-fold cross validation. For datasets without a development set, we randomly sample 10% of the training samples as the development set. A one-layer CNN with a filter of size 3 and max pooling is utilized to generate 50d character-level word embeddings. The novel depth embedding is a trainable matrix in 50d. The cased 300d GloVe is adopted to initialize word embeddings, and is kept fixed during training. We conduct hyper-parameter tuning to find the proper value of the layer size L (finally set to 9), and empirically set the hidden size to 400, temperature

Table 2: Accuracy scores (%) on the Amazon-16 datasets. Some baselines are our own implementations with several recent advanced techniques (e.g., label smoothing) under the unified setting. We establish new state-of-the-art results on 12 of the 16 datasets, and outperform the existing highest average score (+0.7%).

Table 3: Accuracy scores (%) on modestly sized and large-scale datasets. Some baselines are our own implementations with several recent advanced techniques and analogous parameter sizes. Our model achieves new state-of-the-art results on 4 of 8 datasets under the same settings.

4.3 Main Results

Please note that currently popular pre-trained language models (e.g., BERT (Devlin et al., 2019), XLNet (Yang et al., 2019)) are not directly comparable with our work due to their huge additional corpora. We believe further improvements can be achieved by utilizing these orthogonal works.

Results on Amazon-16. The results on the 16 Amazon review datasets are shown in Table 2, where our model achieves state-of-the-art results on 12 datasets, and reports a new highest average score. The average score gains over 3-layer stacked Bi-LSTMs (+1.8%) and the S-LSTM (+1.3%) are also notable. Strong baselines such as the Star-Transformer (Guo et al., 2019a) and Recurrently Controlled Recurrent Networks (RCRN) (Tay et al., 2018) are also outperformed by our model.

Table 4: Ablation experiments on the IMDB test set. We run each model variant three times and report the mean and standard deviation. The relative-improvement columns denote the changes in accuracy and speed over ‘Ours’, respectively; e.g., ‘+6’ in the speed column denotes that the variant processes 6 more samples per second than ‘Ours’.

Figure 3: Accuracy and speed for full-depth (red lines) and adaptive-depth (blue lines) models on the IMDB test set, where the X-axis refers to the maximum number of layers L, and accuracy/speed are drawn by solid/dashed lines, respectively.

Results on larger benchmarks. From the results on larger corpora listed in Table 3, we also observe consistent and significant improvements over the conventional S-LSTM (+1.1%) and other strong baseline models (e.g., the Transformer (+2.9%), the Star-Transformer (+3.0%)). More notably, the superiority of our model over the baselines becomes more obvious as the corpus size grows. Given only training data and the ubiquitous word embeddings (GloVe), our model achieves state-of-the-art performance on the TREC, IMDB, AG's News and Yelp Full datasets, and comparable results on the other sets.

5 Analysis

We conduct analytical experiments on a modestly sized dataset (i.e., IMDB) to offer more insights and elucidate the properties of our model.

5.1 Compared with Full-depth Model

In our model, the depth varies dynamically at each word position, so it is natural to compare its performance with a full-depth model in terms of both accuracy and speed. For a fair comparison, we conduct two groups of experiments on the IMDB test set that differ only in whether the adaptive-depth mechanism is used. As shown in Figure 3, when the maximum number of layers L is small, the full-depth model consistently outperforms our depth-adaptive model, due to insufficient modeling in the lower layers. We also observe that the accuracy gap gradually decreases as the number of layers grows. For larger L, both models achieve nearly identical accuracy, but an evident advantage appears when we focus on speed. Concretely, the speed of the full-depth model decays almost linearly as the depth increases. However, our depth-adaptive model shows a flatter decrease as the depth increases. Specifically, at the 9-th layer, our model runs 3× faster than the full-depth model, which amounts to the speed of a full-depth model with 5 layers, namely with only half the parameters.

5.2 Ablation Experiments

We conduct ablation experiments to investigate the impacts of our depth-adaptive mechanism, of different strategies of depth selection, and of different ways to inject sequential information.

As listed in Table 4, the adaptive depth mechanism has a slight influence on accuracy, but matters greatly in terms of speed (row 1 vs. row 0), which is consistent with our observations in Section 5.1.

Results for injecting sequential information are shown in rows 2 to 4 of Table 4. Although the additional Bi-LSTM layer decreases the speed to some extent, its great effect on accuracy indicates that this recurrent inductive bias is necessary and effective for text representation. Two position embedding alternatives (rows 3 and 4) could also alleviate the lack of sequential information to a certain extent while avoiding the time inefficiency of the RNN (row 2).

Figure 4: Histogram of executed depths over words of two random examples in the IMDB sentiment dataset with negative (a) and positive (b) labels, respectively.

With respect to depth selection (rows 5 and 6), the Gumbel-Max technique provides a more robust depth estimation compared with direct (hard or soft) selection.

5.3 Case Study

We choose two examples from the IMDB train set with positive and negative labels, and their depth distributions are shown in Figure 4. Our model successfully pays more attention to words (e.g., ‘thriller’, ‘twisted’) that are relatively more difficult to learn, and allocates fewer computation steps to common words (e.g., ‘film’, ‘and’).

6 Related Work

Extensions of the S-LSTM. Guo et al. (2019a) enhance neural gates in the S-LSTM with the self-attention mechanism (Vaswani et al., 2017), and propose the Star-Transformer, which has shown promising performance for sentence modeling. Song et al. (2018) extend the conventional S-LSTM to the graph-state LSTM for N-ary relation extraction. Inspired by the rich node communication in the S-LSTM, Guo et al. (2019b) propose the extended Levi graph with a global node. Different from these works, we mainly focus on the problem of computational efficiency in the S-LSTM, and thus propose a depth-adaptive mechanism. Extensive experiments suggest our method achieves a good accuracy-speed trade-off.

Conditional Computation. Our work is inspired by conditional computation, where only parts of the network are selectively activated according to gating units (Bengio et al., 2013) or a learned policy (Bengio et al., 2015). A related architecture, known as Adaptive Computation Time (ACT) (Graves, 2016), employs a halting unit upon each word when sequentially reading a sentence. The halting unit determines, step by step, the probability that computation should continue or stop. ACT has been extended to control the layers of Residual Networks (Figurnov et al., 2017) and the Universal Transformer (Dehghani et al., 2019). Unlike the continuous layer-wise prediction of a stop probability in ACT, we provide an effective alternative with more straightforward modeling, which directly predicts the depth distribution for all words simultaneously. Another concurrent work, the ‘Depth-Adaptive Transformer’ (Elbayad et al., 2019), proposes to dynamically reduce the computational burden of the decoder in the sequence-to-sequence framework. In this paper, we investigate different ways to obtain the depths (e.g., Gumbel-Max), and propose a novel depth embedding to endow the model with a depth-specific view. Another group of work explores conditional computation inside the dimensions of neural network representations (Jernite et al., 2017; Shen et al., 2019), instead of activating only some layers of the model, as the adaptive depths in our method do.

7 Conclusion

We propose a depth-adaptive mechanism that allows the model itself to ‘ponder’ and ‘determine’ the depth for different words. In addition, we investigate different approaches to inject sequential information into the S-LSTM. Empirically, our model brings consistent improvements in terms of both accuracy and speed over the conventional S-LSTM, and achieves state-of-the-art results on 16 of the 24 datasets. In the future, we would like to extend our model to generation tasks, e.g., machine translation, and investigate how to introduce explicit supervision for the depth distribution.

Acknowledgments

Liu, Chen and Xu are supported by the National Natural Science Foundation of China (Contract 61370130, 61976015, 61473294 and 61876198), and the Beijing Municipal Natural Science Foundation (Contract 4172047), and the International Science and Technology Cooperation Program of the Ministry of Science and Technology (K11F100010). We sincerely thank the anonymous reviewers for their thorough reviewing and valuable suggestions.

References

Reinald Kim Amplayo, Kyungjae Lee, Jinyeong Yeo, and Seung-won Hwang. 2018. Translations as additional contexts for sentence classification. In Proceedings of the 27th International Joint Conference on Artificial Intelligence.

Daniel Beck, Gholamreza Haffari, and Trevor Cohn. 2018. Graph-to-sequence learning using gated graph neural networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.

Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. 2015. Conditional computation in neural networks for faster models. arXiv.

Yoshua Bengio, Nicholas Léonard, and Aaron Courville. 2013. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv.

Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Mike Schuster, Noam Shazeer, Niki Parmar, et al. 2018. The best of both worlds: Combining recent advances in neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.

Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. 2019. Universal transformers. In Proceedings of the Seventh International Conference on Learning Representations.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics.

Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. 2019. Depth-adaptive transformer. arXiv preprint arXiv:1910.10073.

Michael Figurnov, Maxwell D. Collins, Yukun Zhu, Li Zhang, Jonathan Huang, Dmitry Vetrov, and Ruslan Salakhutdinov. 2017. Spatially adaptive computation time for residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. 2017. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1243–1252. JMLR.org.

Alex Graves. 2016. Adaptive computation time for recurrent neural networks. arXiv.

Emil Julius Gumbel. 1954. Statistical theory of extreme values and some practical applications. NBS Applied Mathematics Series.

Qipeng Guo, Xipeng Qiu, Pengfei Liu, Yunfan Shao, Xiangyang Xue, and Zheng Zhang. 2019a. Star-Transformer. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).

Zhijiang Guo, Yan Zhang, Zhiyang Teng, and Wei Lu. 2019b. Densely connected graph convolutional networks for graph-to-sequence learning. Transactions of the Association for Computational Linguistics, 7:297–312.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Yacine Jernite, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Variable computation in recurrent neural networks. In Proceedings of the Fifth International Conference on Learning Representations.

Rie Johnson and Tong Zhang. 2017. Deep pyramid convolutional neural networks for text categorization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv.

Hung Le, Truyen Tran, and Svetha Venkatesh. 2019. Learning to remember more with less memorization. In Proceedings of the Seventh International Conference on Learning Representations.

Xin Li and Dan Roth. 2002. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics-Volume 1, pages 1–7. Association for Computational Linguistics.

Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2017. Adversarial multi-task learning for text classification. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.

Yijin Liu, Fandong Meng, Jinchao Zhang, Jie Zhou, Yufeng Chen, and Jinan Xu. 2019. CM-net: A novel collaborative memory network for spoken language understanding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China. Association for Computational Linguistics.

Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. arXiv.

Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human language technologies-volume 1, pages 142–150. Association for Computational Linguistics.

Chris J Maddison, Daniel Tarlow, and Tom Minka. 2014. A* sampling. In Advances in Neural Information Processing Systems.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems.

Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics.

Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics.

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. The journal of machine learning research, pages 1310–1318.

Chao Qiao, Bo Huang, Guocheng Niu, Daren Li, Daxiang Dong, Wei He, Dianhai Yu, and Hua Wu. 2018. A new method of region embedding for text classification. In Proceedings of the Sixth International Conference on Learning Representations.

Cícero Nogueira dos Santos and Bianca Zadrozny. 2014. Learning character-level representations for part-of-speech tagging. In Proceedings of the 31st International Conference on Machine Learning.

Dinghan Shen, Guoyin Wang, Wenlin Wang, Martin Renqiang Min, Qinliang Su, Yizhe Zhang, Chunyuan Li, Ricardo Henao, and Lawrence Carin. 2018. Baseline needs more love: On simple word-embedding-based models and associated pooling mechanisms. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.

Yikang Shen, Shawn Tan, Alessandro Sordoni, and Aaron Courville. 2019. Ordered neurons: Integrating tree structures into recurrent neural networks. In Proceedings of the Seventh International Conference on Learning Representations.

Linfeng Song, Yue Zhang, Zhiguo Wang, and Daniel Gildea. 2018. N-ary relation extraction using graph-state LSTM. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958.

Yi Tay, Luu Anh Tuan, and Siu Cheung Hui. 2018. Recurrently controlled recurrent networks. In Advances in Neural Information Processing Systems.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems.

Baoxin Wang. 2018. Disconnected recurrent neural networks for text categorization. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.

Zhiwei Wang, Yao Ma, Zitao Liu, and Jiliang Tang. 2019. R-transformer: Recurrent neural network enhanced transformer. arXiv preprint arXiv:1907.05572.

Min Yang, Wei Zhao, Jianbo Ye, Zeyang Lei, Zhou Zhao, and Soufei Zhang. 2018. Investigating capsule networks with dynamic routing for text classification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium. Association for Computational Linguistics.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.

Yongjing Yin, Linfeng Song, Jinsong Su, Jiali Zeng, Chulun Zhou, and Jiebo Luo. 2019. Graph-based neural sentence ordering. In Proceedings of the 28th International Joint Conference on Artificial Intelligence.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657.

Yue Zhang, Qi Liu, and Linfeng Song. 2018. Sentence-state LSTM for text representation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

Han Zhao, Zhengdong Lu, and Pascal Poupart. 2015. Self-adaptive hierarchical sentence model. In Proceedings of the 24th International Joint Conference on Artificial Intelligence.