Investigating Typed Syntactic Dependencies for Targeted Sentiment Classification Using Graph Attention Neural Network
2020·Arxiv
Abstract

Targeted sentiment classification predicts the sentiment polarity of given target mentions in input texts. Dominant methods employ neural networks to encode the input sentence and extract relations between target mentions and their contexts. Recently, graph neural networks have been investigated for integrating dependency syntax into the task, achieving state-of-the-art results. However, existing methods do not consider dependency label information, which can be intuitively useful. To solve the problem, we investigate a novel relational graph attention network that integrates typed syntactic dependency information. Results on standard benchmarks show that our method can effectively leverage label information to improve targeted sentiment classification performance. Our final model significantly outperforms state-of-the-art syntax-based approaches.

Targeted sentiment classification [Jiang et al., 2011; Dong et al., 2014; Vo and Zhang, 2015; Zhang et al., 2016; Wang et al., 2018a] is the task of predicting the sentiment polarity of target entity mentions in a given sentence. For example, suppose that a sentence is “I like the food here, but the service is terrible.” and the given targets are “food” and “service”. The output sentiment polarities on the two targets are positive and negative, respectively. Unlike text-level sentiment, targeted sentiment is entity-centric, and can therefore offer more fine-grained opinion mining from text documents. Dominant methods for targeted sentiment employ neural network models to encode the sentence and target mention [Wang et al., 2016; Wang et al., 2018b]. Gates [Vo and Zhang, 2015; Tang et al., 2016a; Zhang et al., 2016], convolutions [Li et al., 2018b; Huang and Carley, 2018], attention [Ma et al., 2017a; Tang et al., 2019] and memory networks [Tang et al., 2016b; Chen et al., 2017; Wang et al., 2018b] have been exploited to capture the relation between the target mention and its context. The assumption is that deep syntactic and semantic features can be represented by neural encoding.

Figure 1: (a) An example sentence with a dependency tree, (b,c) Two sentences which have similar dependency trees. The target mentions are underlined.

With the advance of structured neural encoders, syntactic structures predicted by external parsers have been shown to be useful for the task. Intuitively, syntactic structures such as dependency syntax can help better encode the correlation between a target mention and the relevant sentiment keywords. Recent methods treat dependency trees as adjacency matrices, using graph neural networks such as graph convolutional networks (GCN; Kipf and Welling [2017]) and graph attention networks (GAT; Veličković et al. [2018]) to encode the input sentence according to such matrices [Sun et al., 2019; Huang and Carley, 2019]. As shown in Figure 1(a), such dependency tree structures can bring a target mention closer to its relevant context, thereby facilitating feature representation. In the given sentence, the relevant sentiment word “good” is distant from the target mention “Chinese dumplings” in the surface string, but close in the dependency tree (i.e., “Chinese dumplings” ← “taste” −xcomp→ “good”).

While more effective than state-of-the-art approaches that do not use syntactic structures, these methods do not consider dependency labels, which can potentially be useful for sentiment disambiguation. As shown in Figure 1(b) and Figure 1(c), while “Arsenal” has similar dependency arc relations with the word “defeated” in both sentences, the sentiment polarities are different. Differentiating the label types can evidently help distinguish such cases. On the other hand, considering such information can make the syntactic structure more sparse and thus increase the difficulty of learning. It therefore remains an open research question how to effectively make use of such fine-grained syntax features for better targeted sentiment classification.

We investigate a graph attention network to integrate such typed dependency features. In particular, the proposed model incorporates label features into the attention mechanism, using the extended attention function to guide information propagation from a target mention’s syntactic context to the target mention itself. We name our model relational graph attention network (RGAT). With the help of these label features, our model can better capture the relationship between words, thus addressing the problems shown in Figure 1. Moreover, the dependency labels can serve as additional features, enriching the representation of words.

Experiments over three standard benchmarks show that using GAT to encode the input gives better results than a Transformer encoder, which agrees with the recent observation that syntax is useful for targeted sentiment classification [Sun et al., 2019; Huang and Carley, 2019]. Further, adding typed dependency features gives consistently better results than a baseline without such information, supporting our research hypothesis. Our model gives significantly better results than state-of-the-art syntax-based work on the standard Laptop, Restaurant and Twitter datasets. To our knowledge, we are the first to consider full dependency syntax for targeted sentiment classification. We release our code at xxx.

Existing work on targeted sentiment classification can be classified into two main lines, namely methods that do not rely on external syntax information and methods that use syntax structures. Most work along the first line splits a given input sentence into two sections, the target mention and its context, extracting features from each section and combining the features to make a prediction. For example, Vo and Zhang [2015] use word2vec embeddings and a pooling mechanism to extract features from the target mention and its left and right contexts, respectively, before concatenating all the features for classification. Zhang et al. [2016] use a gated recurrent neural network to extract features from target mentions and their contexts, before further defining a gate to integrate such features. Tang et al. [2016a] exploit long short-term memory network (LSTM, Hochreiter and Schmidhuber [1997]) structures for feature encoding, and combine the hidden states of the target word and context for classification. Xue and Li [2018] employ a CNN layer to extract features from the hidden states produced by a bidirectional RNN layer. Huang and Carley [2018] adopt a CNN to capture aspect-specific features.

Attention mechanisms have also proven useful. Ma et al. [2017a] use a bidirectional attention network, which learns the attention weights on the target mention toward an average of its context vectors, to model the target-context relationship. Li et al. [2018a] further improve the attention-based models with position information. Tang et al. [2019] introduce a self-supervised attention learning approach, which automatically mines useful supervision information to refine attention mechanisms.

Finally, memory networks have also been applied to this task. Tang et al. [2016b] first develop a deep memory network (MemNet), which uses pretrained word vectors as memory and exploits an attention mechanism to update the memory. Chen et al. [2017] improve MemNet by taking the hidden states generated by an LSTM as memory and adopting a GRU to update the representation of target mentions. Wang et al. [2018b] deploy a target-sensitive memory network for better information integration.

Among work that uses syntax structures, Dong et al. [2014] use a recursive neural network to encode a syntax tree, transformed by placing the target mention onto the root node. Recent work uses graph neural networks to encode the syntactic structure, obtaining better results. In particular, Sun et al. [2019] adopt GCN while Huang and Carley [2019] integrate LSTM with GAT for encoding. Both works exploit dependency syntax for encoding, but neither considers dependency label information, which we show to be important. Our work is along this line. We take GAT as our base model and investigate a relational extension. A similar work to this end is Shaw et al. [2018], who extend a self-attention network (SAN) by integrating relation information and relative position embeddings. However, they observe negative results from integrating label features for machine translation. In contrast to their work, we investigate relational GAT for targeted sentiment classification, with significantly improved results. To our knowledge, we are the first to consider integrating typed syntactic dependency information into GAT.

For our model, each instance consists of three components: a target mention, a sentence and a dependency tree of the sentence. Formally, we denote these components as a triplet $(S, S_\tau, T)$, where $S = \{w_1, w_2, \ldots, w_i, \ldots, w_{i+m}, \ldots, w_n\}$ denotes the input sentence, $S_\tau = \{w_i, w_{i+1}, \ldots, w_{i+m-1}\}$ denotes the target mention word sequence, and $T$ denotes the dependency tree over $S$. The lengths of $S$ and $S_\tau$ are $n$ and $m$, respectively. The dependency tree can be represented as a graph $T = (V, A, R)$, where $V$ includes all words $\{w_1, w_2, \ldots, w_n\}$, $A \in \mathbb{R}^{n \times n}$ is an adjacency matrix with $A_{ij} = 1$ if there is a dependency arc between words $w_i$ and $w_j$ and $A_{ij} = 0$ otherwise, and $R$ is a relation matrix, where $R_{ij}$ corresponds to the label of $A_{ij}$. The goal of targeted sentiment classification is to predict the sentiment polarity $y \in \{1, -1, 0\}$ of a sentence $S$ over the target mention $S_\tau$, where $1$, $-1$ and $0$ denote “positive”, “negative” and “neutral”, respectively.
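As an illustration of the graph $T = (V, A, R)$ defined above, the following minimal Python sketch builds the adjacency matrix and the relation matrix from a list of labeled arcs. The function name `build_graph`, the toy sentence, and the choice to mirror every arc with a `rev:` label are assumptions made for this example; the text only states that $A_{ij} = 1$ when an arc relates $w_i$ and $w_j$.

```python
from typing import List, Tuple


def build_graph(n: int, arcs: List[Tuple[int, int, str]]):
    """Return (A, R) for a sentence of n words.

    arcs holds (head, dependent, label) triples with 0-based word indices.
    """
    A = [[0] * n for _ in range(n)]
    R = [[None] * n for _ in range(n)]
    for head, dep, label in arcs:
        A[head][dep] = 1
        R[head][dep] = label
        # Assumption: arcs are mirrored so information can flow both ways;
        # the reverse direction gets a distinct, hypothetical "rev:" label.
        A[dep][head] = 1
        R[dep][head] = "rev:" + label
    return A, R


# Toy sentence and parse, invented for illustration only.
words = ["The", "dumplings", "taste", "good"]
arcs = [(1, 0, "det"), (2, 1, "nsubj"), (2, 3, "xcomp")]
A, R = build_graph(len(words), arcs)
```

Here `R[2][3]` holds the label `xcomp`, while pairs of words with no arc between them keep `A[i][j] = 0` and an empty label.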

The overall architecture of our model is shown in Figure 2. It is mainly composed of three components: the Input Layer, the Relational Graph Attention Layer and the Output Layer.

3.1 Input Layer

We consider two types of word embeddings, namely GloVe embeddings [Pennington et al., 2014] and BERT embeddings [Devlin et al., 2019], which are representative of traditional and contextual embeddings, respectively.

Figure 2: The proposed model framework.

GloVe Embeddings. Given a sentence $S = \{w_1, w_2, \ldots, w_n\}$ with a target mention, we conduct a lookup operation over a GloVe word embedding matrix to obtain a vector $v_i \in \mathbb{R}^{d_w}$ for each word $w_i$, where $d_w$ is the dimension of word embeddings.

Following previous work [Sun et al., 2019], we also use POS-tag embeddings $t_i \in \mathbb{R}^{d_t}$ and position embeddings $p_i \in \mathbb{R}^{d_p}$ as additional inputs, where $d_t$ and $d_p$ denote the dimensions of POS-tag and position embeddings, respectively. Therefore, the final representation of a word $w_i$ is $x_i = [v_i; t_i; p_i]$, which denotes the concatenation of $v_i$, $t_i$ and $p_i$.

To capture rich features for each word, we use a BiLSTM layer to model the bidirectional context information. Formally, given a word embedding sequence $x = \{x_1, x_2, \ldots, x_n\}$, a forward $\overrightarrow{\mathrm{LSTM}}$ generates a set of hidden states $\{\overrightarrow{h}^0_1, \overrightarrow{h}^0_2, \ldots, \overrightarrow{h}^0_n\}$, and a backward $\overleftarrow{\mathrm{LSTM}}$ generates a set of hidden states $\{\overleftarrow{h}^0_1, \overleftarrow{h}^0_2, \ldots, \overleftarrow{h}^0_n\}$. Finally, the output hidden states $\{h^0_1, h^0_2, \ldots, h^0_n\}$ are obtained by concatenating the corresponding forward and backward hidden states.

BERT Embeddings. We adopt BERT to provide a contextual embedding $x_i$ for each word $w_i$. To facilitate the training and fine-tuning of the BERT model, we construct the input to BERT as “[CLS]” + sentence + “[SEP]” + target mention + “[SEP]”. As with GloVe embeddings, we also adopt POS-tag embeddings and position embeddings as additional inputs. For the BERT-based model, the bidirectional LSTM module is unnecessary, as the BERT embeddings already contain contextual features owing to better model design and a large-scale training corpus.
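The BERT input layout described above can be sketched in a few lines of Python. The helper name `build_bert_input` is hypothetical, and whitespace splitting stands in for BERT's subword tokenizer, which this simplified example does not reproduce.

```python
def build_bert_input(sentence_tokens, target_tokens):
    """Lay out tokens as [CLS] sentence [SEP] target mention [SEP]."""
    return ["[CLS]"] + sentence_tokens + ["[SEP]"] + target_tokens + ["[SEP]"]


# Example sentence from the introduction; whitespace tokenization is an
# assumption made to keep the sketch self-contained.
tokens = build_bert_input(
    "I like the food here , but the service is terrible .".split(),
    ["service"],
)
```

With this layout the same sentence yields a different input sequence for each target mention, since the target follows the first separator.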

3.2 Relational Graph Attention Network

The graph attention network layer learns a deep representation over the input layer representation, integrating target and context features. Our relational graph attention network (RGAT) aims to perform information exchange among words according to the syntactic dependency paths. In this section, we start by introducing the baseline GAT model which takes G = (V, A) as input, and then present the proposed model, which can make use of extra relation features for better feature representation.

Vanilla Graph Attention Network

Graph attention networks are a variant of graph neural networks, which leverage masked self-attention layers to encode the graph structures.

Layer input and output. Given a set of word embeddings $\{v_1, v_2, \ldots, v_n\}$, a GAT takes them as $\{h^0_1, h^0_2, \ldots, h^0_n\}$, iteratively producing more abstract features $\{h^l_1, h^l_2, \ldots, h^l_n\}$ with increasing $l$. The $l$-th GAT layer takes the preceding word features $\{h^{l-1}_1, h^{l-1}_2, \ldots, h^{l-1}_n\}$ and an adjacency matrix $A$ as input, and produces a new set of word features $\{h^l_1, h^l_2, \ldots, h^l_n\}$ as its output.

Feature aggregation. Given a word $w_i$ with its representation $h^{l-1}_i$ and its neighbors $w_j \in N_i$, a GAT updates the word’s representation at layer $l$ by calculating a weighted sum of the neighbor states. Briefly, the aggregation process of a multi-head-attention-based GAT can be described as:

image

where $\Vert$ denotes vector concatenation, $W^{lz}_V \in \mathbb{R}^{\frac{d}{Z} \times d}$ is a parameter matrix of the $z$-th head at layer $l$, $d$ denotes the dimension of word feature vectors, $Z$ is the number of attention heads, and $\sigma$ represents the sigmoid activation function.

The weight $\alpha^{lz}_{ij}$ in the above equation is calculated via an attention process, which models the importance of each $h^{lz}_j$ to $h^{lz}_i$:

image

where $F$ is an attention function. In this work, we use the scaled dot-product attention function [Vaswani et al., 2017]:

image

where $W^{lz}_K, W^{lz}_V \in \mathbb{R}^{\frac{d}{Z} \times d}$ are parameter matrices of the $z$-th head in layer $l$.
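Under the definitions above, the aggregation, attention-weight and attention-function steps can be written out as follows. This is a sketch rather than a verbatim statement of the original equations: the query projection $W^{lz}_Q$ and the scaling factor $\sqrt{d/Z}$ are assumptions taken from the standard scaled dot-product attention of Vaswani et al. [2017], and the attention function is assumed to act on the previous-layer features.

```latex
\begin{align}
h^{l}_{i} &= \big\Vert_{z=1}^{Z} \, \sigma\Big( \sum_{j \in N_i} \alpha^{lz}_{ij} \, W^{lz}_{V} h^{l-1}_{j} \Big) \\
\alpha^{lz}_{ij} &= \frac{\exp\big(F(h^{l-1}_{i}, h^{l-1}_{j})\big)}
                        {\sum_{j' \in N_i} \exp\big(F(h^{l-1}_{i}, h^{l-1}_{j'})\big)} \\
F(h^{l-1}_{i}, h^{l-1}_{j}) &= \frac{\big(W^{lz}_{Q} h^{l-1}_{i}\big)^{\top} \big(W^{lz}_{K} h^{l-1}_{j}\big)}{\sqrt{d/Z}}
\end{align}
```

Because the softmax runs only over $j \in N_i$, words that are not connected to $w_i$ in the dependency graph receive no attention weight, which is the sense in which the self-attention is masked.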

Apart from the aforementioned layer, we further add a point-wise convolution transformation (PCT) layer following the attention layer, which gives each node more information capacity. The convolution kernel size is 1, and the convolution is applied to every single token of the input. Given an input sequence $h = \{h^l_1, h^l_2, \ldots, h^l_n\}$, PCT is defined as:

image

where $\delta$ refers to the ReLU activation function, $\star$ denotes the convolution operation, and $W^p_1, W^p_2, b^p_1, b^p_2$ are the weights and biases of two convolutional kernels.
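A convolution with kernel size 1 applies the same affine map to every token independently, so PCT can be sketched without any convolution library. The composition below (affine, ReLU, affine) is our reading of the description above; the function names and the toy weights are assumptions for illustration.

```python
def affine(W, b, x):
    """Compute W x + b for a matrix W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def relu(x):
    return [max(0.0, v) for v in x]


def pct(h, W1, b1, W2, b2):
    """Point-wise convolution transformation over a sequence of token vectors.

    With kernel size 1, each token is transformed on its own, so the two
    convolutions reduce to two per-token affine maps with a ReLU in between.
    """
    return [affine(W2, b2, relu(affine(W1, b1, token))) for token in h]


# Toy input: two tokens with two features each (values are illustrative).
h = [[1.0, -2.0], [0.5, 0.5]]
W1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
W2, b2 = [[2.0, 0.0], [0.0, 2.0]], [1.0, 1.0]
out = pct(h, W1, b1, W2, b2)
```

Since no token looks at its neighbors here, all mixing between words is left to the attention layer.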

Relational Graph Attention Network

The vanilla GAT mentioned above uses an adjacency matrix as structural information, thus omitting dependency label features. RGAT is an extension of GAT, which incorporates relational features into the attention calculation and the aggregation process to obtain better representations.

Relations as input. Denoting the relation between words $w_i$ and $w_j$ as $R_{ij}$, we transform $R_{ij}$ into a vector $r_{ij} \in \mathbb{R}^{d_r}$, where $d_r$ is the dimension of relation embeddings. During training, the relation embeddings are jointly optimized with the model.

Relation-aware attention. Inspired by previous work on SAN [Shaw et al., 2018], we consider relation features when calculating attention weights for RGAT. In particular, this intuition is implemented by adding a new term to Equation 3 when calculating the weights between $w_i$ and $w_j$:

image

where $W^l_{Kr} \in \mathbb{R}^{\frac{d}{Z} \times d_r}$ is a parameter matrix, which is shared across multiple attention heads.

Relation-aware feature aggregation. Relations can also be important in the feature aggregation process, as this additional fine-grained information can be used to enrich the feature of each word $h^l_i$. To this end, we use the features of neighbor words $h^{l-1}_j$ together with their corresponding relations $r_{ij}$ as inputs to update the representation of $h^{l-1}_i$:

image

where $W^l_{Vr} \in \mathbb{R}^{\frac{d}{Z} \times d_r}$ is a parameter matrix. In order to learn deep features, we apply a stacked relational graph attention network with multiple layers.
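The two relation-aware steps can be illustrated with a single attention head in plain Python. This is a minimal sketch under several assumptions that are not confirmed by the text: the relation term is added to the key (for attention) and to the value (for aggregation) in the style of Shaw et al. [2018], the query matrix `W_q` and the $\sqrt{d/Z}$ scaling follow standard scaled dot-product attention, and every word has a self-loop carrying a hypothetical `self` label. The full model would concatenate $Z$ such heads.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def rgat_head(h, neighbors, rel_emb, W_q, W_k, W_v, W_kr, W_vr):
    """One relation-aware attention head (illustrative, not the authors' code).

    h         : list of n word vectors from the previous layer
    neighbors : neighbors[i] is a list of (j, label) pairs for word i
    rel_emb   : dict mapping a dependency label to its embedding r_ij
    Returns (new word vectors, attention weights per word).
    """
    d_head = len(W_q)
    outputs, weights = [], []
    for i, h_i in enumerate(h):
        query = matvec(W_q, h_i)
        scores, values = [], []
        for j, label in neighbors[i]:
            r = rel_emb[label]
            key = add(matvec(W_k, h[j]), matvec(W_kr, r))  # relation-aware key
            val = add(matvec(W_v, h[j]), matvec(W_vr, r))  # relation-aware value
            scores.append(sum(a * b for a, b in zip(query, key)) / math.sqrt(d_head))
            values.append(val)
        top = max(scores)  # subtract the maximum for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alpha = [e / total for e in exps]
        agg = [sum(a * v[t] for a, v in zip(alpha, values)) for t in range(d_head)]
        outputs.append([1.0 / (1.0 + math.exp(-x)) for x in agg])  # sigmoid
        weights.append(alpha)
    return outputs, weights


# Toy graph with three words; all numbers are illustrative.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = [
    [(0, "self"), (1, "nsubj")],
    [(1, "self"), (0, "nsubj"), (2, "xcomp")],
    [(2, "self"), (1, "xcomp")],
]
rel_emb = {"self": [0.0], "nsubj": [1.0], "xcomp": [-1.0]}
I2 = [[1.0, 0.0], [0.0, 1.0]]
out, att = rgat_head(h, neighbors, rel_emb, I2, I2, I2, [[0.5], [0.5]], [[0.1], [0.1]])
```

Setting `W_kr` and `W_vr` to zero matrices recovers a vanilla GAT head on the same graph, which makes it easy to see how the label embeddings alone shift the attention weights.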

3.3 Output Layer

Table 1: Statistics of datasets.

With $L$ relational graph attention layers, we obtain a final feature representation for each word $\{h^L_1, h^L_2, \ldots, h^L_n\}$. A pooling function $G$ is then applied over the target mention vectors to obtain a global representation $h_g$, which is used to calculate the probability of each sentiment class $c$:

image

where $W$ and $b$ are tunable model parameters, and $C$ is the set of sentiment classes. We implement $G$ as an average pooling function that aggregates $\{h^L_i, h^L_{i+1}, \ldots, h^L_{i+m-1}\}$ into $h_g$.
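The output step can be sketched as average pooling over the target span followed by a linear layer and a softmax. The softmax form is an assumption consistent with a probability over the class set $C$; the function names, the class order and the toy numbers are invented for this example.

```python
import math


def average_pool(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(h_final, start, length, W, b):
    """Pool the target span h_final[start:start+length] and return class probabilities."""
    h_g = average_pool(h_final[start:start + length])
    logits = [sum(w * x for w, x in zip(row, h_g)) + bias for row, bias in zip(W, b)]
    return softmax(logits)


# Toy final-layer features for three words; the target covers words 1 and 2.
h_final = [[0.0, 0.0], [1.0, 3.0], [3.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # one row per class (hypothetical order)
b = [0.0, 0.0, 0.0]
probs = classify(h_final, 1, 2, W, b)
```

Only the target mention vectors are pooled, so context words influence the prediction solely through what the attention layers have already propagated into the target positions.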

Given a set of training instances $D = \{d_1, d_2, \ldots, d_N\}$, the training objective of our model is a cross-entropy loss with $L_2$ regularization:

image

where $I$ is an indicator function, $N$ is the number of training examples, $\lambda$ is a regularization hyperparameter and $\Theta$ denotes all parameters in the model.
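A minimal sketch of such an objective is given below. It assumes the cross-entropy term is summed over instances and the penalty is $\lambda$ times the squared $L_2$ norm of the parameters; whether the original objective sums or averages over the $N$ instances is not recoverable from the text, and the function name is hypothetical.

```python
import math


def regularized_cross_entropy(probs, gold, params, lam):
    """Cross-entropy over instances plus lam * squared L2 norm of params.

    probs  : per-instance lists of class probabilities
    gold   : per-instance index of the correct class
    params : flat list of all model parameters
    """
    # The indicator function selects the log-probability of the gold class.
    cross_entropy = -sum(math.log(p[y]) for p, y in zip(probs, gold))
    l2_penalty = sum(w * w for w in params)
    return cross_entropy + lam * l2_penalty


loss = regularized_cross_entropy([[0.5, 0.25, 0.25]], [0], [1.0, 2.0], 0.1)
```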

We conduct experiments on three benchmark datasets: the Restaurant reviews (Restaurant) and Laptop reviews (Laptop) datasets of SemEval 2014 [Pontiki et al., 2014], and the ACL14 Twitter dataset of Dong et al. [2014]. These datasets are labeled with three sentiment polarities: positive, neutral and negative. The numbers of samples in each category are summarized in Table 1.

4.1 Settings

For fair model comparison, we adopt hyperparameters similar to those of previous work. In particular, we use the Stanford neural parser to obtain dependency trees. 300-dimensional GloVe vectors are adopted for word representation and fixed during the learning process. The dimension of relation embeddings is set to 30. We use 5 heads in our GAT and RGAT models, and the hidden size is set to 100. The dropout rate on input word embeddings is 0.7, and the $L_2$ regularization term is $\lambda = 10^{-5}$. The Adamax [Kingma and Ba, 2015] optimizer with a learning rate of $10^{-3}$ is adopted to train our models. For the BERT-based model, we fine-tune pre-trained BERT during training. The head number and hidden size are the same as in the GloVe-based model. The dropout rate on BERT embeddings is 0.1, and the regularization term is $\lambda = 10^{-5}$. The Adam [Kingma and Ba, 2015] optimizer with a learning rate of $2 \times 10^{-5}$ is adopted for training these models. Evaluation metrics are Accuracy and Macro-Averaged F1, where the latter is more appropriate for datasets with unbalanced classes. Furthermore, a pairwise t-test on both Accuracy and Macro-Averaged F1 is conducted to verify whether the improvement over the compared models is significant.
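The two evaluation metrics can be computed as in the short Python sketch below, which uses invented labels rather than any result from the experiments. It also shows why Macro-Averaged F1 is the stricter metric on unbalanced data: a class that is never predicted contributes an F1 of zero regardless of how rare it is.

```python
def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


# Invented labels: 1 = positive, -1 = negative, 0 = neutral.
gold = [1, 1, 1, 1, -1, 0]
pred = [1, 1, 1, 1, 1, 0]
acc = accuracy(gold, pred)
mf1 = macro_f1(gold, pred, [1, -1, 0])
```

In this toy case five of six predictions are correct, yet the macro score is pulled down sharply because the single negative instance is missed.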

4.2 Baselines

We compare our model with the state-of-the-art systems. Our baselines include:

- SVM [Kiritchenko et al., 2014], a traditional model with extensive feature engineering.

- AdaRNN [Dong et al., 2014], which learns the sentence representation toward target via RNN semantic composition over a dependency tree.

- IAN [Ma et al., 2017b], which models the representations of the target and context interactively via two LSTMs and attention.

- TNet [Li et al., 2018b], which transforms BiLSTM embeddings into target-specific embeddings, and uses a CNN to extract the final embedding.

- MGAN [Fan et al., 2018], which exploits a BiLSTM to capture contextual information and a multi-grained attention mechanism to capture the relationship between aspect and context.

- AEN [Song et al., 2019], which adopts an attentional encoder network for feature representation and for modeling semantic interactions between target and context.

- CDT [Sun et al., 2019], which integrates dependency trees with GCN for aspect representation learning.

- TD-GAT [Huang and Carley, 2019], which uses GAT to capture the syntax structure and improves it with LSTM to model relation across layers.

- BERT-PT [Xu et al., 2019], which uses a post-training approach on a pretrained BERT model to improve the performance for review reading comprehension and targeted aspect sentiment classification.

- BERT-SPC [Song et al., 2019], which feeds “[CLS]” + sentence + “[SEP]” + target mention + “[SEP]” into pretrained BERT model, and then uses the pooled embedding for classification.

4.3 Main Results

Table 2 shows the classification results of different models. First, compared with systems without dependency tree features (SVM, IAN, TNet, MGAN, AEN), our model gives significantly better results. This indicates that the dependency tree can provide useful information for the targeted aspect sentiment classification task, which is consistent with observations in recent work [Sun et al., 2019; Huang and Carley, 2019]. Second, compared with recent work which takes dependency trees without relation labels as input (CDT, TD-GAT), RGAT consistently outperforms TD-GAT across all the datasets, with an accuracy improvement of 3.6 percent on Laptop. In addition, RGAT (GloVe) gives better results than the state-of-the-art CDT model on Restaurant and Laptop. While achieving accuracy comparable to CDT and TNet on Twitter, RGAT (GloVe) achieves a better F1 score. Furthermore, according to the results, a pretrained BERT model can significantly boost the performance of each model. With the help of BERT, the proposed model achieves better results than all the baselines, with accuracies of 86.59, 81.25 and 75.84 on Restaurant, Laptop and Twitter, respectively.

Table 2: Performance comparison of different models on the benchmark datasets. The best performances are in bold. † denotes that the model requires a dependency tree as input. Underlined results indicate that the proposed method is significantly better than the baseline at significance level p<0.01.

4.4 Ablation Study

We present an ablation study on the effects of structure information and relation information. In particular, we consider three ablation baselines for comparison: 1) Transformer, in which we replace the relational graph attention layers with self-attention layers; this model can be viewed as our model without the dependency tree as supervision; 2) GAT, in which the relational graph attention layers are substituted by graph attention layers, serving as our model without relation features; and 3) GAT-Ratt, in which we remove the relation-aware feature aggregation module of RGAT to test the effectiveness of the relation-aware attention independently.

Effectiveness of structural information. As shown in Table 3, the GAT (G) model consistently gives better accuracy and F1 score than the Transformer (G) model, with average improvements of 3.0 and 3.2 percent in accuracy and F1 score, respectively. In addition, the performance of GAT (B) is also better than that of Transformer (B). The above results indicate that syntax is helpful for targeted sentiment classification. A possible reason why the improvement on BERT is relatively small is that structural information is contained in the BERT representations to some extent because of contextual language modeling [Jawahar et al., 2019].

Table 3: An ablation study on the Restaurant and Laptop datasets. The best performances are in bold. G and B denote GloVe and BERT, respectively. Ratt refers to relational attention.

Figure 3: The effects of model depth.

Effectiveness of relation-aware attention. GAT-Ratt consistently shows better performance than GAT. Specifically, GAT-Ratt (B) improves on GAT (B) by 0.8 percent in accuracy on Restaurant and 0.6 percent in F1 score on Laptop. Such results indicate that the relation-aware attention is useful for targeted sentiment analysis.

Effectiveness of RGAT for typed dependencies. RGAT gives better results than GAT and GAT-Ratt on both datasets. RGAT (G) improves on GAT-Ratt (G) by 0.7 percent and 1.0 percent in accuracy and F1, respectively. On average, the accuracy of GAT (G) increases by 1.0 percent and 0.65 percent on Restaurant and Laptop, respectively. The improvements in F1 score reach 1.3 percent and 1.6 percent, respectively. Similar trends can also be observed when comparing RGAT (B) with GAT (B). Such results demonstrate that the relation-aware feature aggregation can further enhance the model capacity, and that the dependency labels can bring significant improvement to the model, whether taking GloVe or BERT as input.

4.5 Impacts of Model Depth

Figure 3 shows the accuracy curves for RGAT-GloVe and RGAT-BERT on the Restaurant dataset. Different numbers of layers ranging from 1 to 8 are considered. For RGAT-GloVe, the initial accuracy is very low and then increases along with the depth, reaching the best score of 83.45 with 6 layers. This is intuitive, as the target-related sentiment words can be many hops away from the target mention, which means that multiple layers of node communication are necessary for passing information from relevant context to the target mention. In contrast to RGAT-GloVe, RGAT-BERT reaches its best accuracy faster, with 4 layers, possibly because the syntax contained in BERT allows faster information propagation.

Figure 4: Several samples. The numbers denote the attention weights given by RGAT (red) and GAT (blue). The target mentions are underlined.
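The hop-distance argument in the depth analysis can be made concrete with a breadth-first search over the dependency graph. Each attention layer passes information along one arc, so a sentiment word that is k arcs from the target needs at least k layers to reach it. The graph below is a hypothetical chain invented for this sketch, and `hop_distance` is an illustrative helper rather than part of the model.

```python
from collections import deque


def hop_distance(neighbors, source, goal):
    """Breadth-first search over an undirected dependency graph.

    Returns the number of arcs on the shortest path from source to goal,
    or None if goal cannot be reached.
    """
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbors[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


# Hypothetical chain: target (0) - verb (1) - complement (2) - modifier (3).
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
hops = hop_distance(neighbors, 0, 3)
```

In this chain the modifier is three arcs from the target, so a one-layer or two-layer network cannot carry its features to the target position.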

4.6 Case Study

We provide 4 examples to further analyze the proposed model. The attention scores of edges are given for comparison. In the cases shown in Figure 4(a) and Figure 4(b), GAT gives the right sentiment in case 1 and a wrong sentiment in case 2, while RGAT obtains the right sentiment in both cases, which indicates that RGAT can differentiate similar structures thanks to the dependency labels. In Figure 4(c), GAT predicts a “positive” sentiment for the target mention “scallops”, and the attention weight on the edge “as” → “scallops” is high, which demonstrates that GAT is negatively influenced by “well”, which does not have a positive meaning in this case. By contrast, RGAT gives the correct result and the attention weight on the edge “as” −cc→ “scallops” is close to zero. Figure 4(d) shows another case where the decision of GAT is influenced by an irrelevant word “bad”. GAT’s attention weight on the edge “get” −cc→ “menu” is 0.8 while that of RGAT is 0.5. The last two examples indicate that the dependency labels have a positive influence on modeling the relationship between two words.

We investigated the use of typed dependency structures for targeted sentiment classification, by extending a graph attention network encoder with relation features. Extensive experiments on three standard benchmarks show that label information is useful for sentiment classification, and that our relational GAT model can effectively encode such features. Our final model gives results that are highly competitive with the best results in the literature. To our knowledge, we are the first to consider integrating relations into GAT and the first to consider typed dependencies for targeted sentiment classification.

[Chen et al., 2017] Peng Chen, Zhongqian Sun, Lidong Bing, and Wei Yang. Recurrent attention network on memory for aspect sentiment analysis. In EMNLP, 2017.

[Devlin et al., 2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.

[Dong et al., 2014] Li Dong, Furu Wei, Chuanqi Tan, Duyu Tang, Ming Zhou, and Ke Xu. Adaptive recursive neural network for target-dependent twitter sentiment classification. In ACL, 2014.

[Fan et al., 2018] Feifan Fan, Yansong Feng, and Dongyan Zhao. Multi-grained attention network for aspect-level sentiment classification. In EMNLP, 2018.

[Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997.

[Huang and Carley, 2018] Binxuan Huang and Kathleen Carley. Parameterized convolutional neural networks for aspect level sentiment classification. In EMNLP, 2018.

[Huang and Carley, 2019] Binxuan Huang and Kathleen Carley. Syntax-aware aspect level sentiment classification with graph attention networks. In EMNLP-IJCNLP, 2019.

[Jawahar et al., 2019] Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. What does BERT learn about the structure of language? In ACL, 2019.

[Jiang et al., 2011] Long Jiang, Mo Yu, Ming Zhou, Xiaohua Liu, and Tiejun Zhao. Target-dependent twitter sentiment classification. In ACL, 2011.

[Kingma and Ba, 2015] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2015.

[Kipf and Welling, 2017] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.

[Kiritchenko et al., 2014] Svetlana Kiritchenko, Xiaodan Zhu, Colin Cherry, and Saif Mohammad. NRC-canada-2014: Detecting aspects and sentiment in customer reviews. In SemEval, 2014.

[Li et al., 2018a] Lishuang Li, Yang Liu, and AnQiao Zhou. Hierarchical attention based position-aware network for aspect-level sentiment analysis. In CoNLL, 2018.

[Li et al., 2018b] Xin Li, Lidong Bing, Wai Lam, and Bei Shi. Transformation networks for target-oriented sentiment classification. In ACL, 2018.

[Ma et al., 2017a] Dehong Ma, Sujian Li, Xiaodong Zhang, and Houfeng Wang. Interactive attention networks for aspect-level sentiment classification. In IJCAI, 2017.

[Ma et al., 2017b] Dehong Ma, Sujian Li, Xiaodong Zhang, and Houfeng Wang. Interactive attention networks for aspect-level sentiment classification. In IJCAI, 2017.

[Pennington et al., 2014] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In EMNLP, October 2014.

[Pontiki et al., 2014] Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. SemEval-2014 task 4: Aspect based sentiment analysis. In SemEval, 2014.

[Shaw et al., 2018] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. In NAACL, 2018.

[Song et al., 2019] Youwei Song, Jiahai Wang, Tao Jiang, Zhiyue Liu, and Yanghui Rao. Attentional encoder network for targeted sentiment classification. ArXiv, 2019.

[Sun et al., 2019] Kai Sun, Richong Zhang, Samuel Mensah, Yongyi Mao, and Xudong Liu. Aspect-level sentiment analysis via convolution over dependency tree. In EMNLP-IJCNLP, 2019.

[Tang et al., 2016a] Duyu Tang, Bing Qin, Xiaocheng Feng, and Ting Liu. Effective LSTMs for target-dependent sentiment classification. In COLING, 2016.

[Tang et al., 2016b] Duyu Tang, Bing Qin, and Ting Liu. Aspect level sentiment classification with deep memory network. In EMNLP, 2016.

[Tang et al., 2019] Jialong Tang, Ziyao Lu, Jinsong Su, Yubin Ge, Linfeng Song, Le Sun, and Jiebo Luo. Progressive self-supervised attention learning for aspect-level sentiment analysis. In ACL, 2019.

[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.

[Veličković et al., 2018] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In ICLR, 2018.

[Vo and Zhang, 2015] Duy-Tin Vo and Yue Zhang. Target-dependent twitter sentiment classification with rich automatic features. In IJCAI, 2015.

[Wang et al., 2016] Yequan Wang, Minlie Huang, Xiaoyan Zhu, and Li Zhao. Attention-based LSTM for aspect-level sentiment classification. In EMNLP, 2016.

[Wang et al., 2018a] Shuai Wang, Sahisnu Mazumder, Bing Liu, Mianwei Zhou, and Yi Chang. Target-sensitive memory networks for aspect sentiment classification. In ACL, 2018.

[Wang et al., 2018b] Shuai Wang, Sahisnu Mazumder, Bing Liu, Mianwei Zhou, and Yi Chang. Target-sensitive memory networks for aspect sentiment classification. In ACL, 2018.

[Xu et al., 2019] Hu Xu, Bing Liu, Lei Shu, and Philip Yu. BERT post-training for review reading comprehension and aspect-based sentiment analysis. In NAACL, 2019.

[Xue and Li, 2018] Wei Xue and Tao Li. Aspect based sentiment analysis with gated convolutional networks. In ACL, 2018.

[Zhang et al., 2016] Meishan Zhang, Yue Zhang, and Duy-Tin Vo. Gated neural networks for targeted sentiment analysis. In AAAI, 2016.