As the volume of news, blogs [8], and social media [14] far outstrips what an individual can consume, sentiment classification (SC) [15, 12] has become a powerful tool for understanding emotions toward politicians, celebrities, products, governance decisions, etc. Of particular interest is to identify sentiments expressed toward specific entities, i.e., target dependent sentiment classification (TDSC). Recent years have witnessed many TDSC approaches [23, 21, 22] with increasing sophistication and accuracy.
Possibly because research on passage-level and on target-dependent sentiment classification was separated in time by the dramatic emergence of deep learning, TDSC systems predominantly use recurrent neural networks (RNNs) and borrow little from passage-level tasks and trained models. From the perspective of curriculum learning [2], this seems suboptimal: representations borrowed from passage-level SC should inform TDSC well. Moreover, whole-passage labeling entails a considerably lighter cognitive burden than target-specific labeling. As a result, whole-passage gold labels can be collected in larger volumes.
In this paper, we present MTTDSC, Multi-Task Target Dependent Sentiment Classifier, a novel multi-task learning (MTL) system that uses passage-level SC as an auxiliary task and TDSC as the main task. MTL has shown significant improvements in many fields of Natural Language Processing and Computer Vision. In basic (‘naive’) MTL, we jointly train multiple models for multiple tasks with some shared parameters, usually in network layers closest to the inputs [13], resulting in shared input representation learning. Symmetric, uncontrolled sharing can be detrimental to some tasks. In MTTDSC, the auxiliary SC task uses bidirectional GRUs, whose states are pooled over positions to make whole-passage predictions. This sensitizes the auxiliary GRU to target-independent expressions of sentiments in words. The main TDSC task combines the auxiliary GRUs with its own target-specific GRUs. The two tasks are jointly trained. If passages with both global and target-specific labels are available, they can be shared between the tasks. Otherwise, the two tasks can also be trained on disjoint passages.
MTTDSC can be interpreted as a form of task-level curriculum learning [2], where the simpler whole-passage SC task learns to identify sentiments latent in word vectors, which then assists the more challenging TDSC task. Static sentiment lexicons, such as SentiWordNet [6], are often inadequate for dealing with informal media. Using two standard datasets, as well as one new dataset we introduce for the main task, we establish the superiority of MTTDSC over several state-of-the-art approaches.
While improved accuracy from additional training data may seem unsurprising from a learning perspective, we show that beneficial integration of the auxiliary task and data is nontrivial. Simpler multi-task approaches [13], where a common feature extraction network is used for jointly training on multiple tasks, perform poorly. We also use word-level sensitivity tests to obtain anecdotal evidence that direct TDSC approaches (that do not borrow from whole-passage SC models) make target-specific prediction errors because they misclassify the (target independent) sentiments expressed by words. Thus, MTTDSC also provides a more interpretable model, apart from accuracy gains. The contributions of our work are summarized as follows:
– MTTDSC, a novel neural MTL architecture designed specifically for TDSC. We show the superiority of our model and also compare it with other state-of-the-art models of TDSC and multi-task learning.
– A new dataset for target dependent sentiment classification which is better for real world analysis on social media data.
– Thorough investigation of the reasons behind the success of MTTDSC. In particular, we show that current models fail to capture many emotive words owing to insufficient training data.
MTTDSC code and datasets are available at https://github.com/divamgupta/
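Target-dependent sentiment classification: An input text passage is a sequence of tokens $x_1, \ldots, x_N$, one of which is designated as the target. TDLSTM [20] runs one LSTM left-to-right up to the target and another right-to-left back to the target, and passes their final states through a fully-connected layer to predict $\hat{y}$, where $\hat{y}$ is a 3-class multinomial distribution over $\{-1, 0, +1\}$, obtained from the softmax. Standard cross-entropy against the one-hot gold label is used for training. TCLSTM [20] is a slight modification of TDLSTM, in which the authors also concatenate the embedding of the target entity with each token in the given sentence. They showed that TCLSTM yields a slight improvement over TDLSTM.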
By pooling embeddings of words appearing on dependency paths leading to the target position, TDParse [22] improves further on TDLSTM accuracy. More details of these systems are described in Section 4.4, along with their performance. The major problem with TDParse is its inability to learn compositions of words. TDParse usually fails on sentences containing a polar word that is unrelated to the entity.
The “naive segmentation” (Naive-Seg) model of Wang et al. [22] concatenates word embeddings of left context, right context and sub sentences of the tweets. Various pooling functions are used to combine them and an SVM is used for labeling. Naive-Seg+ extends Naive-Seg by using sentiment lexicon based features. TDParse extends Naive-Seg by using dependency parse paths incident on the target entity to collect words whose embeddings are then pooled. TDParse+ further extends TDParse by using sentiment lexicon [6] based features. TDParse+ beats TDLSTM largely because of carefully engineered features (including SentiWordNet based features), but may not generalize to diverse datasets.
Pooling word embeddings over dependency paths may not capture complex compositional semantics. Given enough training data, TDLSTM should capture complex compositional semantics. But in practice, neural sequence models start with word vectors that were not customized for sentiment detection, and then get limited training data.
Multi-task learning: Multi-task learning has been used in many applications related to NLP. Peng et al. [16] showed improved results in semantic dependency parsing by learning three semantic dependency graph formalisms. Choi et al. [3] improved the performance on question answering by jointly training answer generation and answer retrieval models. Sluice networks, proposed by Ruder et al. [18], are claimed to be a generalized model that can learn to control the sharing of information between different task models. Sluice networks do not perform well for TDSC, as the sharing of information happens at all positions of the sentence. In contrast, our model forces the auxiliary task to learn feature representations at all positions and shares them at the appropriate locations with the main task.
Recurrent models for TDSC have to solve two challenging problems in one shot: identify sentiment-bearing words in the passage, and use influences between hidden states to connect those sentiments to the target. A typical TDSC system attempts to do this as a self-contained learner, without representation support from an auxiliary learner solving the simpler task of whole-passage SC. We present anecdotes in Section 4.5 that reveal the limitations of such approaches. In response, we propose a multi-task learning (MTL) approach called MTTDSC. Representations trained for the auxiliary task (Section 3.1) inform the main task (Section 3.2). The combined loss objective is described in Section 3.3 and implementation details are presented in Section 3.4.
Our MTL framework is significantly different from traditional ones. In particular, we do not require auxiliary and main task gold labels to be available on the same instances. (This makes it easier to collect larger volumes of auxiliary labeled data.) As a result, in standard MTL, attempts to improve auxiliary task performance interfere with the learning of feature representations that are important for the main task. To solve this problem, we use separate RNNs for the two tasks, with the output of the auxiliary RNN acting as additional features for the main model. This ensures that the gradients from the auxiliary task loss do not unduly interfere with the weights of the main task RNN.
3.1 Auxiliary task
The network for the auxiliary task is shown at the top of Figure 1. The auxiliary model consists of a left-to-right GRU, $\overrightarrow{\mathrm{GRU}}_{\mathrm{aux}}$, a right-to-left GRU, $\overleftarrow{\mathrm{GRU}}_{\mathrm{aux}}$, and a fully-connected layer $\mathrm{FC}_{\mathrm{aux}}$. The auxiliary model is trained with tweets that are accompanied by whole-tweet sentiment labels from $\{-1, 0, +1\}$. First, $\overrightarrow{\mathrm{GRU}}_{\mathrm{aux}}$ and $\overleftarrow{\mathrm{GRU}}_{\mathrm{aux}}$ are applied over the entire tweet (positions $1, \ldots, N$). At every token position $i$, we construct the concatenation
$$h_i = \left[\overrightarrow{\mathrm{GRU}}_{\mathrm{aux}}[i];\ \overleftarrow{\mathrm{GRU}}_{\mathrm{aux}}[i]\right]. \tag{1}$$
These are then averaged over positions to get a fixed-size pooled representation $\bar{h}$ of the whole tweet:
$$\bar{h} = \frac{1}{N} \sum_{i=1}^{N} h_i. \tag{2}$$
Average pooling lets the auxiliary model learn useful features at all positions of the tweet. This helps the primary task, as the target entity of the primary task can be at any position. The whole-tweet prediction output is $\mathrm{SoftMax}(\mathrm{FC}_{\mathrm{aux}}(\bar{h}))$. Again, cross-entropy loss is used.
Fig. 1: MTTDSC network architecture. Passage-level gold labels are used to compute loss in the upper auxiliary network. Target-level gold labels are used to compute loss in the lower main network. These are coupled through tied parameters in auxiliary GRUs. The main task uses another set of task-specific GRUs.
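The pooling and prediction steps of the auxiliary task can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-position GRU states are taken as given lists of floats, and the names `softmax`, `mean_pool`, `aux_predict`, as well as the weight matrix `W` and bias `b` standing in for the fully-connected layer, are introduced here for illustration only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mean_pool(states):
    """Average a list of equal-length state vectors over positions."""
    n = len(states)
    dim = len(states[0])
    return [sum(h[d] for h in states) / n for d in range(dim)]

def aux_predict(fwd_states, bwd_states, W, b):
    """Concatenate forward/backward states per position, pool, classify.

    fwd_states[i], bwd_states[i]: hypothetical GRU outputs at position i.
    W (classes x 2D) and b (classes): hypothetical fully-connected layer.
    """
    concat = [f + g for f, g in zip(fwd_states, bwd_states)]  # list concatenation
    pooled = mean_pool(concat)
    logits = [sum(w * x for w, x in zip(row, pooled)) + bias
              for row, bias in zip(W, b)]
    return pooled, softmax(logits)

# Toy example: N = 3 positions, state size D = 2, 3 classes.
fwd = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
bwd = [[0.0, 0.0], [3.0, 3.0], [0.0, 3.0]]
W = [[0.1, 0.0, 0.0, 0.0], [0.0, 0.1, 0.0, 0.0], [0.0, 0.0, 0.1, 0.1]]
b = [0.0, 0.0, 0.0]
pooled, probs = aux_predict(fwd, bwd, W, b)
```

Because every position contributes equally to the pooled vector, the loss gradient reaches the GRU states at all positions, which is the property the main task relies on when the target can appear anywhere.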
3.2 Main task
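Beyond the auxiliary model components, the main task uses a left-to-right GRU, $\overrightarrow{\mathrm{GRU}}_{\mathrm{main}}$, a right-to-left GRU, $\overleftarrow{\mathrm{GRU}}_{\mathrm{main}}$, and a fully connected layer $\mathrm{FC}_{\mathrm{main}}$ as model components.
Let the target entity be at token position $i$. $\overrightarrow{\mathrm{GRU}}_{\mathrm{aux}}$ and $\overrightarrow{\mathrm{GRU}}_{\mathrm{main}}$ are run over positions $1, \ldots, i-1$. $\overleftarrow{\mathrm{GRU}}_{\mathrm{aux}}$ and $\overleftarrow{\mathrm{GRU}}_{\mathrm{main}}$ are run over positions $i+1, \ldots, N$. The four resulting state vectors $\overrightarrow{\mathrm{GRU}}_{\mathrm{aux}}[i-1]$, $\overleftarrow{\mathrm{GRU}}_{\mathrm{aux}}[i+1]$, $\overrightarrow{\mathrm{GRU}}_{\mathrm{main}}[i-1]$, and $\overleftarrow{\mathrm{GRU}}_{\mathrm{main}}[i+1]$ are concatenated into a vector in $\mathbb{R}^{4D}$ (with $D$ the state size of each GRU), which is input into the fully-connected layer followed by a softmax:
$$\mathrm{SoftMax}\left(\mathrm{FC}_{\mathrm{main}}\left(\left[\overrightarrow{\mathrm{GRU}}_{\mathrm{aux}}[i-1];\ \overrightarrow{\mathrm{GRU}}_{\mathrm{main}}[i-1];\ \overleftarrow{\mathrm{GRU}}_{\mathrm{aux}}[i+1];\ \overleftarrow{\mathrm{GRU}}_{\mathrm{main}}[i+1]\right]\right)\right). \tag{3}$$
Our network for the situation where the auxiliary and main tasks do not share instances is shown in Figure 1.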
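How the input of the main classifier is assembled can be sketched as follows. This is an illustrative sketch only: the function name `target_features` is hypothetical, positions are 0-indexed, the GRU states are taken as precomputed lists, and the use of a zero vector when the target sits at the first or last position is an assumption, since the text does not say how that case is handled.

```python
def target_features(fwd_aux, bwd_aux, fwd_main, bwd_main, i):
    """Build the 4D-dimensional input of the main classifier.

    Each argument except i is a list of per-position state vectors
    (0-indexed). Forward states at i-1 summarize the left context of the
    target; backward states at i+1 summarize its right context.
    Assumption: a zero vector replaces a missing left or right context.
    """
    n = len(fwd_aux)
    zero = [0.0] * len(fwd_aux[0])

    def left(states):
        return states[i - 1] if i > 0 else zero

    def right(states):
        return states[i + 1] if i < n - 1 else zero

    return left(fwd_aux) + left(fwd_main) + right(bwd_aux) + right(bwd_main)

# Toy example: N = 3 positions, state size D = 1, target at position 1.
fwd_aux = [[1.0], [2.0], [3.0]]
bwd_aux = [[4.0], [5.0], [6.0]]
fwd_main = [[7.0], [8.0], [9.0]]
bwd_main = [[10.0], [11.0], [12.0]]
features = target_features(fwd_aux, bwd_aux, fwd_main, bwd_main, 1)
edge_features = target_features(fwd_aux, bwd_aux, fwd_main, bwd_main, 0)
```

The auxiliary states enter the main classifier only at the two positions adjacent to the target, so the main task reads the auxiliary representation exactly where it is needed.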
3.3 Training the tasks
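Suppose the auxiliary task has instances $\{(x_i, y_i) : i = 1, \ldots, A\}$ and the main task has instances $\{(x'_j, y'_j) : j = 1, \ldots, M\}$. Let $\Theta_{\mathrm{GRU}}$ be all the GRU model parameters in $\{\overrightarrow{\mathrm{GRU}}_{\mathrm{aux}}, \overleftarrow{\mathrm{GRU}}_{\mathrm{aux}}, \overrightarrow{\mathrm{GRU}}_{\mathrm{main}}, \overleftarrow{\mathrm{GRU}}_{\mathrm{main}}\}$. Then our overall loss objective is
$$\min_{\Theta_{\mathrm{GRU}},\, \mathrm{FC}_{\mathrm{aux}},\, \mathrm{FC}_{\mathrm{main}}} \ \sum_{i=1}^{A} \mathrm{loss}_{\mathrm{aux}}(x_i, y_i) + \sum_{j=1}^{M} \mathrm{loss}_{\mathrm{main}}(x'_j, y'_j). \tag{4}$$
Standard cross-entropy is used for both $\mathrm{loss}_{\mathrm{aux}}$ and $\mathrm{loss}_{\mathrm{main}}$. Before training the full objective above, we pre-train only the auxiliary task for one epoch. The situation where instances may be shared between the auxiliary and main tasks is similar, except that GRU cells are now directly shared between auxiliary and main tasks. We anticipate this multi-task setup to do better than, say, fine-tuning word embeddings in TDLSTM, because the auxiliary task is better related to the main task than unsupervised word embeddings. By the same token, we do not necessarily expect our auxiliary learner to outperform more direct approaches for the auxiliary task; its goal is to supply better word/span representations to the main task.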
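A combined objective of this kind can be sketched as follows. The sketch assumes that the two cross-entropy terms are simply added without a weighting factor, which the surviving text does not confirm, and it takes the predicted class distributions as given; the names `cross_entropy` and `joint_loss` are illustrative, not from the paper.

```python
import math

def cross_entropy(probs, gold_index):
    """Cross-entropy of a predicted distribution against a one-hot gold label."""
    return -math.log(probs[gold_index])

def joint_loss(aux_batch, main_batch):
    """Auxiliary plus main cross-entropy.

    Each batch is a list of (predicted_distribution, gold_class_index).
    Assumption: the two terms are added with equal weight.
    """
    aux = sum(cross_entropy(p, y) for p, y in aux_batch)
    main = sum(cross_entropy(p, y) for p, y in main_batch)
    return aux + main

# Toy example: one auxiliary instance and one main instance.
aux_batch = [([0.5, 0.5, 0.0], 0)]
main_batch = [([0.25, 0.5, 0.25], 1)]
total = joint_loss(aux_batch, main_batch)
```

Because the two batches are independent lists, the same function covers the case where the auxiliary and main tasks are trained on disjoint passages.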
3.4 Implementation details
GRU instead of LSTM: We used GRUs instead of LSTMs, which are more common in prior work. GRUs have fewer parameters and are less prone to over-fitting. In fact, our TDGRU replacement performs better than TDLSTM (Tables 1 and 2).
Hyperparameters: We set the hidden unit size of each GRU network to 64. The recurrent dropout probability of the GRU is 0.2. We also used a dropout of 0.2 before the last fully connected layer. For training the models we used the Adam optimizer with a learning rate of 0.001, $\beta_1 = 0.9$, and $\beta_2 = 0.999$. We used a mini-batch size of 64.
Ensemble: While reimplementing and/or running baseline system codes, we saw large variance in test accuracy scores for random initializations of the network weights. We improved the robustness of our networks by using an ensemble of the same model trained on the complete dataset with different weight initializations. The output class scores of the final model are the average of the probabilities returned by members of the ensemble. For a fair comparison, we also use the same ensembling for all our baselines.
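The averaging step of the ensemble can be illustrated with a short sketch. The function name `ensemble_average` and the toy probabilities are hypothetical; only the idea of averaging per-class probabilities across members comes from the text.

```python
def ensemble_average(member_probs):
    """Average class-probability vectors returned by ensemble members.

    member_probs: one probability vector per member, all of equal length.
    """
    k = len(member_probs)
    n_classes = len(member_probs[0])
    return [sum(p[c] for p in member_probs) / k for c in range(n_classes)]

# Toy example: three members, three classes (negative, neutral, positive).
members = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.7, 0.2],
]
avg = ensemble_average(members)
predicted_class = avg.index(max(avg))
```

In this toy case the first member alone would predict the negative class, while the averaged distribution favors the neutral class, which shows how averaging dampens the effect of a single unlucky initialization.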
Word embeddings: MTTDSC, TDLSTM and TCLSTM use GloVe embeddings [17] trained on the Twitter corpus.
We summarize datasets, competing approaches, and accuracy measures, followed by a detailed performance comparison and analysis.
4.1 Datasets for auxiliary SC task
Go [7]: This is a whole-passage SC dataset, containing 1.6M tweets automatically annotated using emoticons, highlighting that SC labeling can be easier to acquire at large scale. It has only positive and negative classes.
Sanders [19]: The second dataset is provided by Sanders Analytics and has 5,513 tweets over all 3 classes. These are manually annotated.
4.2 Datasets for main TDSC task
Dong [5]: Target entities are marked in tweets (one target entity per tweet), and one of three sentiment labels is manually associated with each target. The training and test folds contain 6,248 and 692 instances respectively. The class distribution is 25%, 50% and 25% for negative, neutral and positive respectively.
Election1 [22]: Derived from tweets about the recent UK election, this dataset contains 3,210 training tweets that mention 9,912 target entities and 867 testing tweets that mention 2,675 target entities. The class distribution is 45.3%, 36.5%, and 17.7% for negative, neutral and positive respectively, which is highly unbalanced. There are an average of 3.16 target entities per tweet.
Election2: In this paper, we introduce a new TDSC dataset, also based on UK election tweets. We first curated a list of candidate hashtags related to the UK General Elections, such as #GE2017, #GeneralElection and #VoteLabour. The collection was done during a period of 12 days, from June 2, 2017 through June 14, 2017. After removing retweets and duplicates (tweets with the same text), we ended up with 563,812 tweets. After running the named entity tagger, we observed that 158,978 tweets (28.19%) had at least one named entity, 38,809 tweets (6.88%) had at least two named entities and the remaining 7,992 tweets (1.42%) had three or more named entities. We took all the tweets which had at least two named entities, and randomly sampled an equal number of tweets from the set of tweets which had only one named entity.
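The final sampling step can be sketched as below. This is a hedged illustration, not the authors' pipeline: it assumes each tweet already carries a named-entity count from some tagger, and the function name `balanced_sample`, the `seed` parameter, and the toy tweets are all hypothetical.

```python
import random

def balanced_sample(tweets, seed=0):
    """Keep all multi-entity tweets plus an equal-sized random sample of
    single-entity tweets.

    tweets: list of (text, entity_count) pairs; entity counts are assumed
    to come from a named entity tagger run beforehand.
    """
    multi = [t for t in tweets if t[1] >= 2]
    single = [t for t in tweets if t[1] == 1]
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    sampled = rng.sample(single, min(len(multi), len(single)))
    return multi + sampled

# Toy example with made-up tweets.
tweets = [("a", 1), ("b", 2), ("c", 1), ("d", 3), ("e", 1), ("f", 0)]
selected = balanced_sample(tweets)
```

Tweets without any named entity are dropped, since they offer no target for TDSC annotation.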
4.3 Details of performance measures
Past TDSC work reports on 0/1 accuracy and macro-averaged $F_1$ scores, and we do so too, for 3-class $\{-1, 0, +1\}$ instances and 2-class $\{-1, +1\}$ subsets. However, SC is fundamentally a regression or ordinal regression task; e.g., $\mathrm{loss}(-1, +1) > \mathrm{loss}(-1, 0) > \mathrm{loss}(-1, -1) = 0$. Evaluating ordinal regression in the face of class imbalance can be tricky [1]. In addition, the system label may be discrete from $\{-1, 0, +1\}$ or continuous in $[-1, +1]$. Therefore we report on two additional performance measures. Let $(x_i, y_i, \hat{y}_i)$ be the $i$th of $I$ instances, comprised of a tweet, gold label, and system-estimated label.
Mean absolute error (MAE): It is defined as $(1/I) \sum_{i=1}^{I} |y_i - \hat{y}_i|$. Down-stream applications that use the numerical values of $\hat{y}$ will want MAE to be small.
Closely related to the area under the curve (AUC) for 2-class problems, PIR is widely used in Information Retrieval [10].
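The two additional measures can be sketched in a few lines. MAE follows the definition given in the text. For PIR, the sketch assumes it denotes a pair inversion rate: among instance pairs whose gold labels differ, the fraction whose system scores are ordered the opposite way. That reading, the treatment of tied system scores (not counted as inversions), and the function names are assumptions made here for illustration.

```python
def mean_absolute_error(gold, pred):
    """Average absolute difference between gold and system labels."""
    return sum(abs(y - p) for y, p in zip(gold, pred)) / len(gold)

def pair_inversion_rate(gold, pred):
    """Fraction of gold-ordered pairs that the system orders the other way.

    Assumed definition: a pair (i, j) with gold[i] > gold[j] is inverted
    when pred[i] < pred[j]; ties in pred are not counted as inversions.
    """
    eligible = 0
    inverted = 0
    for i in range(len(gold)):
        for j in range(len(gold)):
            if gold[i] > gold[j]:
                eligible += 1
                if pred[i] < pred[j]:
                    inverted += 1
    return inverted / eligible if eligible else 0.0

# Toy example: gold labels in {-1, 0, +1}, continuous system scores.
gold = [1, 0, -1, 1]
pred = [0.8, -0.2, 0.1, -0.5]
mae = mean_absolute_error(gold, pred)
pir = pair_inversion_rate(gold, pred)
```

Both measures are better when smaller, and both accept continuous system outputs in $[-1, +1]$ without thresholding.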
Table 1: Performance of various methods on Dong dataset: (a) overall, (b) class-wise. MTTDSC beats other baselines across diverse performance measures.
4.4 Various methods and their performance
Table 1 shows aggregated and per-class accuracy for the competing methods. It has three groups of rows. The first group includes methods that use no or minimal target-specific processing. The second group includes the best-known recent target-dependent methods. The third group includes our methods and their variants, to help elucidate and justify the merits of our design.
Target independent baselines: In the first block, LSTM means a whole-tweet LSTM was applied, followed by a linear SVM on the final state. Target-ind [9] pools embedding features from the entire tweet. Target-dep+ extends Target-dep by identifying sentiment-revealing context features with the help of multiple lexicons (such as SentiWordNet).
Prior target-dependent baselines: The second block shows more competitive target-aware TDSC approaches. TDLSTM and TCLSTM are from Tang et al. [20]. Naive-Seg+ segments the tweet using punctuation. Word vectors in each segment are pooled to give a segment embedding. Additional features are generated from the left and right contexts based on multiple sentiment lexicons. TDParse [22] uses a syntactic parse to pool embeddings of words connected to the target. TDParse+ extends TDParse by adding features from sentiment lexicons. TDParse+(m) considers the presence of the same target multiple times in the tweet. Feature vectors generated from multiple target positions are merged using pooling functions.
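The segment-and-pool idea behind the Naive-Seg family can be sketched as follows. This is a simplified illustration and not the implementation of Wang et al. [22]: the punctuation set, the choice of mean pooling within a segment, the max and mean pooling across segments, and the toy embedding table are all assumptions made here.

```python
import re

def segment(text):
    """Split a tweet into word lists at punctuation marks."""
    parts = re.split(r"[.,;:!?]+", text)
    return [p.split() for p in parts if p.strip()]

def mean_vec(vectors):
    """Element-wise mean of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def max_vec(vectors):
    """Element-wise maximum of equal-length vectors."""
    return [max(col) for col in zip(*vectors)]

def naive_seg_features(text, emb):
    """Pool word vectors per segment, then pool the segment embeddings.

    emb: hypothetical word -> vector table; unknown words are skipped.
    Returns max-pooled and mean-pooled segment embeddings, concatenated.
    """
    seg_embs = []
    for words in segment(text):
        vecs = [emb[w] for w in words if w in emb]
        if vecs:
            seg_embs.append(mean_vec(vecs))
    if not seg_embs:
        raise ValueError("no known words in text")
    return max_vec(seg_embs) + mean_vec(seg_embs)

# Toy example with a made-up 2-dimensional embedding table.
emb = {"good": [1.0, 0.0], "bad": [-1.0, 0.0], "movie": [0.0, 1.0]}
feats = naive_seg_features("good movie, bad ending!", emb)
```

The resulting fixed-length vector could then be passed to a classifier such as an SVM, as the text describes.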
MTTDSC and variations: The third block shows MTTDSC and some variations. TDGRU replaces the two LSTMs of TDLSTM with two GRUs [4], which have fewer parameters but perform slightly better than LSTMs. In TDGRU+SVM, we first train the TDGRU model. Then, at entity position $i$, we extract the features $[\overrightarrow{\mathrm{GRU}}[i-1];\ \overleftarrow{\mathrm{GRU}}[i+1]]$ of the two GRU models and train an SVM with RBF kernel with the extracted features. TDGRU with SVM is expected to perform better due to the non-linear nature of the features at the penultimate layer, which the SVM can then recognize without overfitting problems. TD naive MTL is similar to MTTDSC, but, rather than having separate GRUs for primary and auxiliary tasks, shared $\overrightarrow{\mathrm{GRU}}$ and $\overleftarrow{\mathrm{GRU}}$ are used for both tasks. The tasks are trained jointly as in MTTDSC. In TDFT, we first train $\overrightarrow{\mathrm{GRU}}_{\mathrm{aux}}$, $\overleftarrow{\mathrm{GRU}}_{\mathrm{aux}}$ and $\mathrm{FC}_{\mathrm{aux}}$ on the auxiliary whole-passage SC task. We then use the weights of $\overrightarrow{\mathrm{GRU}}_{\mathrm{aux}}$ and $\overleftarrow{\mathrm{GRU}}_{\mathrm{aux}}$ learnt by the auxiliary task in TDGRU and train it on TDSC with a new $\mathrm{FC}_{\mathrm{main}}$.
Table 2: Performance of various methods on our Election2 dataset: (a) overall, (b) class-wise. The extreme label skew makes it easy for simpler algorithms to do well at MAE and PIR, although MTTDSC still leads in traditional measures, and recognizes neutral content better.
Observations: Table 1 shows that MTTDSC outperforms all the baselines across all the measures on the Dong dataset. MTTDSC achieves 2.8%, 3.41%, 7.14% and 24.5% relative improvements in accuracy, $F_1$, MAE and PIR respectively over TDParse+(m) (best model by Wang et al. [22]). The improvement in 2-class $F_1$ is also substantial (5.1%). MTTDSC maintains a better balance between precision, recall and $F_1$ across the three classes (Table 1(b)).
TDFT improves on TDLSTM and TDGRU because it learns important features during pre-training. TDFT is better than TD naive MTL; jointly training the latter results in auxiliary loss prevailing over primary loss. TDFT also loses to MTTDSC, because, in fine tuning, the primary task training makes the model forget some auxiliary features critical for the primary task. Summarizing, MTTDSC’s gains are not explained by the large volume of auxiliary data alone; good network design is critical.
Table 2 shows results for Election2. The trend is preserved, with our gains in macro-$F_1$ being more noticeable than micro-accuracy. This is expected given the label skew. Table 3 shows similar behavior on the Election1 dataset. Although TDPWindow-12 is slightly better for 0/1 accuracy, MTTDSC achieves 4.97% and 6.77% larger $F_1$ score for 3-class and 2-class sentiment classification respectively.
Table 3: Performance comparison on Election1 dataset. The results of the baselines are taken from Table 3 of Wang et al. [22] where TDPWindow-12 (which extracts features exactly like TDParse+, but limits the size of left and right contexts to 12 tokens) was reported as the best model. To save space, we report the accuracy w.r.t. only three measures. The broad trends are similar to Election2.
4.5 Side-by-side diagnostics and anecdotes
Given their related architectures, we picked TDLSTM and MTTDSC, and focused on instances where MTTDSC performed better than TDLSTM, to tease out the reasons for the improvement.
Word-level sensitivity analysis: For each word in the context of the target entity, we replaced the word with UNK (unknown word) and noted the drops in scores of labels $+1$ and $-1$. A large drop in the score of label $+1$ means the word was regarded as strongly positive, and a large drop in the score of label $-1$ means the word was regarded as strongly negative. We use these scores to color-code context words in the form of a heatmap. Table 4 shows the positive and the negative words highlighted according to their sensitivity scores. The words highlighted in green contribute to the positive label and the words highlighted in red contribute to the negative label. In the first row, MTTDSC correctly identifies funny as a positive word, whereas TDLSTM considers funny to be a negative word. TDLSTM also finds stronger negative polarity in neutral words like the and covering. In the second row, MTTDSC correctly identifies hilarious as a positive word, whereas TDLSTM finds hilarious strongly negative. In the third row, MTTDSC finds hilarious positive, whereas TDLSTM misses the signal. Although TDLSTM correctly identifies more positive words in the fourth row than MTTDSC, it also incorrectly identifies negative words like randomly and people, leading to an overall incorrect neutral prediction. The examples show that TDLSTM either misses or misclassifies crucial emotive, polarized context words.
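The masking procedure can be sketched as follows. The sensitivity loop mirrors the description above; the scoring function `toy_score` is a made-up stand-in for a trained TDSC model (a tiny lexicon followed by a softmax), and all names and word lists are hypothetical.

```python
import math

POSITIVE = {"funny", "hilarious"}  # hypothetical toy lexicon
NEGATIVE = {"boring"}

def toy_score(tokens):
    """Stand-in for a trained model: returns a score per label in {-1, 0, +1}."""
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    logits = {-1: float(neg), 0: 0.5, +1: float(pos)}
    z = sum(math.exp(v) for v in logits.values())
    return {label: math.exp(v) / z for label, v in logits.items()}

def word_sensitivity(tokens, score_fn, unk="UNK"):
    """For each word, report the drop in the +1 and -1 scores when the
    word is replaced by the unknown-word token."""
    base = score_fn(tokens)
    result = []
    for k, word in enumerate(tokens):
        masked = tokens[:k] + [unk] + tokens[k + 1:]
        scores = score_fn(masked)
        result.append((word, base[+1] - scores[+1], base[-1] - scores[-1]))
    return result

# Toy example: context words around some target.
sensitivity = word_sensitivity(["the", "show", "was", "funny"], toy_score)
```

A large positive value in the second field marks a word the model treats as positive, and a large positive value in the third field marks a word it treats as negative; these values can be mapped to green and red intensities for a heatmap.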
Table 4: Word sensitivity studies. (Must be viewed in color.) Green words are regarded as positive and red words are regarded as negative by the respective RNNs. Intensity of color roughly represents magnitude of sensitivity. TDLSTM makes mistakes in estimating the polarity of words independent of context, which lead to incorrect predictions. Assisted by the auxiliary task, MTTDSC avoids such mistakes.
We presented MTTDSC, a multi-task system for target-dependent sentiment classification. By exploiting the easier auxiliary task of whole-passage sentiment classification, MTTDSC improves on recent TDSC baselines. The auxiliary GRU learns to identify corpus-specific, position-independent sentiment in words and phrases, whereas the main GRU learns how to associate these sentiments with designated targets. We tested our model on three benchmark datasets, of which we introduce one here, and obtained clear gains in accuracy compared to many state-of-the-art models.
The project was partially supported by IBM, Early Career Research Award (SERB, India), and the Center for AI, IIIT Delhi, India.
1. S. Baccianella, A. Esuli, and F. Sebastiani. Evaluation measures for ordinal regression. In Intelligent Systems Design and Applications, pages 283–287, Pisa, Italy, 2009.
2. Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, pages 41–48, Montreal, Canada, 2009.
3. E. Choi, D. Hewlett, J. Uszkoreit, I. Polosukhin, A. Lacoste, and J. Berant. Coarse-to-fine question answering for long documents. In ACL, pages 209–220, 2017.
4. J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
5. L. Dong, F. Wei, C. Tan, D. Tang, M. Zhou, and K. Xu. Adaptive recursive neural network for target-dependent Twitter sentiment classification. In ACL, pages 49–54, Baltimore, Maryland, USA, 2014.
6. A. Esuli and F. Sebastiani. SentiWordNet: A publicly available lexical resource for opinion mining. In LREC, pages 417–422, 2006.
7. A. Go, R. Bhayani, and L. Huang. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009):12, 2009.
8. N. Godbole, M. Srinivasaiah, and S. Skiena. Large-scale sentiment analysis for news and blogs. In ICWSM, pages 219–222, 2007.
9. L. Jiang, M. Yu, M. Zhou, X. Liu, and T. Zhao. Target-dependent twitter sentiment classification. In ACL, pages 151–160, Portland, Oregon, USA, 2011.
10. T. Joachims. Optimizing search engines using clickthrough data. In SIGKDD Conference, pages 133–142. ACM, 2002.
11. Y. Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.
12. B. Liu. Sentiment analysis and opinion mining. Synthesis lectures on human language technologies, 5(1):1–167, 2012.
13. A. Maurer, M. Pontil, and B. Romera-Paredes. The benefit of multitask representation learning. Journal of Machine Learning Research, 17(81):1–32, 2016.
14. A. Pak and P. Paroubek. Twitter as a corpus for sentiment analysis and opinion mining. In LREC, pages 1–3, Valletta, Malta, 2010.
15. B. Pang, L. Lee, et al. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1–2):1–135, 2008.
16. H. Peng, S. Thomson, and N. A. Smith. Deep multitask learning for semantic dependency parsing. arXiv preprint arXiv:1704.06855, 2017.
17. J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In EMNLP Conference, volume 14, pages 1532–1543, 2014.
18. S. Ruder, J. Bingel, I. Augenstein, and A. Søgaard. Learning what to share between loosely related tasks. arXiv preprint arXiv:1705.08142, 2017.
19. N. Sanders. Twitter sentiment corpus, 2011.
20. D. Tang, B. Qin, X. Feng, and T. Liu. Effective LSTMs for target-dependent sentiment classification. arXiv preprint arXiv:1512.01100, 2015.
21. Z. Teng, D.-T. Vo, and Y. Zhang. Context-sensitive lexicon features for neural sentiment analysis. In EMNLP, pages 1629–1638, Austin, Texas, USA, 2016.
22. B. Wang, M. Liakata, A. Zubiaga, and R. Procter. TDParse: Multi-target-specific sentiment recognition on Twitter. In EACL, pages 483–493, 2017.
23. T. Wilson, J. Wiebe, and P. Hoffmann. Recognizing contextual polarity in phrase-level sentiment analysis. In EMNLP, pages 347–354, Vancouver, B.C., Canada, 2005.