As society becomes increasingly digitized, algorithms for text analysis and generation can serve a variety of purposes and may greatly reduce manual work. A system for robust manipulation of global text properties, e.g. sentiment, is one such algorithm that could change how we work with text. Though the main purpose of a text might be to communicate a concrete message, there are infinitely many ways the message can be phrased, each with its own set of global properties. In this paper we focus on sentiment and note that robust control over it would open up a range of new possibilities, such as A/B testing of different instantiations of a message with respect to some desired measure, and personalized communication automatically adapted to the receiver’s profile. Further, the ability to generate new sentences with transformed sentiment could be useful for data augmentation when the available data is scarce.
Recent work in text generation (Hu et al., 2017; Radford et al., 2017) has shown that it is possible to generate random sentences where the sentiment can be chosen as an input parameter. This line of research is related to the problem we address in this paper, with the key difference that while they generate new random sentences, we aim to transform existing sentences. This makes the problem more difficult but also more relevant to real-world applications, as shown by the recent work of Mueller et al. (2017).
In the visual domain there has recently been a range of work that aims to transform an input image to fit different aspects, e.g. to look like a painting (Gatys et al., 2015). The method presented by Gardner et al. (2015) maps an image to a deep feature space using a convolutional neural network (CNN). This space is then traversed towards the target features. A new image is subsequently reconstructed from the deep feature representation, with some aspect changed from the original image. In their experiments they show that this can be used to transform a smiling portrait into an angry one and to make one individual look more like someone else without changing clothing or background. The method we present in this paper is loosely based on their model, but with significant changes due to the discrete nature of language.
The main contributions of this work include: (1) an algorithm that can automatically transform the sentiment of a text while leaving the semantic content largely intact, and (2) preliminary qualitative analysis of the performance with regard to (a) resulting sentiment, (b) semantic stability and (c) acceptability of the transformed text.
The maximum mean discrepancy (MMD) (Gretton et al., 2012) is a test statistic used to determine whether two distributions are the same. Given two distributions, $P^s$ and $P^t$, the objective of the MMD is to find a smooth function which is large for samples from $P^s$ and small for samples from $P^t$. Given such a function, the MMD is the difference between the mean function values for the two sets of samples, which can be empirically estimated as

$$\mathrm{MMD}(F, X, Y) = \sup_{f \in F} \left( \frac{1}{m} \sum_{i=1}^{m} f(x_i) - \frac{1}{n} \sum_{j=1}^{n} f(y_j) \right) \qquad (1)$$

where $X = \{x_1, \dots, x_m\}$ are samples drawn from the source distribution $P^s$ and $Y = \{y_1, \dots, y_n\}$ are samples drawn from the target distribution $P^t$. The function $f$ belongs to a class, $F$, of smooth functions and should be chosen so as to maximize the difference between the mean values of $f$ applied to $X$ and $Y$. In both (Gretton et al., 2012) and (Gardner et al., 2015), $F$ is a reproducing kernel Hilbert space, allowing comparison of multi-dimensional feature vectors. The function $f^*$ attaining the supremum in equation (1) can be empirically estimated as

$$f^*(z) = \frac{1}{m} \sum_{i=1}^{m} k(x_i, z) - \frac{1}{n} \sum_{j=1}^{n} k(y_j, z) \qquad (2)$$

where $k(\cdot, \cdot)$ is a kernel function. The method presented by Gardner et al. (2015) uses a Gaussian kernel function, $k(x, x') = \exp\left(-\lVert x - x' \rVert^2 / (2\sigma^2)\right)$, with $\sigma$ being the kernel bandwidth.
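To make the two estimates above concrete, the following is a minimal sketch in plain Python, assuming feature vectors are lists of floats. The function names (`gaussian_kernel`, `witness`, `mmd_squared_biased`) and the toy data are illustrative and not taken from the paper. Because the witness function here is not normalized to unit norm, plugging it into the difference of sample means gives the biased estimate of the squared MMD.

```python
import math

def gaussian_kernel(a, b, sigma=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def witness(z, source, target, sigma=1.0):
    """Mean kernel similarity of z to the source samples minus
    its mean kernel similarity to the target samples."""
    mean_src = sum(gaussian_kernel(x, z, sigma) for x in source) / len(source)
    mean_tgt = sum(gaussian_kernel(y, z, sigma) for y in target) / len(target)
    return mean_src - mean_tgt

def mmd_squared_biased(source, target, sigma=1.0):
    """Difference between the mean witness values on the two sample sets."""
    on_src = sum(witness(x, source, target, sigma) for x in source) / len(source)
    on_tgt = sum(witness(y, source, target, sigma) for y in target) / len(target)
    return on_src - on_tgt

# Toy 2-D samples standing in for two well-separated feature distributions.
source = [[0.0, 0.0], [0.2, -0.1], [-0.1, 0.1]]
target = [[3.0, 3.0], [3.1, 2.8], [2.9, 3.2]]
print(witness(source[0], source, target), witness(target[0], source, target))
print(mmd_squared_biased(source, target))
```

The witness value is positive near the source samples and negative near the target samples, which is the property the traversal later exploits.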
The problem we are addressing can be split into three different subtasks. The first task is representing sentences in a continuous space. The second task is exploiting the sentence representation and traversing the manifold in such a way that the sentiment changes. The third task is generating a sentence from the representation space. Our model uses a CNN for sentence encoding. The encoded vectors are subsequently traversed using the MMD statistic and finally decoded using a recurrent neural network (RNN).
3.1 Encoding sentences
A sentence is represented as a matrix where the rows correspond to the 300-dimensional word2vec (Mikolov et al., 2013) embeddings of the words in the sentence. This matrix is given as input to a CNN trained for binary sentiment classification. The network consists of one convolutional layer, one max-pooling layer and finally one fully connected feed-forward layer. The filter heights for the convolutional layer are 1, 2, 3 and 4, and the filter width is 300. With 75 filters per height, there are 300 filters in total, so the pooling layer outputs a 300-dimensional feature vector, denoted z. This feature vector is extracted from the CNN, along with the predicted label, and used as the encoding of the input sentence.
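The dimensions described above can be checked with a small sketch in plain Python. It assumes randomly initialized filters, random stand-in embeddings, no bias or activation function, and sentences of at least four words (shorter ones would need padding); none of these details are specified in the text, and the helper names are hypothetical.

```python
import random

def conv_max_pool(sentence, filters):
    """Slide each full-width filter over the word rows of `sentence`
    and keep the maximum response per filter (max-over-time pooling)."""
    features = []
    for filt in filters:                       # filt: h rows, each as wide as an embedding
        h = len(filt)
        responses = []
        for start in range(len(sentence) - h + 1):
            window = sentence[start:start + h]
            responses.append(sum(w * x
                                 for f_row, s_row in zip(filt, window)
                                 for w, x in zip(f_row, s_row)))
        features.append(max(responses))
    return features

def make_filters(heights, per_height, width, rng):
    """Random filters: `per_height` filters for every height in `heights`."""
    return [[[rng.uniform(-0.1, 0.1) for _ in range(width)] for _ in range(h)]
            for h in heights for _ in range(per_height)]

rng = random.Random(0)
EMB_DIM = 300                                  # word2vec dimensionality
filters = make_filters(heights=(1, 2, 3, 4), per_height=75, width=EMB_DIM, rng=rng)
sentence = [[rng.gauss(0, 1) for _ in range(EMB_DIM)] for _ in range(7)]  # 7 words
z = conv_max_pool(sentence, filters)
print(len(z))                                  # one feature per filter
```

Because each filter contributes exactly one pooled value, the length of the encoding depends only on the number of filters and not on the sentence length.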
In addition to classifying sentiment, the CNN needs to encode information about the topic and semantics of the sentence. Therefore, it is trained together with the RNN. Initially, the sentiment classification task is disregarded and the joint networks are trained for encoding and decoding unlabeled sentences. The loss for this task is the cross-entropy error between the predicted word, $\hat{w}_t$, at position $t$ in the generated sentence and the actual word, $w_t$, at the same position in the original sentence. After this initial training phase, the CNN is trained on binary sentiment classification. The classification loss is calculated as the cross-entropy error between the predicted label and the true label for each sentence. This loss is added to the text generation loss, producing a total loss which is used to update the weights in both networks. A schematic of the training procedure is shown in figure 1.

Figure 1: During training, the CNN and RNN are updated using the unweighted sum of the loss for sentiment classification and for text generation.

Figure 2: Different icons distinguish feature vectors by sentiment and topic. Bold faced points are examples of original and traversed vectors.
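The training objective described in this section can be sketched as follows. The sketch assumes that the networks already output probability distributions (over the vocabulary at each position, and over the two sentiment labels) and that the generation loss is averaged over word positions; the averaging and all function names are assumptions made for illustration.

```python
import math

def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[target_index])

def generation_loss(step_probs, target_ids):
    """Mean cross-entropy over the word positions of one sentence."""
    losses = [cross_entropy(p, t) for p, t in zip(step_probs, target_ids)]
    return sum(losses) / len(losses)

def total_loss(step_probs, target_ids, label_probs, label):
    """Unweighted sum of the text generation and classification losses."""
    return generation_loss(step_probs, target_ids) + cross_entropy(label_probs, label)

# Toy example: a two-word sentence over a two-word vocabulary.
step_probs = [[0.5, 0.5], [0.25, 0.75]]   # predicted word distributions
target_ids = [0, 1]                        # indices of the actual words
label_probs = [0.5, 0.5]                   # predicted sentiment distribution
print(total_loss(step_probs, target_ids, label_probs, label=1))
```

During the initial phase only `generation_loss` would be used; the sum applies once sentiment labels are introduced.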
3.2 Traversal of the representation space
Since the CNN is trained on binary sentiment classification, two separable distributions of feature vectors are generated. The MMD statistic can be used to traverse a vector originating from one of these distributions to the other. The result of the traversal is a vector that resembles the encoding of a sentence with the opposite sentiment.
When moving the feature vector $z$ by minimizing equation (2), the semantics of the original sentence may be lost if $z$ is moved too far along the manifold. To control how far $z$ is moved from its original position, a budget of change (Gardner et al., 2015), $\lambda$, is used. A source and a target set of sentence representations are created. The source set, $Z^s = \{z^s_1, \dots, z^s_m\}$, contains feature vectors for sentences with the same sentiment as $z$, and the target set, $Z^t = \{z^t_1, \dots, z^t_n\}$, contains feature vectors for sentences with the opposite sentiment. From these sets and the original vector, a matrix $V = [z^s_1, \dots, z^s_m, z^t_1, \dots, z^t_n, z]$ is formed. The traversed feature vector can then be expressed as $z^* = z + V\delta$, where $\delta$ is a coefficient vector.

Equation (2) can now be written as

$$\delta^* = \arg\min_{\delta} \left( \frac{1}{m} \sum_{i=1}^{m} k(z^s_i, z + V\delta) - \frac{1}{n} \sum_{j=1}^{n} k(z^t_j, z + V\delta) + \lambda \lVert V\delta \rVert^2 \right)$$

where $z^* = z + V\delta^*$. The minimization over $\delta$ uses the BFGS algorithm (Battiti, 1990) and is constrained by the budget of change, enforced in the last term.
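The traversal can be illustrated with a small self-contained sketch. Note the substitutions: plain gradient descent replaces BFGS to avoid external dependencies, the budget term is assumed to be a squared-norm penalty on the displacement, and the step size, iteration count and toy 2-D data are arbitrary choices for illustration. All function names are hypothetical.

```python
import math

def kernel(a, b, sigma):
    """Gaussian kernel between two feature vectors."""
    return math.exp(-sum((p - q) ** 2 for p, q in zip(a, b)) / (2 * sigma ** 2))

def objective(z_star, z, source, target, lam, sigma):
    """Witness value at z_star plus the penalty lam * ||z_star - z||^2."""
    src = sum(kernel(x, z_star, sigma) for x in source) / len(source)
    tgt = sum(kernel(y, z_star, sigma) for y in target) / len(target)
    return src - tgt + lam * sum((a - b) ** 2 for a, b in zip(z_star, z))

def gradient(z_star, z, source, target, lam, sigma):
    """Gradient of the objective with respect to z_star."""
    grad = [2 * lam * (a - b) for a, b in zip(z_star, z)]
    for points, sign in ((source, 1.0), (target, -1.0)):
        for p in points:
            w = sign * kernel(p, z_star, sigma) / (len(points) * sigma ** 2)
            grad = [g + w * (pi - zi) for g, pi, zi in zip(grad, p, z_star)]
    return grad

def apply_delta(z, columns, delta):
    """Compute z + V delta, with V stored as a list of column vectors."""
    return [zi + sum(d * c[i] for d, c in zip(delta, columns))
            for i, zi in enumerate(z)]

def traverse(z, source, target, lam=0.05, sigma=1.0, lr=0.05, steps=500):
    columns = source + target + [z]            # columns of V
    delta = [0.0] * len(columns)
    for _ in range(steps):
        g = gradient(apply_delta(z, columns, delta), z, source, target, lam, sigma)
        # Chain rule: dJ/d(delta_j) = v_j . dJ/d(z_star)
        delta = [d - lr * sum(ci * gi for ci, gi in zip(c, g))
                 for d, c in zip(delta, columns)]
    return apply_delta(z, columns, delta)

source = [[0.0, 0.0], [0.2, -0.1], [-0.1, 0.1]]
target = [[2.0, 0.0], [2.1, -0.2], [1.9, 0.2]]
z = [0.1, 0.0]
z_star = traverse(z, source, target)
print(z_star)
```

A larger `lam` keeps the traversed vector closer to `z`; with this simple optimizer the step size `lr` then needs to be reduced accordingly to keep the updates stable.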
3.3 Decoding sentences
The traversed feature vector, $z^*$, is given as input to an RNN trained for generating text. In addition to $z^*$, the RNN receives a start-of-sentence token as input in the first time step. For each time step, the RNN outputs the most probable word and gives this word as input to the next time step. When the most probable word is an end-of-sentence token, the generation of words is terminated. The RNN consists of a single-layer GRU cell with a state size of 300. The weight matrix for the input consists of the 300-dimensional word2vec word embeddings for the words in the vocabulary.
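The decoding loop described above is a greedy search, which can be sketched independently of the network itself. In the sketch below, `toy_step` is a hypothetical stand-in for one GRU step, the traversed vector is assumed to enter as the initial state, and the `max_len` cap is a safety measure not mentioned in the text.

```python
def greedy_decode(step_fn, init_state, sos_id, eos_id, max_len=30):
    """Feed the most probable word back in until end-of-sentence appears."""
    state, token, output = init_state, sos_id, []
    for _ in range(max_len):                   # safety cap (assumption)
        scores, state = step_fn(token, state)
        token = max(range(len(scores)), key=lambda i: scores[i])
        if token == eos_id:
            break
        output.append(token)
    return output

VOCAB = ["<sos>", "<eos>", "good", "movie"]

def toy_step(token, state):
    """Hypothetical stand-in for one GRU step: follows a fixed script."""
    script = {0: 2, 2: 3, 3: 1}                # <sos> -> good -> movie -> <eos>
    scores = [0.0] * len(VOCAB)
    scores[script[token]] = 1.0
    return scores, state

word_ids = greedy_decode(toy_step, init_state=None, sos_id=0, eos_id=1)
print([VOCAB[i] for i in word_ids])
```

Greedy search commits to the single most probable word at every step, which matches the procedure described here; alternatives such as beam search are outside the scope of this sketch.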
The initial encoding and decoding training uses the large movie review dataset v1.0 (Maas et al., 2011), disregarding the labels. The networks are then trained on three sentiment-labelled data sets. The first set is the movie review sentence polarity data set v1.0 (Pang and Lee, 2005), which consists of 10 662 labelled movie-review sentences from www.rottentomatoes.com. The second set contains 500 reviews for cell phones and accessories from Amazon, 500 reviews for restaurants from Yelp and 500 movie reviews from IMDB (Kotzias et al., 2015). These two sets have equal numbers of positive and negative sentences. The third set is a subset of 923 positive and 1320 negative sentences from a data set containing product reviews from various online sources (Täckström and McDonald, 2011). The three data sets are randomly divided 90%-10% into a training and a test set. The training set is used for updating the weights of the networks during training and is divided into batches of 64 sentences. The test set is used for evaluating the accuracy of the networks periodically during training.
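As a small illustration of the split and batching, the sketch below applies a random 90%-10% division and batches of 64 to a list the size of the first data set. The fixed seed, the rounding rule and the handling of the final, smaller batch are assumptions; the text does not say whether the split is made per data set or on the combined data.

```python
import random

def split_and_batch(examples, test_fraction=0.1, batch_size=64, seed=0):
    """Shuffle, hold out a test fraction, and batch the remaining examples."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_test = int(round(len(items) * test_fraction))
    test, train = items[:n_test], items[n_test:]
    batches = [train[i:i + batch_size] for i in range(0, len(train), batch_size)]
    return batches, test

# Integer ids stand in for the 10 662 sentences of the polarity data set.
batches, test = split_and_batch(range(10662))
print(len(test), len(batches), len(batches[-1]))
```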
4.1 Preserving semantics
In order to evaluate whether the encodings from the CNN contain information about sentiment and semantics, feature vectors for the sentences with different sentiments and topics are visualized. These visualizations also serve as an aid for assessing whether the content in a sentence is preserved in the traversal. The feature vectors are reduced from 300 to 2 dimensions using principal component analysis (PCA) and the visualizations are made using the first two principal components.
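The projection used for these visualizations can be sketched without external libraries. The version below finds the first two principal components by power iteration with deflation, which is a simple substitute for a library PCA routine and is only meant to show the computation; the function name, the iteration count and the toy data are illustrative.

```python
import math
import random

def pca_2d(vectors, iters=200, seed=0):
    """Project vectors onto their first two principal components."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[i] for v in vectors) / n for i in range(d)]
    centered = [[v[i] - mean[i] for i in range(d)] for v in vectors]
    rng = random.Random(seed)
    components = []
    for _ in range(2):
        w = [rng.gauss(0, 1) for _ in range(d)]
        for _ in range(iters):
            # Remove the directions that have already been found.
            for c in components:
                dot = sum(a * b for a, b in zip(w, c))
                w = [a - dot * b for a, b in zip(w, c)]
            # Multiply by the (unnormalized) covariance matrix X^T X.
            proj = [sum(a * b for a, b in zip(row, w)) for row in centered]
            w = [sum(p * row[i] for p, row in zip(proj, centered)) for i in range(d)]
            norm = math.sqrt(sum(a * a for a in w))
            w = [a / norm for a in w]
        components.append(w)
    return [[sum(a * b for a, b in zip(row, c)) for c in components]
            for row in centered]

# Toy 3-D data whose variance lies almost entirely along the first axis.
data = [[float(i), 0.1 * (i % 3), 0.0] for i in range(10)]
points_2d = pca_2d(data)
print(points_2d[0], points_2d[-1])
```

Each input vector is mapped to a pair of coordinates that can be plotted directly; the sign of each component is arbitrary, so plots may appear mirrored between runs with different seeds.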
The topics chosen were sentences containing either the word phone or the word movie, because such sentences are likely to have little correlation, in contrast to, for example, sentences containing either comedy or drama. Negative sentences containing the word movie and positive sentences containing the word phone were traversed. The optimization of the MMD was set up with 90 positive examples and 90 negative examples for the source and target sets. The examples consisted of equal numbers of sentences containing the word movie and sentences containing the word phone. The topics of the sentences were not used for the traversal but were needed when visualizing the results.
The results are shown in figure 2. It is seen that a vector representing a positive sentence containing movie is moved so that the resulting vector lies within the cluster of negative sentences containing movie. In the same way, a vector representing a negative sentence containing phone is moved so that the resulting vector lies within the cluster of positive sentences containing phone. This behaviour suggests that the context and semantics may be preserved during the traversal.
Since the manifold traversal is made using two sets of examples, source and target feature vectors, the traversed feature vector will more closely resemble the sentences in the target set. This means that if we traverse the manifold for a sentence with a different topic than the sentences in the source and target sets, the traversed vector might not preserve the topic of the original sentence.
4.2 Analysis of transformed sentences
There exists no single correct output for the manifold traversal, e.g., given the negative sentence “The food did not taste well”, both sentences “The food was amazing” and “I liked the food” are valid outputs that reverse the sentiment. Therefore, scores and measures used for other NLP tasks, like BLEU (Papineni et al., 2002) for machine translation, are difficult to apply to the manifold traversal. Instead we focus on qualitative evaluation. The encoding-decoding, and the model as a whole, is evaluated by generating sentences from the feature vectors $z$ (representing the original sentence) and $z^*$ (the traversed vector) respectively. The generated sentences are manually compared to the original. Ideally, the sentence generated from $z$ should closely resemble the original sentence, while the sentence generated from $z^*$ should have the same context as the original sentence but the opposite sentiment. Table 1 shows some of the better examples of sentences generated by the trained RNN. The overall impression is that, despite poor grammar, the model works well in terms of changing sentiment. We see that the generated sentences have the same topic as the original and that they are composed mostly of the same words. It is also found that shorter sentences are more easily encoded and decoded.
An algorithm for sentiment manipulation was presented and evaluated. Visualizations of the embedding space indicate that sentence representations can be moved such that the sentiment changes while the semantics is preserved. Further, examination of generated sentences from manipulated embeddings confirmed that the sentiment had changed while the semantics and acceptability had stayed largely constant.
The authors would like to acknowledge the project Towards a knowledge-based culturomics supported by a framework grant from the Swedish Research Council (2012–2016; dnr 2012-5738).
Roberto Battiti. 1990. Optimization methods for back-propagation: Automatic parameter tuning and faster convergence. In International Joint Conference on Neural Networks. volume 1, pages 593–596.
Jacob R. Gardner, Matt J. Kusner, Yixuan Li, Paul Upchurch, Kilian Q. Weinberger, and John E. Hopcroft. 2015. Deep manifold traversal: Changing labels with convolutional features. CoRR abs/1511.06421.
Leon A Gatys, Alexander S Ecker, and Matthias Bethge. 2015. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576.
Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. 2012. A kernel two-sample test. Journal of Machine Learning Research 13(Mar):723–773.
Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. 2017. Controllable text generation. arXiv preprint arXiv:1703.00955.
Dimitrios Kotzias, Misha Denil, Nando de Freitas, and Padhraic Smyth. 2015. From group to individual labels using deep features. In KDD. ACM, pages 597–606.
Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Portland, Oregon, USA, pages 142–150.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. pages 3111–3119.
Jonas Mueller, David Gifford, and Tommi Jaakkola. 2017. Sequence to better sequence: continuous revision of combinatorial structures. In International Conference on Machine Learning. pages 2536–2544.
Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of ACL. pages 115–124.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, Stroudsburg, PA, USA, ACL ’02, pages 311–318. https://doi.org/10.3115/1073083.1073135.
Alec Radford, Rafal Jozefowicz, and Ilya Sutskever. 2017. Learning to generate reviews and discovering sentiment. arXiv preprint arXiv:1704.01444.
Oscar Täckström and Ryan McDonald. 2011. Discovering fine-grained sentiment with latent variable structured prediction models. In Proceedings of the 33rd European Conference on Advances in Information Retrieval. Springer-Verlag, Berlin, Heidelberg, ECIR’11, pages 368–374.