Social media platforms have become popular for sharing sentiments on a variety of topics. However, the text on such platforms is often influenced by regional languages. This gives rise to a distinct multilingual phraseology that uses informal diction, nonstandard abbreviations, and improper grammar, and tends to switch between languages mid-utterance, a phenomenon known as code-switching [5]. As a consequence, automatic sentiment classification becomes highly challenging.
Deep learning models have been successful for many NLP tasks involving multilingual and code-switched text. One way to improve the predictive performance of a model is to annotate each word with its respective language (code-switching indications) [12]. A serious limitation of this approach is its scalability to large data, as the annotation task becomes laborious. More recent approaches translate the under-resourced language into English and then use the resources of the English language to solve the problem at hand [3, 11, 13]. However, this approach is only practical for languages with robust translation resources, so such a strategy is infeasible for an informal language. It is established that pre-trained word embeddings boost the predictive performance of language models [9]. However, such embeddings are limited to the English language, with no equivalent for informal languages. An alternative, therefore, is to use character-based embeddings [1]. Such embeddings are available as models pre-trained on large-scale English data and are hence well-suited for any language written in the English alphabet.
The focus of this paper is a specific informal and multilingual dialect of communication known as Roman Urdu, which uses the English alphabet to write Urdu and tends to code-switch between English and Urdu. Despite its prevalence, Roman Urdu has received little attention, and research on it lags behind due to the non-availability of gold-standard datasets.
Our first contribution is an annotated dataset, called MultiSenti, for sentiment classification of Roman Urdu short text. Our second contribution is an investigation of the feasibility of adapting character-based pre-trained embedding models for sentiment classification of Roman Urdu short text. To contrast with the adapted embeddings, we also train our own word-based multilingual embeddings on the Roman Urdu corpus. Our third contribution is a deep learning model for sentiment classification of Roman Urdu short text, namely McM. The model learns from raw text only, without lexical normalization, language translation, language transliteration, or code-switching indication. The performance of the proposed model is compared with three existing multilingual sentiment classification models. The results demonstrate that McM outperforms the other models in all of the experiments. The study also demonstrates the practicality and usefulness of adapting character-based embeddings pre-trained on English for Roman Urdu.
Table 1: MultiSenti dataset characteristics
The MultiSenti dataset was collected from Twitter during and after the 2018 general elections of Pakistan to identify the overall emotion and sentiment of the populace towards the ongoing election process and its result. The dataset is categorized into predefined sentiment classes. A sentiment in a tweet can be expressed in monolingual or multilingual form, i.e., (i) Roman Urdu, (ii) English, or (iii) Mixed. Preprocessing of the data is kept minimal: the text is lowercased and all records containing only a single word are removed. The "gold standard" is constructed by manually annotating 20,735 samples into the predefined categories by two annotators under the supervision of a domain expert. In case of conflict between the annotators, the decision of the domain expert is taken. Class label percentages and language ratios in the dataset are presented in Table 1. Class-based stratified sampling with an 80/20 ratio is used to generate the train and test splits of the data. These splits are made publicly available.
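The class-based stratified 80/20 split can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used to produce the released splits: the function name `stratified_split`, the seed, the toy tweets, and the label names are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, test_fraction=0.2, seed=0):
    """Split samples so that each class contributes the same share to the test part."""
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_class):
        items = by_class[label][:]
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test += [(s, label) for s in items[:n_test]]
        train += [(s, label) for s in items[n_test:]]
    return train, test


# Toy data: tweets and label names are invented for illustration only.
tweets = [f"tweet {i}" for i in range(100)]
labels = ["a"] * 60 + ["b"] * 30 + ["c"] * 10
train, test = stratified_split(tweets, labels)
```

Because the split is done per class, the class proportions of the full set (60/30/10 in the toy data) are preserved in both parts.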
3.1 Language Resource Adaptation
We first examine the feasibility of resource adaptation involving deep learning models and word embedding choices. We select three models with strong predictive performance on multilingual sentiment classification: (i) ConvNets [6], (ii) Attention-LSTM [13], and (iii) SimpleConv [2]. All models are reimplemented using hyperparameters as defined in the original studies.
Regarding word embedding choices, a well-known problem is that of "out-of-vocabulary" words, which are not found in the embedding base. In such cases, either random initialization of embeddings [2] or the use of character-based pre-trained embeddings [1] is plausible. We investigate both strategies on the MultiSenti dataset using all three adapted deep learning models mentioned above. For our experiments, random embeddings of 300 dimensions are initialized from a uniform distribution. The choice of character-based pre-trained embeddings is restricted to ELMo [8], which is trained on a large-scale English corpus and produces an embedding of 1024 dimensions. During the training of a model, the embedding layer can either be finetuned or kept fixed [13]. To assess the out-of-the-box performance of the pre-trained embedding model, we also take predictions directly from ELMo by introducing a softmax layer on top of it. In this way, a total of 14 experiments are performed, and their results are shown in Table 2 (lines 1-2 for each adapted model).
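One reading of how the 14 runs arise is three models, two embedding choices, and two finetuning settings (12 runs), plus ELMo with only a softmax layer on top, with and without finetuning (2 runs). This breakdown is inferred from the description rather than stated explicitly, and the label `softmax-only` is a hypothetical name used only in this sketch.

```python
from itertools import product

models = ["ConvNet", "Attention-LSTM", "SimpleConv"]
embeddings = ["random", "ELMo"]
finetune_options = [False, True]

# Every adapted model is paired with every embedding and finetuning setting.
experiments = [
    {"model": m, "embedding": e, "finetune": f}
    for m, e, f in product(models, embeddings, finetune_options)
]

# ELMo evaluated out-of-the-box: only a softmax layer on top (assumed label).
experiments += [
    {"model": "softmax-only", "embedding": "ELMo", "finetune": f}
    for f in finetune_options
]

print(len(experiments))
```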
The experiments reveal that ELMo out-of-the-box performs on par with the other variations, though finetuning does not affect its predictive performance. However, random embedding initialization benefits from finetuning on all three models. Slightly superior results are achieved when the ConvNet and Attention-LSTM models are used on top of ELMo. Interestingly, the SimpleConv model shows a significant decline in performance when ELMo embeddings are used without finetuning. With finetuning, however, it achieves results comparable to the other variants. It is also worth noting that random embeddings without finetuning yield the lowest performance. These planned comparisons reveal that finetuning the embedding layer is more beneficial than freezing its weights during training, and that using a deep model on top of an embedding layer is a sensible choice. However, all models underperform on this informal language dataset relative to the results reported on formal languages such as English, French, Greek, and Chinese [6, 13]. These observations indicate that existing models for formal languages are not well-suited for informal language and call for novel model architectures specifically tailored to informal and code-switched language.
3.2 Proposed Model
Our proposed deep learning model, called McM, employs three feature learners (cascades) that are trained for classification independently (in parallel), as shown in Figure 1. The features learned by these learners are forwarded to a discriminator network for the final prediction. Each of these four components is discussed below.
3.2.1 Stacked-CNN Learner: This learner is employed to learn n-gram features for identifying relationships between words. A 1-d convolution filter is used with a sliding window (kernel) of size k (the n-gram length) to extract the features. Two CNN layers are stacked, which use k = 1 and k = 2 respectively. The ReLU activation function, defined as $f(x) = \max(0, x)$, is used to introduce non-linearity. We use 300 filters and stride = 1 for both layers. The output of the second CNN layer is followed by (i) global max-pooling to remove low-activation information from the feature maps of all filters, and (ii) global average-pooling to get the average activation across all the n-grams. These two outputs are then concatenated and forwarded to a small feedforward network having two fully-connected layers, followed by a softmax layer for the prediction of this particular learner. A dropout layer with a rate of 0.5 and a batch-normalization layer are used between the fully-connected layers to avoid over-fitting.
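The convolution and pooling steps can be illustrated with a small pure-Python sketch. It shows a single convolution layer with two hand-picked filters on 2-dimensional toy vectors, not the stacked two-layer, 300-filter configuration described above; the function names and all numbers are assumptions made for illustration.

```python
def conv1d_relu(sequence, kernel, bias=0.0):
    """Apply one filter over a sequence of word vectors (stride 1), then ReLU.

    `kernel` holds k weight vectors, one per word position in the window.
    """
    k = len(kernel)
    outputs = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        total = bias
        for weights, vector in zip(kernel, window):
            total += sum(w * x for w, x in zip(weights, vector))
        outputs.append(max(0.0, total))  # ReLU: f(x) = max(0, x)
    return outputs


def pooled_features(feature_maps):
    """Concatenate global max-pooling and global average-pooling over all filters."""
    maxes = [max(fm) for fm in feature_maps]
    means = [sum(fm) / len(fm) for fm in feature_maps]
    return maxes + means


# Four "words" with toy 2-d embeddings and two filters of size k = 2.
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
kernels = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [0.0, 0.0]],
]
feature_maps = [conv1d_relu(sequence, kernel) for kernel in kernels]
features = pooled_features(feature_maps)
```

Each filter yields one feature map over the bigram windows; pooling reduces every map to two numbers, so the concatenated vector has twice as many entries as there are filters.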
3.2.2 Stacked-LSTM Learner: An LSTM captures the order information of words: each word is treated as one time step and is fed to the LSTM sequentially. While processing the input at the current time step $t$, the LSTM also takes into account the previous hidden state $h_{t-1}$. The stacked-LSTM learner comprises two LSTM layers. The output of the first LSTM layer is fed to the second LSTM layer, and the output of the second LSTM layer is forwarded to global max-pooling and global average-pooling layers. The former drops the low activations, while the latter averages activations across all time steps. These two outputs are concatenated and forwarded to a two-layer feedforward network for intermediate supervision, identical to that of the previously described stacked-CNN learner. We use 300 LSTM units in both layers.
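For reference, the recurrence can be written out using the standard LSTM formulation. The text does not specify which LSTM variant is used, so the equations below, and the symbols $W$, $U$, $b$ (weights and biases), $x_t$ (input at step $t$), $c_t$ (cell state), and $T$ (number of time steps), are assumptions rather than the authors' exact definition.

```latex
\[
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t) \\
h^{\max} &= \max_{1 \le t \le T} h_t, \qquad
h^{\mathrm{avg}} = \frac{1}{T} \sum_{t=1}^{T} h_t
\end{aligned}
\]
```

Here the maximum is taken element-wise over the hidden states of the second LSTM layer, and the concatenation of $h^{\max}$ and $h^{\mathrm{avg}}$ is what would be passed to the feedforward network.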
3.2.3 LSTM Learner: This learner is employed to learn long-term dependencies of the text as described in [10]. It encodes the complete input text recursively and returns a single vector. The dimension of the output vector is equal to the number of LSTM units deployed. This encoded text representation is then forwarded to a small feedforward network, identical to that of the aforementioned two learners, for intermediate supervision in order to learn features. This learner differs from the stacked-LSTM learner in that it learns sentence features rather than average and max features over all time steps (input words). This learner uses 300 LSTM units.
Figure 1: Multi-cascaded model (McM) for sentiment classification of informal short text
3.2.4 Discriminator Network: This small feedforward network aggregates the features learned by each of the three learners described above and squashes them into a small network for the final prediction. It employs two fully-connected layers with dropout and batch-normalization layers, along with the ReLU activation function for non-linearity. The softmax activation function with categorical cross-entropy loss is used on the final prediction layer to get the probability of each class. The class label is assigned based on maximum probability. This is treated as the final prediction of the proposed model. Note that the number of convolutional filters, the number of units in dense layers, and the number of LSTM units are chosen empirically. The rest of the hyperparameters (choices of k, dropout rate, optimizer, and learning rate) were selected by performing a grid search using a 20% stratified validation set taken from the training set, with random embedding initialization and no finetuning. The complete architecture, along with the dimensions of each output, is shown in Figure 1. The network is optimized using the "Adam" optimizer with a learning rate of 0.002.
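The final prediction step (softmax, categorical cross-entropy, and maximum-probability label assignment) can be sketched in plain Python. The logits below are made-up values and the function names are hypothetical; in the actual model these operations are provided by the deep learning library.

```python
import math


def softmax(logits):
    """Convert raw scores into class probabilities (shifted for numerical stability)."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(probabilities, true_index):
    """Loss for one sample whose true class is `true_index` (one-hot target)."""
    return -math.log(probabilities[true_index])


def predict_class(probabilities):
    """Assign the label with the maximum probability."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)


logits = [2.0, 0.5, -1.0]  # invented scores for three classes
probs = softmax(logits)
label = predict_class(probs)
loss_if_correct = categorical_crossentropy(probs, 0)
loss_if_wrong = categorical_crossentropy(probs, 2)
```

The loss is small when the true class receives high probability and grows as that probability shrinks, which is what drives training of the prediction layer.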
3.3 Multilingual Embeddings
We also compare multilingual embeddings constructed from a combined corpus consisting of the MultiSenti dataset and another large-scale Roman Urdu dataset (these embeddings are made available along with the dataset). The combined corpus contains more than 6.5 million words. We use the skip-gram model of word2vec with a word vector size of d = 300, as suggested in the original study [7]. These embeddings are trained for 500,000 iterations.
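To make the skip-gram objective concrete, the sketch below generates the (center word, context word) training pairs that the model learns to predict. The window size and the example sentence are assumptions for illustration; the text does not state the window used, and this is not the word2vec training code itself.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Invented code-switched example sentence, lowercased as in the preprocessing.
tokens = "yeh election bohat important hai".split()
pairs = skipgram_pairs(tokens, window=1)
```

Because Roman Urdu and English words share contexts in code-switched text, both end up in the same embedding space when trained this way.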
3.4 Implementation Details
All implementation is done in Python using the Keras library with the TensorFlow backend. All weights of the networks are initialized randomly and, to mitigate the effect of randomness, the random seed is fixed across all experiments. For every experiment, the model is trained for 100 epochs. A checkpoint of the learned weights is saved at the epoch with the best predictive performance on the test split. Early stopping is also used: training is stopped if the testing error does not decrease for 10 epochs.
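The checkpointing and early-stopping logic can be sketched independently of any library. The function below is a simplified illustration: it takes a precomputed list of per-epoch errors instead of training a model, uses one error value for both checkpoint selection and stopping, and its name and the toy error curve are hypothetical.

```python
def run_with_early_stopping(errors_per_epoch, max_epochs=100, patience=10):
    """Return (best_epoch, last_epoch_run) for a sequence of per-epoch errors."""
    best_error = float("inf")
    best_epoch = None
    last_epoch = 0
    epochs_without_improvement = 0

    for epoch, error in enumerate(errors_per_epoch[:max_epochs], start=1):
        last_epoch = epoch
        if error < best_error:
            # This is where the weights would be checkpointed.
            best_error, best_epoch = error, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_epoch, last_epoch


# Toy error curve: improves for three epochs, then plateaus.
errors = [0.9, 0.8, 0.7] + [0.75] * 20
best_epoch, last_epoch = run_with_early_stopping(errors)
```

With a patience of 10, training on this toy curve halts ten epochs after the last improvement, and the weights from the best epoch are the ones kept.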
We report the performance of all variations using accuracy and macro-averaged precision, recall, and F1-score in Table 2. In the discussion, however, we focus on the F1-score. Based on the results, we make the following observations.
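Macro-averaging computes each metric per class and then takes the unweighted mean, so minority classes count as much as majority ones. The sketch below follows one common convention in which macro F1 is the mean of the per-class F1-scores; the text does not say which convention was used, and the labels in the example are invented.

```python
def macro_scores(y_true, y_pred):
    """Return macro-averaged (precision, recall, F1) over all classes seen."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Invented labels for three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
macro_p, macro_r, macro_f1 = macro_scores(y_true, y_pred)
```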
Using pre-trained embeddings out-of-the-box yields identical performance with and without finetuning. When a model is used on top of the embeddings, ELMo embeddings without finetuning outperform the random and multilingual embeddings on ConvNet and Attention-LSTM. Interestingly, in the case of the SimpleConv model, ELMo yields the poorest performance. Further examination of this case revealed that the model is unable to learn when pre-trained embeddings are used. However, using random embeddings for SimpleConv gives output comparable to the other models. This suggests that the model is specifically engineered to work with random embeddings (as is evident from the original study). With random embeddings, the proposed model McM achieves the highest F1-score. Amongst the rest of the models, ConvNet marginally outperforms Attention-LSTM. With multilingual embeddings, SimpleConv achieves the lowest F1-score, while McM achieves the highest score, which surpasses all the experiments without finetuning the embeddings.
Table 2: Performance evaluation of variations of the proposed models and baselines. (Showing highest scores in boldface.)
Turning now to the case of finetuning, ConvNet performs identically in terms of F1-score for all of the embeddings, while Attention-LSTM and SimpleConv benefit from finetuning when random embeddings are used. For the McM model, ELMo embeddings yield the highest F1-score of 0.65. This is an interesting finding, as it is identical to the F1-score of McM when multilingual embeddings without finetuning are used.
It is worth noting that even though simpler networks such as ConvNet and SimpleConv take the least amount of training time, their performance is inconsistent across settings. In contrast, the proposed model McM shows the highest performance in the majority of the cases, with a 3% variation for each embedding. These findings lead us to conclude that there is no apparent advantage in training word-based multilingual embeddings from scratch. Character-based embeddings pre-trained on English, with finetuning, suffice for informal language to get identical results while avoiding the pre-training overhead. However, to get the most out of these embeddings, a carefully tailored model for sentiment classification of informal short text is crucial. One can argue that embeddings trained on an informal multilingual corpus comparable in size to the English corpus could yield better performance than adapted embeddings. However, this leads back to the initial problem of not having enough data resources for informal languages.
Our work has led us to conclude that adapting existing resources from a resource-rich language to an informal language is practical. The results show that an embedding trained on a sufficiently large English corpus can successfully be adapted for an informal language. However, this is not necessarily true for the model choice. It is crucial that a model is engineered specifically for an informal language rather than adapted from models developed for other languages. As future research, we plan to investigate other embedding choices such as BERT [4].
[1] Monika Arora and Vineet Kansal. 2019. Character level embedding with deep convolutional neural network for text normalization of unstructured data for Twitter sentiment analysis. Social Network Analysis and Mining 9, 1 (2019), 12.
[2] Mohammed Attia, Younes Samih, Ali Elkahky, and Laura Kallmeyer. 2018. Multilingual Multi-class Sentiment Classification Using Convolutional Neural Networks. In International Conference on Language Resources and Evaluation. 635–640.
[3] Xilun Chen, Yu Sun, Ben Athiwaratkun, Claire Cardie, and Kilian Weinberger. 2018. Adversarial Deep Averaging Networks for Cross-Lingual Sentiment Classification. Transactions of the Association for Computational Linguistics 6 (2018), 557–570.
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 4171–4186.
[5] Mehwish Fatima, Saba Anwar, Amna Naveed, Waqas Arshad, Rao Muhammad Adeel Nawab, Muntaha Iqbal, and Alia Masood. 2018. Multilingual SMS-based author profiling: Data and methods. Natural Language Engineering 24, 5 (2018), 695–724.
[6] Lisa Medrouk and Anna Pappa. 2017. Deep learning model for sentiment analysis in multi-lingual corpus. In International Conference on Neural Information Processing. 205–212.
[7] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 3111–3119.
[8] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2227–2237.
[9] Muhammad Haroon Shakeel, Asim Karim, and Imdadullah Khan. 2019. A Multi-cascaded Deep Model for Bilingual SMS Classification. In International Conference on Neural Information Processing. 1–12.
[10] Xingyou Wang, Weijie Jiang, and Zhiyong Luo. 2016. Combination of convolutional and recurrent neural network for sentiment analysis of short texts. In International Conference on Computational Linguistics. 2428–2437.
[11] Zhongqing Wang, Sophia Lee, Shoushan Li, and Guodong Zhou. 2015. Emotion detection in code-switching texts via bilingual and sentimental information. In Proceedings of the Annual Meeting of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing. 763–768.
[12] Zhongqing Wang, Yue Zhang, Sophia Lee, Shoushan Li, and Guodong Zhou. 2016. A bilingual attention network for code-switched emotion prediction. In International Conference on Computational Linguistics. 1624–1634.
[13] Xinjie Zhou, Xiaojun Wan, and Jianguo Xiao. 2016. Attention-based LSTM network for cross-lingual sentiment classification. In Conference on Empirical Methods in Natural Language Processing. 247–256.