Numerous research works have been done on sentiment analysis of English text. Cui et al. [2] dealt with online product reviews. They classified the reviews into two major classes, positive and negative, and considered around 100k product reviews from various websites. Jagtap et al. [3] applied Support Vector Machine (SVM) and Hidden Markov Model (HMM); their hybrid classification model performed well at extracting the sentiment of teacher feedback assessments. Alm et al. [4] grouped seven emotion words into three polarity classes: positive emotional, negative emotional and neutral. They used the Winnow parameter tuning method and obtained 63% accuracy. Agarwal et al. [5] applied a unigram model, a tree model and a feature-based model to extract Twitter sentiment. The unigram model is outperformed by the tree model and the feature-based model. The accuracy they obtained is around 61%. Zou et al. [6] presented a model for learning bilingual word embeddings from a large unlabeled dataset. They showed that their model outperforms baselines in semantic similarity of words. Turian et al. [7] worked on Brown clusters, the embeddings of Collobert and Weston (2008) and hierarchical log-bilinear embeddings. Chen et al. [8] proposed some approaches that can distinguish between the released word embedding models. They showed that embeddings can capture surprising semantics of sentences even without having the structure. Tang et al. [9] presented a method of gathering both contextual and sentiment information of words by learning Sentiment-Specific Word Embedding. They applied their model to extract Twitter sentiment. The accuracy they obtained is around 83%. Levy et al. [10] worked on the skip-gram model with negative sampling of Mikolov et al. (2013) and generalized it. They extracted dependency-based contexts and showed that these produce different kinds of similarities.
A few research works have been done on Bengali, which are described in detail in a comprehensive study on sentiment of Bengali text [11]. Chowdhury et al. [12] applied Support Vector Machine (SVM) and Maximum Entropy (MaxEnt) to identify the sentiment of Bengali microblog posts. They tested these two methods by combining them with various kinds of features. Hasan et al. [13] described a method of detecting the sentiment of Bengali text by Contextual Valency Analysis. Islam et al. [14] took an approach using a Naïve Bayes classification model for the Bengali language, in which a supervised classification method is used together with language rules to detect the sentiment of Bengali Facebook statuses.
Islam et al. [15] worked on word embedding with Hellinger PCA to detect the sentiment of Bengali comments. A word co-occurrence matrix is constructed with skip-gram to determine the contextual information of the comments, and sliding windows are created to gather similar words within the windows. The same authors also took a new approach [16] to sentiment classification of Bengali comments with word2vec, in which sentiment extraction of words is presented. Combining the word2vec word co-occurrence score with the sentiment polarity score of the words, the accuracy obtained is 75.5%. Mahmud et al. [17] took an approach to predict movie success by analyzing public sentiment with a support vector machine and statistical reasoning. They used a non-linear RBF kernel for their sentiment classifier, which achieved better accuracy than classifiers that use linear kernels on the well-known IMDB Movie Review Dataset (89.51% accuracy) and on the Pang and Lee Movie Review Dataset (86.86% accuracy). Using their system, they can predict whether a movie will be successful or not with an accuracy of 90.3%.
For this analysis, we worked with two classes, Approval and Disapproval. Approval, in this context, means that the comment approves of the movement of the Rohingya people inside the country, or at least approves of the Rohingya community as human beings with all human rights. Disapproval, in this context, means that the comment does NOT approve of the Rohingya community and intends to display hatred or dislike towards their well-being.
Initially we tried to classify the comments with a Naïve Bayes classifier, but its accuracy and precision were less than impressive. This was probably because the context of our comments was complex: they were very political and depended heavily on sentence structure. Another reason might be that we used unigrams as the n-gram features for the Naïve Bayes classifier. Since sentence structure played a vital part in the sentiment of the comments, a Naïve Bayes algorithm with unigrams did not give us satisfactory results, reaching an accuracy of only 67%. For example, "The Rohingya muslims are terrorists" was classified as a negative sentiment towards the Rohingya movement, while "The Myanmar army are the terrorists" was also classified as a negative sentiment, even though the two sentences direct the same word at different parties.
We addressed this problem by switching our algorithm to a Support Vector Machine that uses both unigrams and bigrams. The results were more accurate by a big margin.
The training corpus was arguably the biggest difficulty we faced in this project, as we did not find any free, open-source training corpus based on the Rohingya conflict on the net. Consequently, we had to collect and label our own training corpus of 5000 comments.
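
To illustrate the unigram limitation discussed above, the following minimal sketch extracts unigram and bigram features from the two example sentences. It is not the feature extractor used in this work; the helper names `ngrams` and `features` and the whitespace tokenisation are assumptions made for illustration only.

```python
from typing import List, Set


def ngrams(tokens: List[str], n: int) -> Set[str]:
    """Return the set of word n-grams (joined by a space) in a token list."""
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def features(sentence: str, max_n: int) -> Set[str]:
    """Collect all n-grams for n = 1..max_n from a lower-cased sentence."""
    tokens = sentence.lower().split()  # naive whitespace tokenisation (assumption)
    feats: Set[str] = set()
    for n in range(1, max_n + 1):
        feats |= ngrams(tokens, n)
    return feats


a = "The Rohingya muslims are terrorists"
b = "The Myanmar army are the terrorists"

# With unigrams only, both sentences share the word "terrorists".
shared_unigrams = features(a, 1) & features(b, 1)

# Bigrams that occur in only one of the two sentences.
only_a = ngrams(a.lower().split(), 2) - ngrams(b.lower().split(), 2)
only_b = ngrams(b.lower().split(), 2) - ngrams(a.lower().split(), 2)

print(sorted(shared_unigrams))
print(sorted(only_a))
print(sorted(only_b))
```

In this sketch the two sentences overlap on the unigram "terrorists", whereas bigrams such as "rohingya muslims" and "myanmar army" add local word-order context that a unigram model lacks.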
We applied a set of pre-processing steps to make the comments suitable for the SVM algorithm and to improve performance. The following pre-processing was done on the comments:
i. Lower case: Convert the comments to lower case.
ii. URLs: Convert www.* or https?://* to 'URL'.
iii. @username: Convert @username to '__HANDLE'.
iv. #hashtag: Hashtags can give us some useful information, so we replace them with the same word without the hash. E.g. #Apple is replaced with 'Apple'.
v. Trimming the comment.
vi. Repeating characters: People often repeat characters when using colloquial language, such as "I'm happyyyyy". We replace characters repeated more than twice with just two characters, so that the result for the above would be "I'm happyy".
vii. Emoticons: Use of emoticons is prevalent in comments. We identify a set of emoticons and replace them with the representative sentiment, i.e. 'positive' or 'negative'. E.g. ':)' is replaced by 'positive'. Positive refers to approval and negative refers to disapproval. Further, if one or more emoticons are found in a comment, the SVM classifier is not called and the comment is classified as positive or negative based on the emoticon alone.
Stemming algorithms are used to find the "root word" or stem of a given word. We used the PorterStemmer.
The parameters of the SVM classifier were tuned to improve its performance. The following parameters were found to give the best results on the cross-validation set (20% of the training corpus) without compromising much on speed.
i. TfidfVectorizer:
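
The pre-processing steps listed earlier can be sketched with regular expressions as follows. This is a minimal illustration under stated assumptions, not the code used in this work: the emoticon lexicon `EMOTICONS`, the function names, and the `svm_predict` callable are hypothetical, and stemming is left out to keep the sketch free of third-party dependencies.

```python
import re
from typing import Callable, Optional

# Hypothetical emoticon lexicon; the set actually used is not listed in the text.
EMOTICONS = {":)": "positive", ":-)": "positive", ":D": "positive",
             ":(": "negative", ":-(": "negative"}


def emoticon_label(comment: str) -> Optional[str]:
    """Return 'positive' or 'negative' if a known emoticon occurs, else None."""
    for emoticon, label in EMOTICONS.items():
        if emoticon in comment:
            return label
    return None


def preprocess(comment: str) -> str:
    """Apply steps i-vi to a raw comment (emoticons are handled separately)."""
    text = comment.lower()                                   # i. lower case
    text = re.sub(r"(www\.\S+|https?://\S+)", "URL", text)   # ii. URLs -> 'URL'
    text = re.sub(r"@\w+", "__HANDLE", text)                 # iii. @username
    text = re.sub(r"#(?=\S)", "", text)                      # iv. drop the hash sign
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)               # vi. 3+ repeats -> 2
    text = re.sub(r"\s+", " ", text).strip()                 # v. trimming
    return text


def classify(comment: str, svm_predict: Callable[[str], str]) -> str:
    """Use the emoticon shortcut if possible, otherwise call the classifier."""
    label = emoticon_label(comment)
    if label is not None:
        return label  # vii. the SVM is not called when an emoticon is found
    return svm_predict(preprocess(comment))


print(preprocess("  @Bob check www.example.com I'm happyyyyy #Apple  "))
```

Because lower-casing is applied first in this sketch, the hashtag example yields 'apple' rather than 'Apple'; the emoticon check runs on the raw comment so that case-sensitive emoticons such as ':D' are still found.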
The algorithm achieves an overall precision, recall and F1-score of 0.79 (79%). The details can be found in the table below (they can be reproduced by running training.py):
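
As a reminder of how the reported metrics are defined, the sketch below computes precision, recall and F1-score for one class from its true-positive, false-positive and false-negative counts. The function name and the counts are made up purely for illustration; they are not the results of this work and are not taken from training.py.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute per-class precision, recall and F1 from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Illustrative counts only (assumption), not the paper's confusion matrix.
p, r, f = precision_recall_f1(tp=80, fp=20, fn=20)
print(p, r, f)
```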
In the experiment, we took 20% of the data set as testing data and found that the Support Vector Machine yields an accuracy of up to 79% in analysing context-based comments, which Naïve Bayes fails to do because it cannot take the context into account with unigrams as features. However, we did not use any specific features and left out the word position in the sentences, which caused a loss of accuracy as the algorithm struggled to find context. The data set was small, and our approach classifies the comments into only two categories. More classification categories, such as a neutral category, can be added for a better analysis. This thesis creates a platform for analysing political data, and future work can address the aforementioned problems.
[1] Lewis, M. Paul, Gary F. Simons, and Charles D. Fennig (eds.), “Ethnologue: Languages of the World,” Nineteenth edition. Dallas, Texas: SIL International, 2016.
[2] Hang Cui, Vibhu Mittal and Mayur Datar, “Comparative Experiments on Sentiment Classification for Online Product Reviews,” Proceedings of the 21st National Conference on Artificial Intelligence, AAAI, Boston, MA, 2006.
[3] Balaji Jagtap and Virendrakumar Dhotre, “SVM and HMM Based Hybrid Approach of Sentiment Analysis for Teacher Feedback Assessment,” International Journal of Emerging Trends & Technology in Computer Science (IJETTCS), Volume 3, Issue 3, May-June 2014.
[4] C. Alm, D. Roth and R. Sproat, “Emotions from text: machine learning for text-based emotion prediction,” EMNLP, 2005.
[5] Apoorv Agarwal, Boyi Xie, Ilia Vovsha, Owen Rambow and Rebecca Passonneau, “Sentiment Analysis of Twitter Data,” LSM '11 Proceedings of the Workshop on Languages in Social Media, pages 30-38, 2011.
[6] Will Y. Zou, Richard Socher, Daniel Cer and Christopher D. Manning, “Bilingual Word Embeddings for Phrase-Based Machine Translation,” SemEval, 2012.
[7] Joseph Turian, Lev Ratinov and Yoshua Bengio, “Word representations: A simple and general method for semi-supervised learning,” Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394, Uppsala, Sweden, 11-16 July 2010.
[8] Yanqing Chen, Bryan Perozzi, Rami Al-Rfou, and Steven Skiena, “The Expressive Power of Word Embeddings,” ICML 2013 Workshop on Deep Learning for Audio, Speech, and Language Processing, Atlanta, USA, June 2013.
[9] Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu and Bing Qin, “Learning Sentiment-Specific Word Embedding for Twitter Sentiment Classification,” Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 1555–1565, Baltimore, Maryland, USA, June 23-25, 2014.
[10] Omer Levy and Yoav Goldberg, “Dependency-Based Word Embeddings,” Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Short Papers), pages 302–308, Baltimore, Maryland, USA, June 23-25, 2014.
[11] Al-Amin, Md, Md Saiful Islam, and Shapan Das Uzzal. "A comprehensive study on sentiment of Bengali text." In Electrical, Computer and Communication Engineering (ECCE), International Conference on, pp. 267-272. IEEE, 2017.
[12] Shaika Chowdhury and Wasifa Chowdhury, “Sentiment Analysis for Bengali Microblog Posts,” International Conference on Informatics, Electronics & Vision (ICIEV), 2014.
[13] K. M. Azharul Hasan, Mosiur Rahman and Badiuzzaman, “Sentiment Detection from Bengali Text using Contextual Valency Analysis,” 17th Int'l Conf. on Computer and Information Technology, Daffodil International University, Dhaka, Bangladesh, 22-23 December 2014.
[14] Islam, Md Saiful, Md Ashiqul Islam, Md Afjal Hossain, and Jagoth Jyoti Dey. "Supervised approach of sentimentality extraction from bengali facebook status." In Computer and Information Technology (ICCIT), 2016 19th International Conference on, pp. 383-387. IEEE, 2016.
[15] Islam, Md Saiful, Md Al-Amin, and Shapan Das Uzzal. "Word embedding with hellinger PCA to detect the sentiment of bengali text." In Computer and Information Technology (ICCIT), 2016 19th International Conference on, pp. 363-366. IEEE, 2016.
[16] Al-Amin, Md, Md Saiful Islam, and Shapan Das Uzzal. "Sentiment analysis of Bengali comments with Word2Vec and sentiment information of words." In Electrical, Computer and Communication Engineering (ECCE), International Conference on, pp. 186-190. IEEE, 2017.
[17] Mahmud, Quazi Ishtiaque, Asif Mohaimen, and Md Saiful Islam. "A support vector machine mixed with statistical reasoning approach to predict movie success by analyzing public sentiments." In Computer and Information Technology (ICCIT), 2017 20th International Conference of, pp. 1-6. IEEE, 2017.