Multi-Lingual Twitter Sentiment Analysis (MLTSA)
Huang, Binxuan & Carley, Kathleen M.
Over the past decade, the use of social media has grown rapidly. Every minute, internet users generate vast numbers of data points, and much of this data takes the form of text. This user-generated content contains information about what people think, what they hold opinions about, and what they are doing. Analyzing these text documents falls within the field of natural language processing, and analyzing the opinions they express is referred to as sentiment analysis. The task of sentiment classification is to determine whether the writer's opinion toward a topic is positive, negative, or neutral.
A lot of work has been done in this research area. However, that work focuses mainly on English documents and rarely attempts sentiment classification in a multi-lingual environment. Social media data, however, is multi-lingual. Compared with sentiment classification in a single language, dealing with multiple languages poses numerous additional challenges. For example, we know the grammar and syntactic features of English, but we are unfamiliar with the sentence structures of other languages. As another example, many mature English lexicons and corpora are available to researchers for testing a classifier; however, comparable resources do not exist for other languages.
Another problem with current sentiment approaches is that they simply note whether a text is positive, negative, or neutral. They do not assess what that sentiment is directed toward.
The goal of this project is to develop a scalable multi-lingual sentiment analysis tool that not only handles multiple languages but also assesses what the sentiment is directed toward.
A direct way to solve the multi-lingual problem is to translate the original sentence into English by machine translation, since many sentiment analysis tools exist for English. Our current results show that mapping sentences into English alters the sentiment information carried in the original language and thus decreases the accuracy of sentiment classification, compared with classifying sentiment directly in the original language.
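The translate-then-classify baseline described above can be sketched roughly as follows. This is a minimal illustration only, not the project's implementation: the lexicon `ENGLISH_LEXICON`, the word table `TOY_ES_EN`, and the functions `classify_english`, `toy_translate`, and `translate_then_classify` are all hypothetical names introduced here, and the word-by-word lookup merely stands in for a real machine translation system.

```python
from typing import Callable, Dict

# Toy English sentiment lexicon (illustrative entries, not from the project).
ENGLISH_LEXICON: Dict[str, int] = {
    "good": 1, "great": 1, "love": 1,
    "bad": -1, "terrible": -1, "hate": -1,
}

# Hypothetical stand-in for machine translation: a word-by-word
# Spanish-to-English lookup. Unknown words pass through unchanged.
TOY_ES_EN: Dict[str, str] = {
    "me": "i", "encanta": "love", "este": "this", "libro": "book",
}


def classify_english(text: str) -> str:
    """Label English text by summing the lexicon scores of its tokens."""
    score = sum(ENGLISH_LEXICON.get(tok, 0) for tok in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def toy_translate(text: str) -> str:
    """Translate token by token using the toy lookup table."""
    return " ".join(TOY_ES_EN.get(tok, tok) for tok in text.lower().split())


def translate_then_classify(text: str, translate: Callable[[str], str]) -> str:
    """Baseline: map the sentence into English, then classify the result."""
    return classify_english(translate(text))


if __name__ == "__main__":
    # The sentiment word survives translation, so the label is recovered.
    print(translate_then_classify("Me encanta este libro", toy_translate))
    # The sentiment word is not covered by the translator, so the
    # English classifier sees nothing it recognizes and returns "neutral".
    print(translate_then_classify("Es genial", toy_translate))
```

In this toy setting, the second sentence loses its sentiment because the translation step fails to carry the sentiment-bearing word into English. That is one simple way the information loss noted above could arise; real translation systems may distort sentiment in subtler ways.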