The Sentiment Analysis Primer series is my attempt at documenting the work I’m doing as part of my Bachelor’s thesis, to facilitate a better understanding of this domain. I’ll start from the simplest dictionary matching techniques and build up to more complicated techniques involving deep and aspect-based learning. To save you the pain of hunting down all the resources I had to find, I’ve added a set of the most essential references, which are easy to follow, along with external links wherever necessary.
This post introduces the datasets and preprocessing that will remain common throughout the series. We’ll also look at two of the simplest possible techniques for Sentiment Analysis.
- Datasets and Metrics
- Text Preprocessing
- What's next?
Datasets and Metrics
We’ll be using two standard sentiment analysis datasets of movie reviews - the Large Movie Review Dataset by Stanford AI Lab and the Rotten Tomatoes Dataset as available on a recent Kaggle competition.
Large Movie Review Dataset
This is one of the largest available movie review sentiment corpora, with 50,000 reviews divided equally into training and testing samples. It uses a binary labelling scheme, with 0 representing negative (-ve) sentiment and 1 representing positive (+ve) sentiment.
Let’s look at a few sentences after running some textual cleanup and preprocessing (the exact preprocessing steps are elaborated later).
Label 0 : -ve review starts manager nicholas bell giving welcome investors robert carradine primal park secret project mutating primal animal using fossilized dna like jurassik park scientists resurrect one nature fearsome predators sabretooth tiger smilodon scientific ambition turns deadly however high voltage fence opened creature escape begins savagely stalking prey human visitors tourists scientific meanwhile youngsters enter restricted area security center attacked pack large pre historical animals deadlier bigger addition security agent stacy haiduk mate brian wimmer fight hardly carnivorous smilodons sabretooths course real star stars astounding terrifyingly though convincing giant animals savagely stalking prey group run afoul fight one nature fearsome predators furthermore third sabretooth dangerous slow stalks victims delivers goods lots blood gore beheading hair raising chills full scares sabretooths appear mediocre special effects story provides exciting stirring entertainment results quite boring giant animals majority made computer generator seem totally lousy middling performances though players reacting appropriately becoming food actors give vigorously physical performances dodging beasts running bound leaps dangling walls packs ridiculous final deadly scene small kids realistic gory violent attack scenes films sabretooths smilodon following sabretooth james hickox vanessa angel david keith john rhys davies much better bc roland emmerich steven strait cliff curtis camilla belle motion picture filled bloody moments badly directed george miller originality takes many elements previous films miller australian director usually working television tidal wave journey center earth many others occasionally cinema man snowy river zeus roxanne robinson crusoe rating average bottom barrel
Label 1 : +ve review stuff going moment mj ve started listening music watching odd documentary watched wiz watched moonwalker maybe want get certain insight guy thought really cool eighties maybe make mind whether guilty innocent moonwalker part biography part feature remember going see cinema originally released subtle messages mj feeling towards press also obvious message drugs bad kay visually impressive course michael jackson unless remotely like mj anyway going hate find boring may call mj egotist consenting making mj fans would say made fans true really nice actual feature bit finally starts minutes excluding smooth criminal sequence joe pesci convincing psychopathic powerful drug lord wants mj dead bad beyond mj overheard plans nah joe pesci character ranted wanted people know supplying drugs etc dunno maybe hates mj music lots cool things like mj turning car robot whole speed demon sequence also director must patience saint came filming kiddy bad sequence usually directors hate working one kid let alone whole bunch performing complex dance scene bottom line people like mj one level another think people stay away try give wholesome message ironically mj bestest buddy girl michael jackson truly one talented people ever grace planet guilty well attention ve gave subject hmmm well know people different behind closed doors know fact either extremely nice stupid guy one sickest liars hope latter
Since these are really long, let’s instead gain some insight by looking at the Wordclouds formed by combining the positive and negative reviews:
In case you don’t know, a wordcloud is a pretty common visualization for textual data, where word sizes are proportional to the number of times each word occurs in the data.
- This makes it clear that in both types of reviews, words such as film, movie, one and time are extremely frequent.
- The positive reviews have a high count of words such as good, well, great, best and similar positive words.
- The negative reviews similarly have a high count of negative words such as bad, worst, dont and nothing.
- Finally, what’s surprising is that good seems to be fairly frequent in both types of reviews. This could be due to negations, such as not occurring before good.
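The counts behind such a wordcloud are easy to inspect directly. Here is a minimal sketch, assuming the reviews have already been cleaned into space-separated words; the toy reviews and the `word_frequencies` helper are hypothetical and only stand in for the real datasets.

```python
from collections import Counter


def word_frequencies(reviews):
    """Count how often each word occurs across a list of cleaned reviews."""
    counts = Counter()
    for review in reviews:
        counts.update(review.split())
    return counts


# Hypothetical toy reviews, standing in for the real positive/negative sets.
positive = ["good story great cast", "best good fun"]
negative = ["bad plot not good", "worst bad acting"]

pos_counts = word_frequencies(positive)
neg_counts = word_frequencies(negative)

print(pos_counts.most_common(1))  # [('good', 2)]
print(neg_counts.most_common(1))  # [('bad', 2)]

# Words shared by both classes, like "good" in the observation above.
shared = sorted(set(pos_counts) & set(neg_counts))
print(shared)  # ['good']
```

Note how good shows up in the negative toy set only because of the preceding not, which is the same effect suspected in the real data.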
Rotten Tomatoes Dataset
The Rotten Tomatoes review dataset we will be working with has the reviews split into phrases obtained by the Stanford Parser, which have been manually labelled by Mechanical Turk workers into 5 different levels.
Label 0 : --ve review would hard time sitting one
Label 1 : -ve review series escapades demonstrating adage good goose also good gander occasionally amuses none amounts much story
Label 2 : neutral review series escapades demonstrating adage good goose
Label 3 : +ve review good goose
Label 4 : ++ve review quiet introspective entertaining independent worth seeking
Again, let’s look at the Wordclouds for each individual label.
Once more, the Wordclouds make the expressed sentiment clear, ranging from high counts of bad in the extremely negative cloud to best in the extremely positive cloud.
Throughout the course of this primer, both these datasets will be used for benchmarking as they cover a large corpus of reviews with differently configured labels.
We will be using the following metrics throughout these primers:
- Confusion Matrix
You can read more about these here.
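As a quick illustration of the confusion matrix, here is a small pure-Python sketch. The helper names and the toy labels are hypothetical; rows are assumed to hold the true labels and columns the predicted ones, and accuracy is read off the diagonal.

```python
def confusion_matrix(y_true, y_pred, n_labels):
    """Build an n_labels x n_labels table: rows = true label, columns = predicted label."""
    matrix = [[0] * n_labels for _ in range(n_labels)]
    for truth, guess in zip(y_true, y_pred):
        matrix[truth][guess] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal, i.e. predicted correctly."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0


# Hypothetical binary labels (0 = -ve, 1 = +ve).
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred, n_labels=2)
print(cm)            # [[2, 1], [1, 2]]
print(accuracy(cm))  # 4 correct out of 6
```

The same function works for the 5-level Rotten Tomatoes labels by passing `n_labels=5`.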
Often referred to as data cleaning as well, text preprocessing generally consists of getting rid of elements in the text that add little to the meaning of a sentence. The steps I’ve followed are:
- Stripping off markup tags such as <html> or
- Converting every letter to its lowercase form
- Applying any intermediary regex-based operations
- Removing stopwords, which are extremely common words in the English language that add little meaning to sentences
- Removing words that are extremely short or malformed
- Removing extra spaces at the beginning or the end
Here is a class method which achieves the desired effect:

    import re
    from bs4 import BeautifulSoup

    def clean_sentence(self, sentence):
        if self.html_clean:  # Optional flag
            # Remove HTML markup
            sentence = BeautifulSoup(sentence, "html.parser").get_text()
        sentence = sentence.lower()  # Everything to lowercase
        # Optional regex operations; assumes clean_list holds (pattern, replacement) pairs
        for pattern, replacement in self.clean_list:
            sentence = re.sub(pattern, replacement, sentence)
        # Filter stopwords
        sentence = ' '.join(w for w in sentence.split() if w not in self.stopwords_eng)
        # Filter very short words
        sentence = ' '.join(w for w in sentence.split() if len(w) > 1)
        sentence = sentence.strip(" ")  # Remove possible extra spaces
        return sentence
Note that this uses the external BeautifulSoup4 library for cleaning markup and Python’s built-in re module for applying regex operations. In preprocessing the movie datasets, I’ve used the stopwords that ship with nltk along with a few customized for movie reviews.

    from nltk.corpus import stopwords
    self.stopwords_eng = stopwords.words("english") + [u"film", u"movie"]
The complete preprocessing code can be found here. Look at the DataClean class.
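If you want to try the steps without installing anything, here is a rough standalone sketch using only the standard library. It is not the DataClean class: the tiny `STOPWORDS` set is hypothetical (the post uses nltk's English list plus film and movie), a crude regex stands in for BeautifulSoup, and the letters-only regex is just one example of an intermediary operation.

```python
import re

# Hypothetical, tiny stopword set for illustration only.
STOPWORDS = {"the", "a", "is", "of", "and", "this", "was", "film", "movie"}


def clean_sentence(sentence, stopwords=STOPWORDS):
    """Apply the preprocessing steps listed above to a single raw review."""
    sentence = re.sub(r"<[^>]+>", " ", sentence)   # crude tag stripping (stand-in for BeautifulSoup)
    sentence = sentence.lower()                    # everything to lowercase
    sentence = re.sub(r"[^a-z\s]", " ", sentence)  # example regex step: keep letters only
    # Drop stopwords and one-letter words; split/join also removes extra spaces.
    words = [w for w in sentence.split() if w not in stopwords and len(w) > 1]
    return " ".join(words)


cleaned = clean_sentence("<br />This movie was a GREAT piece of work!")
print(cleaned)  # great piece work
```

A regex cannot handle malformed or nested markup the way a real HTML parser does, so this shortcut is only reasonable for simple tags such as line breaks.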
Simple Dictionary Lookup
A classical technique for sentiment analysis, dictionary-based lookups have received plenty of criticism for being inexhaustive and for ignoring semantic meaning, among other shortcomings. Yet, they were amongst the first and simplest techniques to be applied.
The steps are simple:
- Have a dictionary with a key-value pair as word:score, where score should be positive for positive words and negative for negative words.
- Start iterating through a given review word by word with a score counter of 0.
- If the word being considered is present in the dictionary, add its score to the score counter.
- The final value of the score counter at the end of the review determines the label to be assigned.
For this model we have used the AFINN dictionary.
Here are some sample word-score pairs from this dictionary:
    breathtaking   5
    amazing        4
    amuse          3
    accomplished   2
    achievable     1
    some kind      0
    admit         -1
    accusation    -2
    anger         -3
    bullshit      -4
    prick         -5
Code for scoring each sentence is presented below:
    def compute_score(self, sentence):
        sentence_score = 0
        for word in sentence.split():
            # Only words present in the dictionary contribute to the score
            if word in self.sentiment_dict:
                sentence_score += self.sentiment_dict[word]
        return sentence_score
Note: The complete script for computing dictionary based scores is at this link. The utilities.py script contains some helper functions that can be found here. Ensure that you modify the dataset path in the load_data method correctly.
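To see the lookup in action without the full script, here is a small standalone sketch. The dictionary holds only a few of the AFINN sample pairs listed above, and `score_to_label` is a hypothetical thresholding rule for the binary case; the post does not state the exact mapping from score to label.

```python
# A few word:score pairs copied from the AFINN samples listed above.
SENTIMENT_DICT = {
    "breathtaking": 5, "amazing": 4, "amuse": 3, "accomplished": 2,
    "admit": -1, "accusation": -2, "anger": -3,
}


def compute_score(sentence, sentiment_dict=SENTIMENT_DICT):
    """Sum the scores of all words found in the dictionary; unknown words add 0."""
    return sum(sentiment_dict.get(word, 0) for word in sentence.split())


def score_to_label(score):
    """Hypothetical rule: non-negative total -> 1 (+ve), negative total -> 0 (-ve)."""
    return 1 if score >= 0 else 0


for review in ["amazing breathtaking visuals", "anger accusation nothing amuse"]:
    score = compute_score(review)
    print(review, "->", score, "->", score_to_label(score))
# amazing breathtaking visuals -> 9 -> 1
# anger accusation nothing amuse -> -2 -> 0
```

For the 5-level labels, the single threshold would have to be replaced by several score ranges, and where to place those cut-offs is a tuning decision in its own right.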
On conducting 4-fold cross validation, this model gave an accuracy of 35.75% on the Rotten Tomatoes 5-level sentiment data and 56.08% on the binary Large IMDb Movie Review dataset.
Finally, if you’re looking for a list of the best additional dictionaries to experiment with, you can check this link.
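For readers unfamiliar with k-fold cross validation, here is a minimal sketch of the idea. The data is split into k folds, each fold takes a turn as the test set, and the per-fold accuracies are averaged. The function names, toy data and stand-in predictor are all hypothetical; whether the actual script shuffles or stratifies the folds is not stated in this post, so this sketch simply uses contiguous folds.

```python
def k_fold_indices(n_samples, k=4):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def cross_validate(samples, labels, predict, k=4):
    """Mean accuracy of `predict` over the k test folds.

    The train indices are unused here because a dictionary lookup has no
    training step; a supervised model would be fitted on them first.
    """
    fold_scores = []
    for _train, test in k_fold_indices(len(samples), k):
        hits = sum(predict(samples[i]) == labels[i] for i in test)
        fold_scores.append(hits / len(test))
    return sum(fold_scores) / len(fold_scores)


# Hypothetical toy data and a stand-in predictor (not the AFINN model).
samples = ["good", "bad", "good fun", "dull", "good", "awful", "not good", "bad"]
labels = [1, 0, 1, 0, 1, 0, 0, 0]
predict = lambda review: 1 if "good" in review.split() else 0

mean_accuracy = cross_validate(samples, labels, predict, k=4)
print(mean_accuracy)  # 0.875
```

The one miss comes from "not good", which the stand-in predictor labels as positive, a small reminder of why plain lookups struggle with negation.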
The Bag of Words is one of the most common techniques for representing textual data, especially for the purpose of applying supervised learning algorithms. Machine learning algorithms generally require datasets to be represented as an $N \times M$ matrix. Here $N$ represents the number of samples in the dataset, which in our case would be the number of reviews. This can vary between the training and testing datasets. $M$ represents the number of features per sample, which should be the same for the training and testing datasets.
Maintaining a constant number of features is straightforward for numerical datasets. Here’s an example from the classical problem of predicting whether a person will survive the Titanic disaster (link). In the diagram below, we can clearly see that the number of features is fixed at four.
On the other hand, since text consists of streams of variable-sized sentences, we need to convert the raw data to a form which can be fed to machine learning algorithms. This is where the Bag-of-Words (BOW for short) comes into play. Some terms to keep in mind:
- Vocabulary : This refers to a set which includes every word which is present in the data. As it is a set, each word is distinct.
- Fit and Transform : Every machine learning feature extraction algorithm (such as Bag of Words) consists of a fit and a transform operation. The fit operation is applied only to the training data, from which certain bits of information are extracted (use-case dependent). This information is then used in the transform operation, which is applied to both the training and testing datasets.
- Sparse Matrix : A sparse matrix is a way of representing huge matrices whose elements are mostly zeros. Read more here.
Computing the Bag-of-Words Matrix
- The fitting stage is carried out first, where the vocabulary is computed from the training dataset. Let the number of words in the vocabulary be $V$.
- The transform operation is called next, where a new matrix of dimensions $N \times V$ is first created, with $N$ being the number of reviews and $V$ the vocabulary size. In this matrix, each row corresponds to one review/sample. Since the number of columns corresponds to the vocabulary size, each column holds the counts of one term in the vocabulary, e.g. if the first word in the vocabulary is good, the (1,1) position of the BOW matrix represents the number of times good appears in the 1st review. Similarly, the (2,1) position represents the number of times good appears in the 2nd review.
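The two stages above can be sketched in a few lines of plain Python. This is a toy stand-in, not scikit-learn's implementation: the `BagOfWords` class and the sample reviews are hypothetical, and words never seen during fitting are simply ignored at transform time.

```python
class BagOfWords:
    """Toy count-based Bag-of-Words with separate fit and transform steps."""

    def fit(self, documents):
        # Vocabulary = every distinct word in the training data, in sorted order.
        vocabulary = sorted({word for doc in documents for word in doc.split()})
        self.vocabulary_ = {word: index for index, word in enumerate(vocabulary)}
        return self

    def transform(self, documents):
        matrix = []
        for doc in documents:
            row = [0] * len(self.vocabulary_)   # one column per vocabulary word
            for word in doc.split():
                index = self.vocabulary_.get(word)
                if index is not None:           # words unseen during fit are ignored
                    row[index] += 1
            matrix.append(row)
        return matrix


train = ["good good story", "bad story"]
test = ["good ending"]

bow = BagOfWords().fit(train)   # fit on the training data only
print(bow.vocabulary_)          # {'bad': 0, 'good': 1, 'story': 2}
print(bow.transform(train))     # [[0, 2, 1], [1, 0, 1]]
print(bow.transform(test))      # [[0, 1, 0]]
```

Because the vocabulary is fixed at fit time, the training and testing matrices end up with the same number of columns, which is exactly the constant-feature requirement described earlier.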
It can be understood that as the number and length of sentences increase, the dimensions and sparsity of the BOW matrix will also increase. Thus such matrices are generally stored in sparse matrix form.
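To make the idea of sparse storage concrete, here is a small sketch that keeps only the non-zero counts of each row. The dictionary-per-row layout and the helper names are assumptions chosen for readability; libraries such as SciPy use more compact compressed formats.

```python
def to_sparse_rows(dense_matrix):
    """Store each row as {column_index: count}, keeping only non-zero entries."""
    return [{j: v for j, v in enumerate(row) if v != 0} for row in dense_matrix]


def sparsity(dense_matrix):
    """Fraction of entries in the matrix that are zero."""
    total = sum(len(row) for row in dense_matrix)
    zeros = sum(row.count(0) for row in dense_matrix)
    return zeros / total if total else 0.0


# Hypothetical 3 x 6 BOW matrix: 3 reviews, vocabulary of 6 words.
dense = [
    [0, 2, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 3],
]

sparse = to_sparse_rows(dense)
print(sparse)           # [{1: 2, 2: 1}, {0: 1}, {5: 3}]
print(sparsity(dense))  # 14 of the 18 entries are zero
```

With a real vocabulary of tens of thousands of words and reviews of a few hundred words each, almost every entry is zero, so storing only the non-zero counts saves a great deal of memory.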
The BOW matrix can be directly passed to a machine learning algorithm to get a pretty decent benchmark.
Here is the code for the same:
Note: The utilities.py script contains some helper functions that can be found here. Ensure that you modify the dataset path in the load_data method correctly.
As you can see, most of the code uses scikit-learn, where the BOW matrix is generated using the TfidfVectorizer. The classifier used is Naive Bayes, which in spite of being simple is surprisingly effective on text. The BOW matrix gives a decent accuracy of 84.21% on the binary labelled data and 56.14% on the 5-level labels. Finally, it must be kept in mind that BOW suffers from the major drawback of lacking semantic context, since the ordering of words in the sentence is not tracked. Thus even an ML algorithm may get confused by sentences such as “not bad but good”, as it does not know whether not appeared before bad or before good.
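The loss of word order is easy to demonstrate. In this minimal sketch, using a plain `Counter` as a stand-in for one row of the BOW matrix, two sentences with opposite meanings end up with exactly the same representation.

```python
from collections import Counter


def bag_of_words(sentence):
    """Word counts for one sentence; the order of the words is discarded."""
    return Counter(sentence.split())


first = bag_of_words("not bad but good")
second = bag_of_words("not good but bad")

print(first == second)  # True: both sentences map to the same bag
```

Any classifier fed these counts receives identical input for both sentences, so it has no way of telling them apart.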
Next up, I’ll be going into the details of a completely unsupervised system for sentiment analysis, by utilising search engine results.
This list currently consists of those documents which I feel are easiest to follow whilst being crucial. Please feel free to suggest more.
Dictionary based Lookup
Blogs and Websites
- Basic Sentiment Analysis with Python
- Be careful with Dictionary based text analysis
- Twitter sentiment analysis using Python and NLTK