In Part 2, we’ll look at an unsupervised approach proposed by Peter Turney in his seminal paper, Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews. Unlike traditional dictionary and bag-of-words (BOW) based techniques, the approach is completely unsupervised and relies on information retrieval from knowledge bases (via search engines).
Note: Information about the datasets, metrics and preprocessing used has already been outlined in Sentiment Analysis Primer - Part 1.
- Search Engine Scraping : Be Wary!
- Technique Overview
- Phrase extraction
- The PMI-IR score
- The Code
- What's next?
Search Engine Scraping : Be Wary!
This line of research depends on the good will of major search engines.
Use of screen scrapers is disallowed by these engines. However, they do provide APIs:
- Google Custom Search API (read more here and here): ~40 requests/hour
- Microsoft Bing Search API and the python-bing API client: ~5K requests/month
The author of the paper himself used the AltaVista search engine, which was popular in the pre-Google days.
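With quotas this tight, it helps to never spend a request on a query that has already been answered and to space out the requests that do go through. Below is a minimal sketch of that idea. `CachedHitCounter` and its `fetch` argument are hypothetical names introduced for illustration: `fetch` stands in for whichever search API client you use and is assumed to take a query string and return a hit count.

```python
import time
from typing import Callable, Dict, Optional


class CachedHitCounter:
    """Cache hit counts and enforce a minimum delay between real API calls.

    `fetch` is a hypothetical callable (query -> number of results); it is an
    assumption standing in for a real search API client.
    """

    def __init__(self, fetch: Callable[[str], int], min_interval: float = 0.0):
        self._fetch = fetch
        self._min_interval = min_interval  # seconds between real API calls
        self._cache: Dict[str, int] = {}
        self._last_call: Optional[float] = None  # time of the last real call
        self.api_calls = 0  # how many requests actually reached the API

    def hits(self, query: str) -> int:
        # Repeated queries are served from the cache and cost no quota.
        if query in self._cache:
            return self._cache[query]
        # Sleep if the previous real call happened too recently.
        if self._last_call is not None:
            wait = self._min_interval - (time.monotonic() - self._last_call)
            if wait > 0:
                time.sleep(wait)
        count = self._fetch(query)
        self._last_call = time.monotonic()
        self.api_calls += 1
        self._cache[query] = count
        return count
```

At roughly 40 requests per hour, a `min_interval` of 90 seconds would keep the calls under the limit. Since the reference words are queried for every phrase, the cache alone removes a large share of the requests.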
Technique Overview
The diagram above gives a rough overview of how the proposed technique works. The main steps involved in obtaining the semantic orientation score are Phrase Extraction and PMI Scoring.
Step 1 : Phrase Extraction
Because this technique depends heavily on querying commercial search engines, and because only certain substrings contribute to the sentiment of a sentence, it makes sense to extract and query only those phrases. The author proposes that only phrases following specific part-of-speech patterns be queried to obtain the final score. The table of acceptable patterns is shown below:
The steps for phrase extraction are:
- Run a part-of-speech tagger (such as the one provided by NLTK) on the sentence in question.
- Check the tags of every run of three consecutive words. If they match a valid pattern from the table above, pass the phrase on for querying and scoring.
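The two steps above can be sketched roughly as follows. The patterns are written from my reading of Table 1 in Turney's paper (the first two words form the phrase, the third word only acts as a constraint), so treat them as an assumption and check them against the table. `extract_phrases` and `PATTERNS` are names made up for this sketch, and the input is assumed to be already tagged with Penn Treebank tags.

```python
from typing import List, Tuple

ADVERBS = {"RB", "RBR", "RBS"}
NOUNS = {"NN", "NNS"}
VERBS = {"VB", "VBD", "VBN", "VBG"}

# Each entry: (tags allowed for word 1, tags allowed for word 2,
#              whether word 3 may be a noun). Assumed from Turney's Table 1.
PATTERNS = [
    ({"JJ"}, NOUNS, True),
    (ADVERBS, {"JJ"}, False),
    ({"JJ"}, {"JJ"}, False),
    (NOUNS, {"JJ"}, False),
    (ADVERBS, VERBS, True),
]


def extract_phrases(tagged: List[Tuple[str, str]]) -> List[str]:
    """Return the two-word phrases whose tags match one of PATTERNS."""
    phrases = []
    for i in range(len(tagged) - 1):
        (w1, t1), (w2, t2) = tagged[i], tagged[i + 1]
        # The third word is not extracted; a missing third word
        # (end of sentence) is treated as "not a noun".
        t3 = tagged[i + 2][1] if i + 2 < len(tagged) else None
        for first, second, noun_ok in PATTERNS:
            if t1 in first and t2 in second and (noun_ok or t3 not in NOUNS):
                phrases.append(f"{w1} {w2}")
                break
    return phrases
```

With NLTK, the tagged input can be produced by `nltk.pos_tag(nltk.word_tokenize(sentence))`, which returns a list of `(word, tag)` pairs in the shape this function expects.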
Step 2 : The PMI-IR score
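PMI-IR stands for Pointwise Mutual Information and Information Retrieval. As defined in Turney's paper, the PMI between two words can be denoted as:
$$\mathrm{PMI}(word_1, word_2) = \log_2\left(\frac{p(word_1 \,\&\, word_2)}{p(word_1)\,p(word_2)}\right)$$
The probabilities are estimated from search engine hit counts, where hits(x) = number of results returned when x is issued as a query.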
This can be understood from the diagram below:
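Finally, the Semantic Orientation (SO) score of a phrase can be calculated, following Turney's paper, by
$$\mathrm{SO}(phrase) = \mathrm{PMI}(phrase, \text{``excellent''}) - \mathrm{PMI}(phrase, \text{``poor''})$$
This can also be expressed in terms of hit counts (by applying the log rules for products and quotients):
$$\mathrm{SO}(phrase) = \log_2\left(\frac{\mathrm{hits}(phrase \ \mathrm{NEAR}\ \text{``excellent''})\ \mathrm{hits}(\text{``poor''})}{\mathrm{hits}(phrase \ \mathrm{NEAR}\ \text{``poor''})\ \mathrm{hits}(\text{``excellent''})}\right)$$
Here NEAR is the search engine operator that restricts matches to documents in which the two terms occur close to each other.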
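To see why the two forms agree, here is a small, self-contained sketch that computes the score both ways from made-up hit counts. It is separate from the code for this post. The function names and the `total` parameter (a stand-in for the size of the search index, used to turn hit counts into probabilities) are assumptions for illustration. The `smoothing` parameter reflects my recollection that the paper adds 0.01 to the hits to avoid division by zero, so treat that value as an assumption too.

```python
import math


def pmi(hits_joint: float, hits_a: float, hits_b: float, total: float) -> float:
    """PMI with probabilities estimated as hits(x) / total (assumed estimate)."""
    p_joint = hits_joint / total
    p_a = hits_a / total
    p_b = hits_b / total
    return math.log2(p_joint / (p_a * p_b))


def so_from_pmi(near_excellent, near_poor, hits_phrase,
                hits_excellent, hits_poor, total):
    """SO as the difference of two PMI scores."""
    return (pmi(near_excellent, hits_phrase, hits_excellent, total)
            - pmi(near_poor, hits_phrase, hits_poor, total))


def so_from_hits(near_excellent, near_poor, hits_excellent, hits_poor,
                 smoothing=0.0):
    """SO in closed form; hits(phrase) and total cancel out.

    hits_excellent and hits_poor are assumed to be positive. Pass
    smoothing=0.01 to guard against zero NEAR counts (assumed value).
    """
    numerator = (near_excellent + smoothing) * hits_poor
    denominator = (near_poor + smoothing) * hits_excellent
    return math.log2(numerator / denominator)
```

For example, with 80 hits near "excellent", 20 near "poor", and 1000 hits for each reference word, both functions give log2(4) = 2, whatever values are chosen for `hits_phrase` and `total`. In practice this means only four hit counts are needed per phrase, two of which are the same for every phrase.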
Note: The utilities.py script contains some helper functions used by the code for this post and can be found here. Ensure that you modify the dataset path in the load_data method correctly.
The proposed technique is reported to have an accuracy of approximately 66% on movie review data from the Epinions (reference pending) dataset. I have not run a test on the complete dataset due to:
- The restrictions on screen scraping by modern search engines
- Strict limits on the number of API calls allowed by search engine APIs
Tests on a small subsample gave decent results. I have thought out a plan to overcome the screen scraping limitations, but it is still in development.
Next up, I’ll start working with a more recent method of representing textual data known as word embeddings.