Lab HashSentiment - Movie Review Sentiment Analysis

The partial source code for this lab is at http://vpl.ccom.uprp.edu.

This is an adaptation of the Movie Review Sentiment Analysis Nifty Assignment by Eric D. Manley and Timothy M. Urness of Drake University (http://nifty.stanford.edu/2016/manley-urness-movie-review-sentiment/). Most of the text is copied verbatim from the original assignment, with a few extra illustrations by R. Arce Nazario.

Here it goes . . .

This assignment is designed to give you experience with hash tables. It is based on a programming competition (https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews) on sentiment analysis and machine learning.

Sentiment Analysis: the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic, product, etc., is positive, negative, or neutral.

Machine Learning: a branch of computer science that explores the construction and study of algorithms that can learn from data. Such algorithms operate by building a model from example inputs and using that model to make predictions or decisions.

The data that the algorithm is going to “learn” from is a set of 8,529 movie reviews in which the sentiment of each review has been manually rated on a scale from 0 to 4. The sentiment labels are:

0 - negative
1 - somewhat negative
2 - neutral
3 - somewhat positive
4 - positive

The data has been formatted so that it is easy for C++ programs to identify each word (or punctuation mark). Each line of the data holds the score of a review followed by the text of the review.

The purpose of this lab is to complete an algorithm that will use the provided data to predict the score of a new movie based on its review. For instance, it will predict that a review such as “This movie is exemplary in its use of visual effects” is probably a positive review and that “This is the worst movie experience in my life” is probably a negative review.

The mechanism by which we will predict the score requires that we keep the following information for each of the words in the training set:

the word: the word in lowercase
number of appearances: the number of times that the word appears throughout the training data.
total score: the sum of the scores of the reviews where the word appears. The score is added once for each appearance of the word in the review.

We will call the class of object that contains the information for one word a WordEntry (you can read the declaration in WordEntry.h).

Let’s say that our training data consists only of the following lines:

1 Aggressive self-glorification and a manipulative whitewash of history. 
4 A comedy-drama of nearly epic proportions rooted in a sincere performance by the title character undergoing midlife crisis .

Then we should keep information about the words “aggressive”, “self-glorification”, “and”, “a”, “manipulative”, “whitewash”, “of”, “history”, “comedy-drama”, “nearly”, “epic”, “proportions”, “rooted”, “in”, “sincere”, “performance”, “by”, “the”, “title”, “character”, “undergoing”, “midlife”, “crisis”. The WordEntry objects for some of the words would be:

word: “a”, numAppearances: 3, totalScore: 9
word: “aggressive”, numAppearances: 1, totalScore: 1
word: “of”, numAppearances: 2, totalScore: 5
word: “crisis”, numAppearances: 1, totalScore: 4

Make sure that you understand the quantities in these objects given the example (two-line) training data.
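
One way to check your understanding is to replay the appearances of a word by hand. The following is a minimal sketch, not the lab's actual class: the name WordEntrySketch and its members addAppearance and average are assumptions made for illustration, and the real declaration in WordEntry.h may look different.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the lab's WordEntry; the real declaration
// lives in WordEntry.h and may use different names and members.
struct WordEntrySketch {
    std::string word;
    int numAppearances;
    int totalScore;

    explicit WordEntrySketch(const std::string& w)
        : word(w), numAppearances(0), totalScore(0) {}

    // Record one appearance of the word in a review with the given score.
    void addAppearance(int reviewScore) {
        assert(reviewScore >= 0 && reviewScore <= 4);  // labels range from 0 to 4
        ++numAppearances;
        totalScore += reviewScore;
    }

    // Average score of the word; 0.0 if it has never appeared.
    double average() const {
        if (numAppearances == 0) return 0.0;
        return static_cast<double>(totalScore) / numAppearances;
    }
};
```

For the word “a”, calling addAppearance(1) once (first review) and addAppearance(4) twice (second review) leaves numAppearances at 3 and totalScore at 9, which matches the entry listed above.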

Keeping track of the WordEntry information for all the words in the training data requires a data structure that efficiently inserts and searches objects based on a key. In this application the key is the word, the value is the rest of the information that is kept about the word (i.e. the totalScore and the numAppearances). Hash tables to the rescue!
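
To make the key/value idea concrete, here is a rough sketch of a hash table that resolves collisions by separate chaining. It is only an illustration under stated assumptions: the names ChainedTable, Stats, put and find, and the polynomial hash function are all hypothetical and are not taken from the starter code, whose HashTable interface may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Hypothetical value record (the part of a WordEntry other than the key).
struct Stats {
    int numAppearances;
    int totalScore;
};

// Illustrative separate-chaining table keyed by word.
class ChainedTable {
public:
    explicit ChainedTable(std::size_t bucketCount) : buckets_(bucketCount) {
        assert(bucketCount > 0);
    }

    // Count one appearance of `word` in a review scored `score`,
    // creating the entry the first time the word is seen.
    void put(const std::string& word, int score) {
        auto& chain = buckets_[indexOf(word)];
        for (auto& item : chain) {
            if (item.first == word) {
                ++item.second.numAppearances;
                item.second.totalScore += score;
                return;
            }
        }
        chain.push_back(std::make_pair(word, Stats{1, score}));
    }

    // Return a pointer to the word's stats, or nullptr if it is absent.
    const Stats* find(const std::string& word) const {
        const auto& chain = buckets_[indexOf(word)];
        for (const auto& item : chain) {
            if (item.first == word) return &item.second;
        }
        return nullptr;
    }

private:
    // Simple polynomial string hash reduced to a bucket index.
    std::size_t indexOf(const std::string& word) const {
        std::size_t h = 0;
        for (unsigned char c : word) h = h * 31 + c;
        return h % buckets_.size();
    }

    std::vector<std::list<std::pair<std::string, Stats>>> buckets_;
};
```

With a reasonable hash function and enough buckets, each chain stays short, so both put and find examine only a few entries on average.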

Once the algorithm reads through the training data and computes all the necessary WordEntry objects, we will compute the score of a new review by summing the average score of each word in the review and dividing by the total number of words in the review. For example, suppose that we read the two-line training data. Then the score of a review that says “Crisis of aggressive” would be:

$\frac{\text{averageScore("crisis")} + \text{averageScore("of")} + \text{averageScore("aggressive")}}{3}$

$\frac{4 + 2.5 + 1}{3} = 2.5$
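
Each average in the numerator is the word's totalScore divided by its numAppearances, using the values from the two-line training data:

$\text{averageScore("crisis")} = \frac{4}{1} = 4$

$\text{averageScore("of")} = \frac{5}{2} = 2.5$

$\text{averageScore("aggressive")} = \frac{1}{1} = 1$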

Thus, the complete algorithm is as follows:

For each review in the training samples:
    Read the score
    Read the rest of the line.
    For each word w in the line:
        update w’s WordEntry in the hash table, or if this is the first time that
        we encounter w, create the WordEntry for w. 


For each new review:
    sum = 0, ctr = 0
    For each word w of the new review:
        sum = sum + averageScore(w)
        ctr++
    Print sum / ctr
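
The two loops above can be sketched in C++ as follows. This is not the lab's main.cpp: std::unordered_map stands in for the HashTable class you will write, and the names Tally, Table, toLower, train, predict and kUnknownWordScore are hypothetical. The handout does not say what averageScore returns for a word that never appeared in the training data, so the sketch assumes the neutral value 2.

```cpp
#include <cassert>
#include <cctype>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical per-word record; the lab's WordEntry may differ.
struct Tally {
    int numAppearances;
    int totalScore;
};

// std::unordered_map is a stand-in for the lab's HashTable class.
using Table = std::unordered_map<std::string, Tally>;

// Assumption: unseen words are scored as neutral (2).
const double kUnknownWordScore = 2.0;

// Lowercase a token so that "Crisis" and "crisis" share one entry.
std::string toLower(std::string s) {
    for (char& c : s) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return s;
}

// Training loop: each line is "<score> <review text>".
void train(Table& table, const std::vector<std::string>& lines) {
    for (const std::string& line : lines) {
        std::istringstream in(line);
        int score;
        if (!(in >> score)) continue;  // skip lines without a leading score
        assert(score >= 0 && score <= 4);
        std::string w;
        while (in >> w) {
            Tally& t = table[toLower(w)];  // zero-initialized on first use
            ++t.numAppearances;
            t.totalScore += score;
        }
    }
}

// Prediction loop: mean of the per-word average scores.
double predict(const Table& table, const std::string& review) {
    std::istringstream in(review);
    std::string w;
    double sum = 0.0;
    int ctr = 0;
    while (in >> w) {
        auto it = table.find(toLower(w));
        if (it == table.end()) {
            sum += kUnknownWordScore;
        } else {
            sum += static_cast<double>(it->second.totalScore) /
                   it->second.numAppearances;
        }
        ++ctr;
    }
    return ctr == 0 ? 0.0 : sum / ctr;
}
```

Training this sketch on the two example lines and calling predict with “Crisis of aggressive” gives 2.5, the same value as the hand computation. Note that the sketch splits tokens only on whitespace, so it relies on the training data already separating words from punctuation.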

The starter code for this lab is in VPL. The main function (main.cpp) implements the algorithm. Your responsibility is to implement the HashTable.cpp and WordEntry.cpp files. Recommendations:

Start by implementing the WordEntry member functions. Test by creating an object and validating that it produces correct results.
Once you implement the hash table, test the implementation with a small training data set.

Example output:

enter a review -- Press return to exit: 
A weak script that ends with a quick and boring finale
The review has an average value of 1.79128
Negative Sentiment

enter a review -- Press return to exit:
Loved every minute of it
The review has an average value of 2.39219
Positive Sentiment
