6 Sentiment classification: large movie-review dataset

 

This chapter covers

  • Using text and word frequency (Bag of Words) to represent sentiment
  • Building a sentiment classifier using logistic regression and softmax regression
  • Measuring classification accuracy
  • Computing the ROC curve to measure classifier effectiveness
  • Submitting your results to the Kaggle challenge for Movie Reviews

One of the seemingly magical uses of machine learning that impresses nearly everyone nowadays is teaching a computer to learn from text. With social media, SMS, Facebook Messenger, WhatsApp, Twitter, and other sources generating hundreds of billions of text messages a day, there is no shortage of text to learn from.

See for yourself

Check out this famous infographic demonstrating the abundance of textual data arriving each day from various media platforms: https://www.textrequest.com/blog/how-many-texts-people-send-per-day/.

Social media companies, phone providers, and app makers are all trying to use the messages you send to make decisions and classify you. Have you ever sent your significant other an SMS text message about the Thai food you ate for lunch and then later seen ads pop up on your social media recommending new Thai restaurants to visit? Scary as it may seem that Big Brother is trying to identify and understand your food habits, the same techniques also have very practical applications, such as online streaming services trying to determine whether you enjoyed their films.
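As a small preview of the word-frequency idea this chapter builds on, the sketch below turns two made-up reviews into Bag of Words count vectors. The sample reviews, the `bag_of_words` helper, and the simple regular-expression tokenizer are assumptions chosen for illustration, not the chapter's actual dataset or cleaning code.

```python
import re
from collections import Counter

# Hypothetical tokenizer: lowercase the text and keep runs of letters/apostrophes.
def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

# Hypothetical helper: count how often each vocabulary word occurs in a text.
def bag_of_words(text, vocabulary):
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

# Two made-up reviews standing in for a real movie-review dataset.
reviews = [
    "I loved this movie, loved it",
    "I hated this movie",
]

# The vocabulary is every distinct word across all reviews, in sorted order.
vocabulary = sorted({token for review in reviews for token in tokenize(review)})

# Each review becomes a vector of word counts aligned with the vocabulary.
vectors = [bag_of_words(review, vocabulary) for review in reviews]

print(vocabulary)
for review, vector in zip(reviews, vectors):
    print(vector, "<-", review)
```

Each vector has one entry per vocabulary word, so reviews of different lengths end up as fixed-size inputs that a classifier such as logistic regression can consume.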

6.1 The Bag of Words model

6.1.1 Applying the Bag of Words model to Movie Reviews

6.1.2 Cleaning all the movie reviews

6.1.3 Exploratory Data Analysis on your Bag of Words

6.2 Building a sentiment classifier using logistic regression

6.2.1 Setting up the training for your model

6.2.2 Performing the training for your model

6.3 Making predictions using your sentiment classifier

6.4 Measuring the effectiveness of your classifier

6.5 Creating the softmax-regression sentiment classifier

6.6 Submitting your results to Kaggle

6.7 Summary