5 Natural language processing: Classifying social media sentiment

 

This chapter covers

  • Preparing text vectorization for quantitative features
  • Practicing cleaning and tokenizing raw text into features
  • Extracting and learning features with deep learning
  • Taking advantage of transfer learning with BERT

Our last two case studies covered very different domains but shared one major trait: both worked with structured tabular data. In the next two case studies, we will look at special cases that call for specific feature engineering techniques to make machine learning possible. In this case study, we focus on techniques from natural language processing (NLP), a branch of ML concerned with raw text data.

As discussed in previous chapters, unstructured data are prevalent, and data scientists often need to perform machine learning tasks on unstructured data such as text and images. A common NLP task is text classification or text regression: predicting a class label or a numeric value, respectively, given only raw text.
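To make "classification given only raw text" concrete, here is a minimal sketch using only the Python standard library. It assumes a handful of invented example sentences (not the chapter's dataset) and uses a tiny multinomial naive Bayes classifier over word counts, chosen purely for illustration; the names `tokenize`, `fit_naive_bayes`, and `predict` are hypothetical helpers, not part of any library.

```python
import math
import re
from collections import Counter

# Toy labeled examples, invented for illustration only (not the chapter's data).
train_texts = [
    "I love this airline, great service",
    "what a great flight, thank you",
    "terrible delay and rude staff",
    "worst flight ever, I hate delays",
]
train_labels = ["positive", "positive", "negative", "negative"]


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def fit_naive_bayes(texts, labels):
    """Count tokens per class; these counts are the whole 'model'."""
    word_counts = {}  # label -> Counter of token counts
    class_counts = Counter(labels)
    for text, label in zip(texts, labels):
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocabulary = {tok for counts in word_counts.values() for tok in counts}
    return word_counts, class_counts, vocabulary


def predict(text, model):
    """Return the label with the highest log-probability score."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_tokens = sum(counts.values())
        # Start from the log prior of the class.
        score = math.log(class_counts[label] / total_docs)
        for tok in tokenize(text):
            if tok in vocabulary:  # ignore words never seen in training
                # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
                score += math.log(
                    (counts[tok] + 1) / (total_tokens + len(vocabulary))
                )
        scores[label] = score
    return max(scores, key=scores.get)


model = fit_naive_bayes(train_texts, train_labels)
prediction = predict("great service, thank you", model)
print(prediction)
```

The only input to the model is the raw string; every numeric feature (here, per-class word counts) has to be constructed from it. That construction step is the feature engineering work this chapter is about.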

5.1 The tweet sentiment dataset

5.1.1 The problem statement and defining success

5.2 Text vectorization

5.2.1 Feature construction: Bag of words

5.2.2 Count vectorization

5.2.3 TF-IDF vectorization

5.3 Feature improvement

5.3.1 Cleaning noise from text

5.3.2 Standardizing tokens

5.4 Feature extraction

5.4.1 Singular value decomposition

5.5 Feature learning

5.5.1 Introduction to autoencoders