chapter six

6 Working with text data

 

This chapter covers:

  • Removing whitespace from strings
  • Altering the casing of strings
  • Replacing characters in a string
  • Slicing a string on index positions
  • Splitting a string on occurrences of a delimiter

Real-world data is often messy. Datasets are riddled with stray whitespace, invalid characters, inconsistent casing, and more. One of the primary inspirations for the creation of Pandas was to ease the difficulty of cleaning up these improperly formatted values. The process of smoothing data into an optimal shape before analysis is called wrangling or munging. In this chapter, we'll explore the methods the library offers for cleaning up text data efficiently.
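To make these operations concrete before we dive in, here is a minimal sketch of what each one can look like through the `str` accessor on a Series. The sample values and variable names below are hypothetical, made up purely for illustration. They are not taken from the chapter's datasets.

```python
import pandas as pd

# Hypothetical messy values, invented for illustration only.
names = pd.Series(["  corner CAFE ", "taco-stand  ", " Pizza|Pasta"])

# Remove leading and trailing whitespace from every value.
stripped = names.str.strip()

# Normalize the casing to all lowercase.
lowered = stripped.str.lower()

# Replace a literal hyphen with a space (regex=False avoids pattern matching).
replaced = lowered.str.replace("-", " ", regex=False)

# Slice each string by index position: keep the first four characters.
first_four = replaced.str.slice(0, 4)

# Split each string on occurrences of a delimiter, producing lists.
parts = replaced.str.split("|")

print(parts)
```

Each call returns a new Series and leaves the original untouched, so the steps can be chained or assigned back to a column one at a time.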

6.1   String Casing

Let's begin by importing Pandas into our Jupyter Notebook.

In  [1] import pandas as pd

This chapter's first dataset is a listing of more than 150,000 food inspections in the city of Chicago, Illinois. It includes two columns: one with the name of each establishment and the other with a risk ranking. Let's take a look.

6.2   String Slicing

6.2.1   String Slicing and Character Replacement

6.3   Boolean Methods

6.4   Splitting Strings

6.5   Coding Challenge

6.6   A Note on Regular Expressions

6.7   Summary