chapter six

6 Working with text data

This chapter covers

  • Removing whitespace from strings
  • Uppercasing and lowercasing strings
  • Finding and replacing characters in strings
  • Slicing a string by character index positions
  • Splitting text by a delimiter

Text data can get quite messy. Real-world data sets are riddled with incorrect characters, inconsistent letter casing, stray whitespace, and more. The process of cleaning data is called wrangling or munging, and it often takes up the majority of a data analysis. We may know early on which insight we want to derive, but the difficulty lies in arranging the data into a shape suitable for the analysis. Luckily for us, one of the primary motivations behind pandas was easing the difficulty of cleaning up improperly formatted text values. The library is battle-tested and flexible. In this chapter, we’ll learn how to use pandas to fix all sorts of imperfections in our text data sets. There’s a lot of ground to cover, so let’s dive right in.

6.1 Letter casing and whitespace

We’ll begin by importing pandas in a new Jupyter Notebook:

In  [1] import pandas as pd

This chapter’s first data set, chicago_food_inspections.csv, is a listing of more than 150,000 food inspections conducted across the city of Chicago. The CSV includes only two columns: one with an establishment’s name and the other with its risk ranking. The four risk levels are Risk 1 (High), Risk 2 (Medium), Risk 3 (Low), and a special All for the worst offenders.
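As a minimal sketch of the cleanup this section is concerned with, we can build a small stand-in DataFrame and apply a few string methods through the str accessor. The rows below are invented for illustration, and the column names Name and Risk are assumptions rather than headers confirmed from chicago_food_inspections.csv.

```python
import pandas as pd

# Hypothetical stand-in rows. With the real file, the DataFrame would
# come from pd.read_csv("chicago_food_inspections.csv") instead.
# The column names "Name" and "Risk" are assumptions.
inspections = pd.DataFrame({
    "Name": ["  SUNRISE DINER  ", " LAKESIDE PIZZA", "GREEN LEAF CAFE   "],
    "Risk": ["Risk 1 (High)", "Risk 2 (Medium)", "Risk 3 (Low)"],
})

# The str accessor applies a string method to every value in a Series.
# strip removes leading and trailing whitespace from each name.
inspections["Name"] = inspections["Name"].str.strip()

# lower and title each return a new Series with the casing changed;
# the original column is left as-is unless we assign the result back.
lowered = inspections["Name"].str.lower()
titled = inspections["Name"].str.title()

print(inspections["Name"].tolist())
print(lowered.tolist())
print(titled.tolist())
```

Each method returns a new Series, so a change only persists when we assign the result back to the column, as the strip step does here.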

6.2 String slicing

6.3 String slicing and character replacement

6.4 Boolean methods

6.5 Splitting strings

6.6 Coding challenge

6.6.1 Problems

6.6.2 Solutions

6.7 A note on regular expressions

Summary