9 Strings

When most people think of pandas or data analysis in general, they think of numbers. And indeed, much of the work that people do with pandas is with numbers. That’s why pandas is built on top of NumPy, which takes advantage of C’s fast, efficient integers and floats. And that’s why so many of the exercises in this book involve working with numbers.

However, we often have to work with textual data—usernames, product names, sales regions, business units, ticker symbols, and company names are just a few examples. Sometimes the text is central to the analysis you’re doing—such as when you’re preparing data for a text-based machine-learning model—and other times, it’s secondary to the numbers and used as a description or categorical data.

It turns out that pandas is also well equipped to handle text. It does this not by storing string data in NumPy but rather by using fully fledged string objects: either those that come with Python or (more recently) a pandas-specific string class that reduces both ambiguity and errors. (I’ll have more to say about these two string types and when to use each one later in the chapter.) In either case, we can apply a wide variety of string methods to our data via the `str` accessor.
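To make the two storage options concrete, here is a minimal sketch. The names in the list are made up for illustration, and I request each dtype explicitly rather than relying on what pandas would choose by default, since that can depend on the version you have installed.

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration
names = ['alice', 'Bob', 'CAROL']

# Option 1: ordinary Python str objects, stored with the generic object dtype
s_obj = pd.Series(names, dtype=object)

# Option 2: the pandas-specific string dtype, requested by name
s_str = pd.Series(names, dtype='string')

print(s_obj.dtype)
print(s_str.dtype)

# Either way, string methods are available through the .str accessor
upper_obj = s_obj.str.upper()   # uppercase version of each element
lengths = s_str.str.len()       # number of characters in each element

print(upper_obj)
print(lengths)
```

The point is that the same `.str` methods work on both series; what differs is the dtype that pandas reports and, as discussed later in the chapter, how strictly each one treats non-string values.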

Exercise 36 Analyzing Alice

Working it out

Solution

Beyond the exercise

Exercise 37 Wine words

Working it out

Solution

Beyond the exercise

Exercise 38 Programmer salaries

Working it out

Solution

Beyond the exercise

Summary