6 Grouping, joining, and sorting

 

So far, we have looked at how to create data frames, read data into them, clean the data, and then analyze that clean, imported data in a number of ways. But analysis sometimes requires more than just the basics: we often need to break our input data apart, zoom in on particularly interesting subsets, combine data from different sources, transform the data into a new format or value, and then sort it according to a variety of criteria. This type of action is known in the pandas world as split-apply-combine, and it is our focus in this chapter. If you have experience with SQL and relational databases, you’ll find many similarities, in both principle and name, to functionality in pandas.

For example, a company may want to determine its total sales in the last quarter. It may also want to learn which countries have done particularly well (or poorly). Or perhaps the head of sales would like to see how much each individual salesperson has brought in, or how much each product has contributed to the company’s income.

These types of questions can be answered using a technique known as grouping. Much like the GROUP BY clause in an SQL query, we can use grouping in pandas to ask the same question for various subsets of our data.

Exercise 29 Longest taxi rides

Working it out

Solution

Beyond the exercise

Exercise 30 Taxi ride comparison

Working it out

Solution

Beyond the exercise

Exercise 31 Tourist spending per country

Working it out

Solution

Beyond the exercise

Summary