5 Filtering a DataFrame

This chapter covers

  • Reducing a DataFrame’s memory use
  • Extracting DataFrame rows by one or more conditions
  • Filtering a DataFrame for rows that include or exclude null values
  • Selecting column values that fall within a range
  • Removing duplicate and null values from a DataFrame

In chapter 4, we learned how to extract rows, columns, and cell values from a DataFrame by using the loc and iloc accessors. These accessors work well when we know the index labels and positions of the rows/columns we want to target. Sometimes, we may want to target rows not by an identifier but by a condition or a criterion. We may want to extract a subset of rows in which a column holds a specific value, for example.
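To make the contrast concrete, here is a minimal sketch that targets rows first by identifier and then by condition. The employees DataFrame, its column names, and its values are hypothetical, invented only for illustration; they are not the chapter's data set.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration.
employees = pd.DataFrame(
    {
        "Name": ["Ana", "Ben", "Cho", "Dev"],
        "Team": ["Sales", "Finance", "Sales", "Legal"],
    },
    index=["a", "b", "c", "d"],
)

# Targeting by identifier: we must already know the label or position.
by_label = employees.loc["c"]
by_position = employees.iloc[0]

# Targeting by condition: the comparison produces a Boolean Series,
# and the square brackets keep only the rows where it holds True.
in_sales = employees["Team"] == "Sales"
sales_rows = employees[in_sales]
```

With loc and iloc, we had to know that "c" and 0 were the rows we wanted. With the Boolean Series, pandas works out which rows qualify.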

In this chapter, we’ll learn how to declare logical conditions that include and exclude rows from a DataFrame. We’ll see how to combine multiple conditions by using AND and OR logic. Finally, we’ll introduce some pandas utility methods that simplify the filtering process. Lots of fun lies ahead, so let’s jump in.
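As a preview, the sketch below shows one way these ideas can look in code: two conditions combined with the & (AND) and | (OR) operators, an inversion with ~, and the isin and between utility methods. The employees DataFrame and its values are hypothetical stand-ins, not the data set used later in the chapter.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration.
employees = pd.DataFrame(
    {
        "Name": ["Ana", "Ben", "Cho", "Dev", "Eli"],
        "Team": ["Sales", "Finance", "Sales", "Legal", "Finance"],
        "Salary": [90000, 65000, 55000, 120000, 98000],
    }
)

# Each comparison yields a Boolean Series with one value per row.
is_sales = employees["Team"] == "Sales"
high_salary = employees["Salary"] > 80000

# AND: both conditions must hold. OR: at least one must hold.
both = employees[is_sales & high_salary]
either = employees[is_sales | high_salary]

# The tilde inverts every Boolean in the Series.
not_sales = employees[~is_sales]

# Utility methods that build the Boolean Series for us.
in_teams = employees[employees["Team"].isin(["Sales", "Legal"])]
mid_range = employees[employees["Salary"].between(60000, 100000)]
```

Note that pandas uses the symbols & and | rather than Python's and/or keywords for element-wise logic, and that between includes both endpoints by default.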

5.1 Optimizing a data set for memory use

5.1.1 Converting data types with the astype method

5.2 Filtering by a single condition

5.3 Filtering by multiple conditions

5.3.1 The AND condition

5.3.2 The OR condition

5.3.3 Inversion with ~

5.3.4 Methods for Booleans

5.4 Filtering by condition

5.4.1 The isin method

5.4.2 The between method

5.4.3 The isnull and notnull methods

5.4.4 Dealing with null values

5.5 Dealing with duplicates

5.5.1 The duplicated method

5.5.2 The drop_duplicates method

5.6 Coding challenge

5.6.1 Problems

5.6.2 Solutions

Summary