chapter eight

8 Filtering a DataFrame

 

This chapter covers:

  • Reducing the memory usage of a DataFrame
  • Extracting a subset of rows from a DataFrame based on one or more conditions
  • Filtering for rows that include or exclude null values
  • Selecting values that fall between a range
  • Removing duplicate and null values from a DataFrame

8.1      Optimizing A Dataset for Memory Usage

In the previous chapters, we practiced selecting individual rows, columns and values from a DataFrame. It's now time to build upon that foundation by exploring how we extract a subset of rows based on one or more conditions.

Let's begin with the usual import of our favorite data analysis library.

In [1] import pandas as pd

The employees.csv dataset for this chapter is a fictional collection of employees at a company. Each record includes the employee's first name, gender, start date at the firm, salary, management status (True or False), and team. Let's take a peek…

In [2] pd.read_csv("employees.csv")
 
Out [2]
 
First Name Gender Start Date Salary Mgmt Team
0 Douglas Male 8/6/93 NaN True Marketing
1 Thomas Male 3/31/96 61933.0 True NaN
2 Maria Female NaN 130590.0 False Finance
3 Jerry NaN 3/4/05 138705.0 True Finance
4 Larry Male 1/24/98 101004.0 True IT
… … … … … … …
996 Phillip Male 1/31/84 42392.0 False Finance
997 Russell Male 5/20/13 96914.0 False Product
998 Larry Male 4/20/13 60500.0 False Business Dev
999 Albert Male 5/15/12 129949.0 True Sales
1000 NaN NaN NaN NaN NaN NaN

8.1.1   Converting Data Types with the as_type Method

8.2      Filtering by a Single Condition

8.3      Filtering by Multiple Conditions

8.3.1   The AND Condition

8.3.2   The OR Condition

8.3.3   Inversion with ~

8.3.4   Methods for Booleans

8.4      Filtering by Condition

8.4.1   The isin Method

8.4.2   The between Method

8.4.3   The isnull and notnull Methods

8.4.4   Dealing with Null Values

8.5      Dealing with Duplicates

8.5.1   The duplicated Method

8.5.2   The drop_duplicates Method

8.6      Coding Challenge

8.6.1   The Problem

8.6.2   Solutions

8.7            Summary