chapter eight
8 Filtering a DataFrame
This chapter covers:
- Reducing the memory usage of a DataFrame
- Extracting a subset of rows from a DataFrame based on one or more conditions
- Filtering for rows that include or exclude null values
- Selecting values that fall between a range
- Removing duplicate and null values from a DataFrame
8.1 Optimizing A Dataset for Memory Usage
In the previous chapters, we practiced selecting individual rows, columns and values from a DataFrame. It's now time to build upon that foundation by exploring how we extract a subset of rows based on one or more conditions.
Let's begin with the usual import of our favorite data analysis library.
In [1] import pandas as pd
The employees.csv dataset for this chapter is a fictional collection of employees at a company. Each record includes the employee's first name, gender, start date at the firm, salary, management status (True or False), and team. Let's take a peek…
In [2] pd.read_csv("employees.csv")
Out [2]
First Name Gender Start Date Salary Mgmt Team
0 Douglas Male 8/6/93 NaN True Marketing
1 Thomas Male 3/31/96 61933.0 True NaN
2 Maria Female NaN 130590.0 False Finance
3 Jerry NaN 3/4/05 138705.0 True Finance
4 Larry Male 1/24/98 101004.0 True IT
… … … … … … …
996 Phillip Male 1/31/84 42392.0 False Finance
997 Russell Male 5/20/13 96914.0 False Product
998 Larry Male 4/20/13 60500.0 False Business Dev
999 Albert Male 5/15/12 129949.0 True Sales
1000 NaN NaN NaN NaN NaN NaN