7 Memory management with Python

 

This chapter covers

  • How to profile your code for memory usage and issues
  • Handling and consuming large datasets
  • Optimizing data types for memory
  • Training an ML model when your data doesn’t fit in memory
  • Making use of Python’s data structures for memory efficiency

For this chapter, our primary dataset will be the ad clicks dataset available at this link: https://www.kaggle.com/competitions/avazu-ctr-prediction/data. The training dataset available there has over 40 million rows. Many of the techniques we will discuss in this chapter also apply to even larger datasets, such as those with billions of rows.

In Chapter 6, we covered using line-profiler to profile your code for computational speed and efficiency issues. That made it easy to identify which points in our code take the longest to run. We'll start this chapter by discussing memory profiling, a similar mechanism for identifying which points in your code consume the most memory.

7.1 Memory profiler

A memory profiler is a tool that identifies how much memory is consumed by the various operations in your code. Just as we profiled for computation time in the last chapter, we can perform an analogous check for memory.

7.1.1 High-level memory summaries with guppy
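guppy (published for Python 3 as the guppy3 package) gives you a high-level summary of everything currently on the heap. Here is a minimal sketch of taking and printing a heap snapshot:

from guppy import hpy

inspector = hpy()             # start a heap inspection session
snapshot = inspector.heap()   # snapshot of every object currently in memory
print(snapshot)               # summary table of object counts and sizes by type

The printed summary groups objects by type, which makes it easy to spot the handful of types responsible for most of your memory use.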

7.1.2 Analyzing your memory consumption line by line with memory-profiler
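memory-profiler reports memory use line by line, much like line-profiler does for run time. A minimal sketch, using a toy function for illustration:

from memory_profiler import profile

@profile                     # report memory usage for each line of this function
def build_squares():         # toy function purely for illustration
    data = [i ** 2 for i in range(1_000_000)]   # allocates a large list
    return sum(data)

if __name__ == "__main__":
    build_squares()

Running this as a script prints a table showing, for every line of build_squares, the memory in use and the increment that line added.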

7.2 Sampling and chunking large datasets

7.2.1 Reading from a large CSV file using chunks
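Passing chunksize to pandas' read_csv returns an iterator of data frames rather than loading the whole file at once. A minimal sketch, assuming the training file has been downloaded locally as train.csv:

import pandas as pd

total_rows = 0
# each iteration yields one DataFrame of up to 1,000,000 rows
for chunk in pd.read_csv("train.csv", chunksize=1_000_000):
    total_rows += len(chunk)   # per-chunk work; here we just count rows
print(total_rows)

Only one chunk is in memory at a time, so this works even when the full file would not fit.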

7.2.2 Random selection
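One way to sample a large CSV without reading it all first is to pass a callable to read_csv's skiprows argument; pandas calls it with each row index and skips any row for which it returns True. A sketch keeping roughly 1% of rows (the train.csv filename is again an assumption):

import random
import pandas as pd

keep_fraction = 0.01   # keep roughly 1% of rows

# row 0 is the header, so always keep it; skip other rows at random
sample = pd.read_csv(
    "train.csv",
    skiprows=lambda i: i > 0 and random.random() > keep_fraction,
)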

7.2.3 Chunking when reading from a database
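pandas' read_sql accepts the same chunksize argument, yielding query results a chunk at a time instead of materializing them all at once. A sketch against a hypothetical SQLite database clicks.db with a clicks table:

import sqlite3
import pandas as pd

conn = sqlite3.connect("clicks.db")   # hypothetical database file and table

row_count = 0
# only one chunk of query results is held in memory at a time
for chunk in pd.read_sql("SELECT * FROM clicks", conn, chunksize=100_000):
    row_count += len(chunk)
conn.close()
print(row_count)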

7.3 Optimizing data types for memory

7.3.1 Checking data types
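The dtypes attribute shows the type pandas assigned to each column. A quick sketch on a small sample of the training file (filename assumed):

import pandas as pd

df = pd.read_csv("train.csv", nrows=1000)   # small sample for illustration
print(df.dtypes)                            # one row per column with its type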

7.3.2 How to check the memory usage of a data frame
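To see what a data frame actually costs in memory, pass deep=True so pandas measures the contents of object (string) columns rather than just their pointers. A sketch:

import pandas as pd

df = pd.read_csv("train.csv", nrows=1000)
print(df.memory_usage(deep=True))          # bytes per column, plus the index
print(df.memory_usage(deep=True).sum())    # total bytes for the frame
df.info(memory_usage="deep")               # summary view including a total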

7.3.3 How to check the memory usage of a column
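The same method is available on a single column. A sketch using the dataset's site_id column:

import pandas as pd

df = pd.read_csv("train.csv", nrows=1000)
# deep=True again accounts for the actual string contents
print(df["site_id"].memory_usage(deep=True))   # bytes used by one column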

7.3.4 Converting numeric data types to be more memory-efficient
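pd.to_numeric with the downcast argument shrinks a numeric column to the smallest type that can hold its values. A sketch using the dataset's click column, which contains only 0s and 1s:

import pandas as pd

df = pd.read_csv("train.csv", nrows=1000)
# int64 by default; 0/1 values fit comfortably in int8
df["click"] = pd.to_numeric(df["click"], downcast="integer")
print(df["click"].dtype)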

7.3.5 Category data type
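Converting a low-cardinality string column to the category data type stores each unique value once, with rows holding small integer codes. A sketch using the dataset's site_category column:

import pandas as pd

df = pd.read_csv("train.csv", nrows=1000)
before = df["site_category"].memory_usage(deep=True)
df["site_category"] = df["site_category"].astype("category")
after = df["site_category"].memory_usage(deep=True)
print(before, after)   # the category version is typically far smaller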

7.3.6 Sparse data type
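When a column is dominated by a single fill value (often zero), pandas' sparse data type stores only the other values and their positions. A toy sketch:

import pandas as pd

s = pd.Series([0] * 10_000 + [1])               # mostly zeros
sparse = s.astype(pd.SparseDtype("int64", fill_value=0))
print(s.memory_usage(), sparse.memory_usage())  # sparse is far smaller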

7.3.7 Specifying data types when reading in a dataset
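Rather than converting after the fact, you can declare compact types up front so pandas never allocates the larger defaults in the first place. A sketch reusing the conversions above:

import pandas as pd

df = pd.read_csv(
    "train.csv",
    nrows=1000,
    dtype={"click": "int8", "site_category": "category"},
)
print(df.dtypes)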

7.3.8 Summary of data types and memory

7.4 Limiting number of columns
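If you only need a few columns, the usecols argument to read_csv skips the rest entirely, which can cut memory use dramatically on a wide file. A sketch with a hand-picked subset of the dataset's columns:

import pandas as pd

df = pd.read_csv(
    "train.csv",
    usecols=["click", "hour", "site_category"],   # load only these columns
    nrows=1000,
)
print(df.columns.tolist())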

7.5 Processing workflow for individual chunks of data
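The pieces above combine into a chunk-at-a-time workflow: read a chunk, transform it, feed it to a model that supports incremental learning, and discard it. A minimal sketch with scikit-learn's SGDClassifier, whose partial_fit method updates the model one batch at a time (in scikit-learn versions before 1.1 the loss is spelled "log" rather than "log_loss"; the two raw columns used as features here are deliberately simplistic and only illustrate the pattern):

import pandas as pd
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss="log_loss")   # logistic regression fit incrementally

for chunk in pd.read_csv(
    "train.csv",
    chunksize=1_000_000,
    usecols=["click", "banner_pos", "hour"],
):
    X = chunk[["banner_pos", "hour"]]   # naive numeric features for illustration
    y = chunk["click"]
    # the full dataset never needs to be in memory at once
    model.partial_fit(X, y, classes=[0, 1])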