chapter seven

7 High-performance pandas and Apache Arrow

This chapter covers

Optimizing memory usage when creating pandas data frames
Decreasing the computational cost of pandas operations
Using Cython, NumExpr, and NumPy to accelerate pandas operations
Optimizing pandas with Apache Arrow

In the Python world, data analytics is nearly synonymous with using pandas. pandas is a data frame library, that is, a library for processing tabular data, and it is the de facto standard for processing in-memory tabular data in Python. In this chapter, we will discuss approaches to optimizing pandas usage. The approach is two-pronged: we will optimize pandas usage directly, and we will also optimize it using Apache Arrow.

Apache Arrow provides language-agnostic functionality to efficiently access columnar data, to share these data across different language implementations, and to transfer data to different processes and even to different computers. It can complement pandas from a performance perspective in three ways: by introducing faster algorithms for basic operations such as reading CSV files, by translating pandas data frames to a format that lower-level languages can process faster, and by enhancing the serialization mechanisms used to transfer data frames across different computers.

7.1 Optimizing memory and time when loading data

7.1.1 Compressed vs. uncompressed data

7.1.2 Type inference of columns

7.1.3 The effect of data type precision

7.1.4 Recoding and reducing data

7.2 Techniques to increase data analysis speed

7.2.1 Using indexing to accelerate access

7.2.2 Row iteration strategies

7.3 pandas on top of NumPy, Cython, and NumExpr

7.3.1 Explicit use of NumPy

7.3.2 pandas on top of NumExpr

7.3.3 Cython and pandas

7.4 Reading data into pandas with Arrow

7.4.1 The relationship between pandas and Apache Arrow

7.4.2 Reading a CSV file

7.4.3 Analyzing with Arrow