7 High performance Pandas and Apache Arrow

This chapter covers:

Optimizing memory usage when creating Pandas dataframes
Tips and tricks to decrease the computational cost of Pandas operations
Using Cython, NumExpr and NumPy to accelerate Pandas operations
Understanding the relationship between Pandas and Apache Arrow
Optimizing Pandas operations by using Apache Arrow replacements
Using Apache Arrow to transfer dataframe operations to more efficient lower-level languages

Data analytics in Python means, in many cases, Pandas. Pandas is a data frame library, that is, a library for processing tabular data under the assumption that the full table can be loaded into memory. It is the de facto standard in the Python world for processing in-memory tabular data. In this chapter we discuss approaches to optimizing Pandas usage. We take a two-pronged approach: we will optimize Pandas usage directly, and we will also see how to optimize it using Apache Arrow.

7.1 Memory and time optimization of data loading

7.2 Techniques to increase data analysis speed

7.2.1 Using indexing to accelerate access

7.2.2 Row iteration strategies

7.3 Pandas on top of NumPy, Cython and NumExpr

7.3.1 Explicit use of NumPy

7.3.2 Pandas on top of NumExpr

7.3.3 Cython and Pandas

7.4 Introducing Apache Arrow and reading data into Pandas with Arrow

7.4.1 The relationship between Pandas and Apache Arrow

7.4.2 Reading a CSV file

7.4.3 Doing analysis with Arrow

7.5 Using Arrow interop to delegate work to more efficient languages and systems

7.5.1 Implications of Arrow’s language interop architecture

7.5.2 Zero-copy operations on data with Arrow’s Plasma server

7.6 Summary