7 High performance Pandas and Apache Arrow

 

This chapter covers:

  • Optimizing memory usage during Pandas dataframe creation
  • Tips and tricks to decrease the computational cost of Pandas operations
  • Using Cython, NumExpr and NumPy to accelerate Pandas operations
  • Understanding the relationship between Pandas and Apache Arrow
  • Optimizing Pandas operations by using Apache Arrow replacements
  • Using Apache Arrow to transfer dataframe operations to more efficient lower-level languages

Data analytics in Python means, in many cases, Pandas. Pandas is a dataframe library: a library for processing tabular data under the assumption that the full table can be loaded in memory. It is the de facto standard in the Python world for processing in-memory tabular data. In this chapter we discuss approaches to optimizing Pandas usage. This will be a two-pronged approach: we will optimize Pandas usage directly, and we will also see how to optimize it using Apache Arrow.

7.1 Memory and time optimization of data loading

 

7.2 Techniques to increase data analysis speed

 
 
 

7.2.1 Using indexing to accelerate access

 
 

7.2.2 Row iteration strategies

 
 
 

7.3 Pandas on top of NumPy, Cython and NumExpr

 
 

7.3.1 Explicit use of NumPy

 
 
 

7.3.2 Pandas on top of NumExpr

 
 

7.3.3 Cython and Pandas

 
 
 

7.4 Introducing Apache Arrow and reading data into Pandas with Arrow

 
 
 

7.4.1 The relationship between Pandas and Apache Arrow

 
 

7.4.2 Reading a CSV file

 
 

7.4.3 Doing analysis with Arrow

 

7.5 Using Arrow interop to delegate work to more efficient languages and systems

 
 
 
 

7.5.1 Implications of Arrow’s language interop architecture

 
 

7.5.2 Zero-copy operations on data with Arrow’s Plasma server

 
 
 
 

7.6 Summary

 
 
 