This chapter covers:
- Optimizing memory usage when creating Pandas dataframes
- Tips and tricks to decrease the computational cost of Pandas operations
- Using Cython, NumExpr, and NumPy to accelerate Pandas operations
- Understanding the relationship between Pandas and Apache Arrow
- Optimizing Pandas operations by using Apache Arrow replacements
- Using Apache Arrow to transfer dataframe operations to more efficient lower-level languages
Data analytics in Python means, in many cases, Pandas. Pandas is a dataframe library, that is, a library for processing tabular data under the assumption that the full table can be loaded into memory. It is the de facto standard in the Python world for processing in-memory tabular data. In this chapter we discuss approaches to optimizing Pandas usage. The approach is two-pronged: we will optimize Pandas usage directly, and we will also see how to optimize it using Apache Arrow.