Part 3. Applications and Libraries for Modern Data Processing

 

Part 3 of the book applies most directly to data problems, as it covers widely used Python data analysis libraries. We first discuss the ubiquitous pandas library for processing data frames. We also look at Apache Arrow, a modern library that can, among other tasks, help speed up pandas processing. We then discuss libraries designed to extract maximum performance from persistent storage: Zarr for N-dimensional arrays and Parquet for data frames. We also introduce the topic of dealing with larger-than-memory datasets.