8 Storing big data

This chapter covers:

  • Getting to know fsspec, an abstraction library over file systems
  • Storing heterogeneous columnar data efficiently with Parquet
  • Processing data files that are larger than memory with traditional in-memory libraries such as NumPy and pandas
  • Processing homogeneous multidimensional array data efficiently with Zarr

When dealing with big data, persistence is of paramount importance. We want to read and write data as fast as possible, preferably from many parallel processes. We also want persistent representations that are compact, since storing large amounts of data can be expensive.

In this chapter we consider several approaches to storing data persistently and more efficiently. We start with a short discussion of fsspec, a library that abstracts access to file systems, whether local or remote. While fsspec is not directly a performance tool, it is a modern library that many applications use to deal with storage systems, and it appears repeatedly in efficient storage implementations.
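
To make the idea concrete, here is a minimal sketch of the unified interface fsspec exposes; the local path and repository coordinates are placeholders, and the GitHub backend assumes the requests package is installed alongside fsspec.

import fsspec

# Open a local file through fsspec's generic open() interface
with fsspec.open("data/example.csv", "rt") as f:
    print(f.readline())

# The same interface works for remote backends,
# such as browsing a GitHub repository as a file system
fs = fsspec.filesystem("github", org="pandas-dev", repo="pandas")
print(fs.ls(""))  # list the entries at the root of the repository

The point is that the calling code stays the same whichever backend is in use; only the file system specification changes.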

We then introduce Parquet, a file format for persisting heterogeneous columnar datasets. Parquet is supported in Python via the Apache Arrow project, which we introduced in the previous chapter. Finally, we turn to techniques for dealing with datasets larger than memory, from old-fashioned approaches such as NumPy memory mapping and chunked reading of data frames to Zarr, a format for homogeneous multidimensional array data.
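
As a first taste of Parquet, the sketch below writes a tiny Arrow table to disk and reads back a single column; the table contents and file name are invented for illustration.

import pyarrow as pa
import pyarrow.parquet as pq

# Build a small Arrow table with heterogeneous columns
table = pa.table({
    "city": ["Lisbon", "Porto"],
    "population": [545_000, 232_000],
})

# Persist the table in the Parquet format
pq.write_table(table, "cities.parquet")

# Because Parquet is columnar, we can read back only the columns we need
subset = pq.read_table("cities.parquet", columns=["population"])
print(subset)

Reading a subset of columns this way touches only the corresponding column chunks on disk, which is one of the main reasons columnar formats are attractive for analytical workloads.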

8.1 A unified interface for file access: fsspec

8.1.1 Looking at a GitHub repository as a file system

8.1.2 Using fsspec to inspect Zip files

8.1.3 Accessing files using fsspec

8.1.4 Using URL chaining to traverse different file systems transparently

8.1.5 Replacing file system backends

8.1.6 Interfacing with PyArrow

8.2 An efficient format to store columnar data: Parquet

8.2.1 Inspecting Parquet metadata

8.2.2 Column encoding with Parquet

8.2.3 Partitioning with datasets

8.3 Dealing with larger-than-memory datasets the old-fashioned way

8.3.1 Memory mapping files with NumPy

8.3.2 Chunk reading and writing of data frames