8 Storing big data

 

This chapter covers

  • Getting to know fsspec, an abstraction library over filesystems
  • Storing heterogeneous columnar data efficiently with Parquet
  • Processing larger-than-memory data files with in-memory libraries like pandas and NumPy
  • Processing homogeneous multi-dimensional array data with Zarr

When dealing with big data, persistence is of paramount importance. We want to be able to access—to read and write—data as fast as possible, preferably from many parallel processes. We also want persistent representations that are compact because storing large amounts of data can be expensive.

In this chapter, we will consider several approaches to making the persistent storage of data more efficient. We will start with a short discussion of fsspec, a library that abstracts access to filesystems, both local and remote. While fsspec does not itself solve performance problems, it is a modern library used by many applications to deal with storage systems, and it appears repeatedly in efficient storage implementations.
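
To give a flavor of the interface before we dig in, here is a minimal sketch of reading a local file and listing a directory through fsspec. The path is hypothetical; with the matching backend package installed, the same calls work with remote URLs such as s3:// or github://.

import fsspec

# Open a file through fsspec's unified interface; fsspec.open returns
# an OpenFile object that can be used as a context manager.
with fsspec.open("data/example.csv", "rt") as f:  # hypothetical path
    header = f.readline()

# Obtain a filesystem object directly and use familiar operations on it
fs = fsspec.filesystem("file")
print(fs.ls("."))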

We will then consider Parquet, a file format for persisting heterogeneous columnar datasets. Parquet is supported in Python via the Apache Arrow project, which was introduced in the previous chapter.
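
As a first taste, here is a minimal sketch that uses PyArrow's pyarrow.parquet module to write a small heterogeneous table to a Parquet file and read it back. The column names and file name are illustrative only.

import pyarrow as pa
import pyarrow.parquet as pq

# Build a small in-memory Arrow table with columns of different types
table = pa.table({
    "name": ["Anna", "Bob", "Carol"],
    "age": [34, 29, 41],
})

# Persist the table as Parquet and read it back
pq.write_table(table, "people.parquet")  # illustrative file name
round_trip = pq.read_table("people.parquet")
print(round_trip.to_pandas())

After Parquet, we will turn to traditional techniques for dealing with datasets that are larger than memory, namely memory mapping files with NumPy and reading and writing data frames in chunks. We will close the chapter with Zarr, a format for persisting large homogeneous multi-dimensional arrays.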

8.1 A unified interface for file access: fsspec

8.1.1 Using fsspec to search for files in a GitHub repo

8.1.2 Using fsspec to inspect zip files

8.1.3 Accessing files using fsspec

8.1.4 Using URL chaining to traverse different filesystems transparently

8.1.5 Replacing filesystem backends

8.1.6 Interfacing with PyArrow

8.2 Parquet: An efficient format to store columnar data

8.2.1 Inspecting Parquet metadata

8.2.2 Column encoding with Parquet

8.2.3 Partitioning with datasets

8.3 Dealing with larger-than-memory datasets the old-fashioned way

8.3.1 Memory mapping files with NumPy

8.3.2 Chunk reading and writing of data frames

8.4 Zarr for large-array persistence