Part 3: Data structures for databases and external memory algorithms

While parts 1 and 2 were concerned with squeezing and sampling data to make it fit into RAM, we can now finally breathe a sigh of relief: our data, all of it, is comfortably resting on disk. In the three chapters of part 3, we will learn how to design effective algorithms and data structures for large datasets sitting on disk. This includes understanding how retrieval, insertion, and deletion work in different kinds of databases and how to efficiently sort large files on disk. We will also delve into the differences in index design between read-optimized and write-optimized databases. The first step in doing all of this will be understanding how the I/O cost (i.e., the cost of transferring one block of data from disk to main memory) dominates the cost of a CPU operation by three or more orders of magnitude. Thus, the lens through which we will evaluate algorithm efficiency will blur everything happening in RAM and zoom in on the data transferred between disk and RAM. Learning how to do Big-O analysis from the perspective of disk transfers will be one of the major takeaways of part 3.
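
To make this disk-transfer lens concrete, here is a minimal Python sketch of cost accounting in the standard external memory model, where N is the number of records, B the number of records that fit in one block, and M the number of records that fit in RAM. It uses the classic bounds for scanning, Theta(N/B) transfers, and for external merge sort, roughly (N/B) * log_{M/B}(N/B) transfers. The function names and the sizes plugged in at the bottom are hypothetical, chosen only for illustration.

import math

def scan_ios(n, b):
    # Reading n records sequentially, b records per block, costs
    # ceil(n / b) block transfers -- the scanning bound Theta(N/B).
    return math.ceil(n / b)

def sort_ios(n, b, m):
    # External merge sort: each pass reads and writes every block once,
    # and the merge fan-in is the number of block-sized buffers that
    # fit in RAM (M/B), giving roughly (N/B) * log_{M/B}(N/B) transfers.
    blocks = math.ceil(n / b)
    fan_in = m // b
    passes = max(1, math.ceil(math.log(blocks, fan_in)))
    return blocks * passes

# Hypothetical sizes: one billion records, 1,000 records per block,
# and RAM holding ten million records.
n, b, m = 10**9, 1_000, 10**7
print(scan_ios(n, b))     # 1000000 transfers: one full scan of the data
print(sort_ios(n, b, m))  # 2000000 transfers: two passes over the data

Notice what this model deliberately ignores: every comparison performed in RAM is treated as free, and only the count of block transfers survives. That is precisely the blur-everything-in-RAM, zoom-in-on-disk perspective described above.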