This chapter covers
- Introducing computer limitations that affect the design of data-intensive applications
- Describing the external memory model (the DAM model)
- Building simple scanning, searching, and merging algorithms in external memory
- Reviewing use cases where data scientists and programmers work with huge files
- Using Big-O notation to measure the I/O efficiency of these algorithms
This chapter introduces fundamental ideas that form the basis of part 3 of the book. We begin with external memory algorithms and the external memory model [1]. This model teaches us how to reason about the efficiency of algorithms and data structures when working with large datasets stored on disk.
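To make the model concrete, here is a minimal sketch (in Python, not from the book) of how cost is counted in the external memory model: only block transfers between disk and RAM matter, so scanning a file of N bytes with block size B costs roughly ⌈N/B⌉ I/Os, regardless of the computation performed on each block. The block size of 4,096 bytes is an illustrative assumption.

```python
BLOCK_SIZE = 4096  # assumed block size B, in bytes

def scan_count_lines(path):
    """Scan a file block by block, counting block transfers.

    In the DAM model only the transfers count: a full scan of an
    N-byte file costs ceil(N / B) I/Os, while the CPU work done
    on each block in RAM is treated as free.
    """
    io_count = 0
    newlines = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)  # one block transfer
            if not block:               # end of file
                break
            io_count += 1
            newlines += block.count(b"\n")
    return newlines, io_count
```

For a 1 GiB file and 4 KiB blocks, this scan costs 262,144 block transfers whether the file holds ten lines or ten million; the model deliberately charges nothing for the in-RAM counting.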
Most applications maintain data on some type of local or remote storage, files and databases being prominent examples. Storage offers the flexibility of capturing large amounts of data persistently and cheaply. Even when a system benefits from data summaries that can satisfy queries quickly from RAM, we still want to preserve the original data on slower, larger storage. As we saw with Bloom filters and Google’s WebTable, when the query returns Present, we make a trip to disk either to fetch the (key, value) pair and its metadata or to establish that we have a false positive.
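A minimal sketch of that lookup path, assuming a hypothetical `bloom` object that supports membership tests with no false negatives and a hypothetical `store.get(key)` that performs the actual disk read:

```python
def lookup(key, bloom, store):
    # bloom: hypothetical in-RAM Bloom filter; `key not in bloom`
    # means the key is definitely absent (no false negatives).
    # store: hypothetical on-disk (key, value) store; store.get(key)
    # is the expensive disk access and returns None when absent.
    if key not in bloom:
        return None               # answered from RAM, no disk I/O
    value = store.get(key)        # one trip to disk
    if value is None:
        return None               # the filter gave a false positive
    return value
```

The point of the filter is that the cheap in-RAM test screens out most absent keys, so the expensive disk access happens only for keys that are present, or for the occasional false positive.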