4 Bloom Filters:
Reducing the Memory Needed to Keep Track of Content

This chapter covers

Describing and analyzing Bloom filters
Solving a problem: keeping track of large documents using little memory
Showing why dictionaries are an imperfect solution
Reducing the memory footprint by using Bloom filters
Recognizing use cases where Bloom filters improve performance
Using metrics to tune the quality of Bloom filters’ solution

Starting with this chapter, we’ll review less common data structures that solve, strange as it might seem, common problems. Bloom filters are one of the most prominent examples: they are widely used across industries, yet not as widely known as you would expect for such a cornerstone.

In section 4.1, we introduce the problem that will be our North Star in this chapter: we need to keep track of large entities with the smallest memory footprint possible.

In section 4.2, we continue by discussing a few increasingly complex solutions and showing their strengths and weaknesses. The weaknesses, in particular, should be seen as opportunities for improvement, fertile ground for algorithm designers.

As part of this discussion, we introduce the dictionary, an abstract data type that we discuss in depth in section 4.3, while section 4.4 switches to the concrete data structures that implement dictionaries, including hash tables, binary search trees, and Bloom filters.

4.1 The Dictionary Problem: Keeping Track of Things

4.2 Alternatives to Implement a Dictionary

4.3 Describing the Data Structure API: Associative Array

4.4 Concrete Data Structures

4.4.1 Unsorted Array: Fast Insertion, Slow Search

4.4.2 Sorted Arrays and Binary Search: Slow Insertion, Fast(-ish) Search

4.4.3 Hash Table: Constant-Time on Average, Unless You Need Ordering

4.4.4 Binary Search Tree: Every Operation is Logarithmic

4.4.5 Bloom Filter: As Fast as Hash Tables, But Saving Memory (With a Catch)

4.5 Under the Hood - How Do Bloom Filters Work

4.6 Implementation

4.6.1 Using a Bloom Filter

4.6.2 Read and Write Bits

4.6.3 Find Where a Key is Stored

4.6.4 Generate Hash Functions

4.6.5 Constructor

4.6.6 Checking a Key

4.6.7 Storing a Key

4.6.8 Estimating Accuracy

4.7 Applications
