4 Frequency Estimation and Count-Min Sketch

This chapter covers:

Understanding the streaming model and its constraints
Exploring practical use cases where frequency estimates arise and how count-min sketch can help
Learning how count-min sketch works
Exploring the error in count-min sketch and configuring the data structure
Understanding how range queries can be solved with count-min sketch
Exploring heavy hitters and approximate heavy hitters with count-min sketch

Measuring frequency is one of the most common operations in today’s data-intensive applications. Any kind of popularity analysis on a massive-data application, such as producing the bestseller list on Amazon.com, computing top-k trending queries on Google, or monitoring the most frequent source-destination IP address pairs on the network are all frequency estimation problems. Estimating frequency also shows up when monitoring changes in systems that are awake 24/7, such as sensor networks or surveillance cameras. Here we can observe changes in parameters such as the temperature or location change of a sensor in the ocean, new object appearance in the frame, or the number of units by which a stock on the stock market rose or fell in a given time interval. For this purpose, in the next section, we introduce the streaming model of data that emphasizes challenges related to this particular setup.

4.1 Streaming data

4.2 Count-min sketch: how it works

4.2.1 Update

4.2.2 Estimate

4.2.3 Space and error in count-min sketch

4.3 Use cases

4.3.1 Top-k restless sleepers

4.3.2 Scaling distributional similarity of words

4.4 Range queries with count-min sketch

4.5 Approximate heavy hitters

4.5.1 Majority element

4.5.2 General heavy hitters

4.6 Summary