chapter four

4 Frequency estimation and count-min sketch

This chapter covers

Exploring practical use cases where frequency estimates arise and how count-min sketch can help
Learning how count-min sketch works
Exploring use cases of a sensor and an NLP app
Learning about the error versus space tradeoff in count-min sketch
Understanding dyadic ranges and how range queries can be solved with count-min sketch

Popularity analysis, such as producing a bestseller list on an e-commerce site, computing top-k trending queries on a search engine, or reporting frequent source-destination IP address pairs on a network, is a common problem in today’s data-intensive applications. Anomaly detection (i.e., monitoring changes in systems that are awake 24/7, such as sensor networks or surveillance cameras) falls under the same algorithmic umbrella as measuring popularity. Anomaly detection is often observed through a sudden spike in the value of a certain parameter, such as the temperature or location change in a sensor, an object appearance in the frame, or the number of units by which a company’s stock rose or fell in a given time interval.

4.1 Majority element

4.1.1 General heavy hitters

4.2 Count-min sketch: How it works

4.2.1 Update

4.2.2 Estimate

4.3 Use cases

4.3.1 Top-k restless sleepers

4.3.2 Scaling the distributional similarity of words

4.4 Error vs. space in count-min sketch

4.5 A simple implementation of count-min sketch

4.5.1 Exercises

4.5.2 Intuition behind the formula: Math bit

4.6 Range queries with count-min sketch

4.6.1 Dyadic intervals