4 Frequency Estimation and Count-Min Sketch

 

This chapter covers:

  • Understanding the streaming model and its constraints
  • Exploring practical use cases where frequency estimates arise and how count-min sketch can help
  • Learning how count-min sketch works
  • Exploring the error in count-min sketch and configuring the data structure
  • Understanding how range queries can be solved with count-min sketch
  • Exploring heavy hitters and approximate heavy hitters with count-min sketch

Measuring frequency is one of the most common operations in today’s data-intensive applications. Any kind of popularity analysis on a massive-data application, such as producing the bestseller list on Amazon.com, computing top-k  trending queries on Google, or monitoring the most frequent source-destination IP address pairs on the network are all frequency estimation problems. Estimating frequency also shows up when monitoring changes in systems that are awake 24/7, such as sensor networks or surveillance cameras. Here we can observe changes in parameters such as the temperature or location change of a sensor in the ocean, new object appearance in the frame, or the number of units by which a stock on the stock market rose or fell in a given time interval. For this purpose, in the next section, we introduce the streaming model of data that emphasizes challenges related to this particular setup.

4.1   Streaming data

4.2   Count-min sketch: how it works

4.2.1   Update

4.2.2   Estimate

4.2.3   Space and error in count-min sketch

4.3   Use cases

4.3.1   Top-k restless sleepers

4.3.2   Scaling distributional similarity of words

4.4   Range queries with count-min sketch

4.5   Approximate heavy hitters

4.5.1   Majority element

4.5.2   General heavy hitters

4.6   Summary