5 Cardinality Estimation and HyperLogLog

 

This chapter covers:

  • Why cardinality estimation is important and the challenges that arise when measuring cardinality on large data
  • Practical use cases where space-efficient cardinality estimation algorithms are used
  • The incremental development of ideas leading up to and including HyperLogLog, such as probabilistic counting and LogLog
  • How HyperLogLog works, its space and error requirements, and where it is used
  • How different cardinality estimators behave on large data, demonstrated with a simulation experiment
  • Insights into practical implementations of HyperLogLog

Determining the cardinality of a multiset (a set that allows duplicates) is a common problem that crops up in all areas of software development, especially in applications involving databases and network traffic. However, since the expansion of internet services, where billions of clicks, searches, and purchases are performed daily by a much smaller number of distinct users, there has been renewed interest in this fundamental problem. In particular, there is great interest in developing algorithms and data structures that can estimate the cardinality of a multiset in a single scan of the data, using an amount of space substantially smaller than the number of distinct elements.
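To make the problem concrete, here is the straightforward exact approach that the rest of the chapter improves upon: keep every distinct element in a hash set and report the set's size. The snippet below is a minimal Python sketch written for illustration only (the stream of user IDs is made up); its memory footprint grows linearly with the number of distinct elements, which is precisely what a one-scan, small-space estimator must avoid.

    import sys

    def exact_distinct_count(stream):
        # Exact counting: every distinct element ever seen must be remembered.
        seen = set()
        for item in stream:
            seen.add(item)
        return len(seen), seen

    # Hypothetical workload: 10 million events produced by 1 million distinct users.
    events = (f"user-{i % 1_000_000}" for i in range(10_000_000))
    count, seen = exact_distinct_count(events)
    print(count)                # 1000000 distinct users
    print(sys.getsizeof(seen))  # roughly tens of megabytes for the hash table alone,
                                # not counting the user-ID strings it references

The algorithms developed in this chapter replace the set above with a fixed-size summary of a few kilobytes that is updated in a single pass over the data and still answers the distinct-count question to within a small relative error.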

5.1   Counting distinct items in databases

5.2   HyperLogLog incremental design

5.2.1   The first cut --- probabilistic counting

5.2.2   Stochastic averaging or, when life gives you lemons…

5.2.3   LogLog

5.2.4   HyperLogLog --- Stochastic averaging with harmonic mean

5.3   Use case: catching worms with HLL

5.4   But how does it actually work? A mini experiment

5.4.1   The effect of the number of buckets (m)

5.5   Use case: Aggregation using HyperLogLog

5.6   Summary