chapter five

5 Cardinality estimation and HyperLogLog

This chapter covers

Practical use cases where space-efficient cardinality estimation algorithms are used
Teaching the incremental development of ideas leading up to and including HyperLogLog, such as probabilistic counting and LogLog
How HyperLogLog works, its space and error requirements, and where it is used
How different cardinality estimates behave on large data using a simulation via an experiment
Insights into practical implementations of HyperLogLog

Determining the cardinality of a multiset (a set with duplicates) is a common problem cropping up in all areas of software development, and especially in applications involving databases, network traffic, and so on. However, since the expansion of internet services, where billions of clicks, searches, and purchases are performed daily by a much smaller number of distinct users, there is renewed interest in this fundamental problem. Specifically, there is great interest in developing algorithms and data structures that can estimate the cardinality of a multiset in one scan of data and in an amount of space substantially smaller than the number of distinct elements.

5.1 Counting distinct items in databases

5.2 HyperLogLog incremental design

5.2.1 The first cut: Probabilistic counting

5.2.2 Stochastic averaging, or “when life gives you lemons”

5.2.3 LogLog

5.2.4 HyperLogLog: Stochastic averaging with harmonic mean

5.3 Use case: Catching worms with HLL

5.4 But how does it work? A mini experiment

5.4.1 The effect of the number of buckets (m)

5.5 Use case: Aggregation using HyperLogLog

Summary