5 Cardinality Estimation and HyperLogLog
This chapter covers:
- Why cardinality estimation is important and the challenges that arise when measuring cardinality on large data
- Practical use cases where space-efficient cardinality estimation algorithms are used
- Teaching the incremental development of ideas leading up to and including HyperLogLog, such as probabilistic counting and LogLog
- How HyperLogLog works, its space and error requirements and where it is used
- Demonstrating how different cardinality estimates behave on large data using a simulation via an experiment
- Insights into practical implementations of HyperLogLog
Determining cardinality of a multiset (a set with duplicates) is a common problem cropping up in all areas of software development, and especially applications involving databases, network traffic, etc. However, since the expansion of internet services, where billions of clicks, searches and purchases are performed daily by a much smaller number of distinct users, there is a renewed interest in this fundamental problem. Specifically, there is a great interest in developing algorithms and data structures that can estimate cardinality of a multiset in one scan of data and in the amount of space substantially smaller than the number of distinct elements.