8 Approximate quantiles on data streams

This chapter covers

Reviewing the concept of exact quantiles and understanding constraints imposed by streaming data context
Understanding different types of errors for approximate quantiles
Applying t-digest and q-digest algorithms to a data stream
Comparing t-digest and q-digest on realistic data on length of visits to a website

Different algorithms presented in the previous chapter allow us to select an (un)biased sample from all data-tuples that have arrived up to the current moment. In a way, a sample is a very flexible datasketch: you form it once, and you can then use it to claim that its mean, or any other feature, is a good estimate of that same feature of all the data from the stream so far. Remember why we moved from Bernoulli sampling to sampling procedures that keep a sample of fixed size: because the Bernoulli sample grows with the number of elements seen and in combination with streaming data, it is just not practical.

8.1 Exact quantiles

8.2 Approximate quantiles

8.2.1 Additive error

8.2.2 Relative error

8.2.3 Relative error in the data domain

8.3 T-digest: How it works

8.3.1 Digest

8.3.2 Scale functions

8.3.3 Merging t-digests

8.3.4 Space bounds for t-digest

8.4 Q-digest

8.4.1 Constructing a q-digest from scratch

8.4.2 Merging q-digests

8.4.3 Error and space considerations in q-digests

8.4.4 Quantile queries with q-digests