8 Approximate Quantiles on Data Streams

This chapter covers

Reviewing the concept of exact quantiles and understanding constraints imposed by streaming data context
Understanding different types of errors for approximate quantiles
Applying T-digest and Q-digest algorithm to a data stream
Comparing T-digest and Q-digest on a realistic web-site-time-spent data

Different algorithms presented in the previous chapter allow us to select an (un)biased sample of from all data-tuples that arrived up to the current moment. In a way, a sample is a very flexible data-sketch: you form it once and you can then use it to claim that its mean, or any other feature is a good estimate of that same feature of the whole data that came from the stream so far. Now, remember why we moved from Bernoulli sampling to sampling procedures that keep a sample of fixed size? Yes, good job, it is because Bernoulli sample grows with the number of elements seen and in combination with streaming data, this is just not practical.

8.1 Exact quantiles

8.2 Approximate Quantiles

8.2.1 Additive error

8.2.2 Relative error

8.2.3 Relative error in data domain

8.3 T-digest - how it works

8.3.1 Digest

8.3.2 Scale functions

8.3.3 Merging T-Digests

8.3.4 Space bounds for T-digest

8.4 Q-Digest

8.4.1 Constructing a Q-digest from scratch

8.4.2 Merging Q-digests

8.4.3 Error and space considerations in Q-digest

8.4.4 Quantile queries with Q-digest

8.5 Simulation code and results

8.6 Summary