8 Approximate Quantiles on Data Streams
This chapter covers
- Reviewing the concept of exact quantiles and understanding constraints imposed by streaming data context
- Understanding different types of errors for approximate quantiles
- Applying T-digest and Q-digest algorithm to a data stream
- Comparing T-digest and Q-digest on a realistic web-site-time-spent data
Different algorithms presented in the previous chapter allow us to select an (un)biased sample of from all data-tuples that arrived up to the current moment. In a way, a sample is a very flexible data-sketch: you form it once and you can then use it to claim that its mean, or any other feature is a good estimate of that same feature of the whole data that came from the stream so far. Now, remember why we moved from Bernoulli sampling to sampling procedures that keep a sample of fixed size? Yes, good job, it is because Bernoulli sample grows with the number of elements seen and in combination with streaming data, this is just not practical.