This chapter covers
- Reviewing the concept of exact quantiles and understanding constraints imposed by streaming data context
- Understanding different types of errors for approximate quantiles
- Applying t-digest and q-digest algorithms to a data stream
- Comparing t-digest and q-digest on realistic data on length of visits to a website
Different algorithms presented in the previous chapter allow us to select an (un)biased sample from all data-tuples that have arrived up to the current moment. In a way, a sample is a very flexible datasketch: you form it once, and you can then use it to claim that its mean, or any other feature, is a good estimate of that same feature of all the data from the stream so far. Remember why we moved from Bernoulli sampling to sampling procedures that keep a sample of fixed size: because the Bernoulli sample grows with the number of elements seen and in combination with streaming data, it is just not practical.