7 Sampling from data streams
This chapter covers:
- How to sample from an infinite landmark stream – Bernoulli Sampling, Reservoir Sampling and Biased Reservoir Sampling
- How to incorporate recency by using sliding window and how to sample from it – Chain Sampling and Priority Sampling
- Showcasing the difference between representative and biased sampling strategy on a landmark stream with sudden shift - Reservoir Sampling vs. Biased Reservoir Sampling
With section 6.3 under our belt, we are ready to fully appreciate sampling as a single task staged in the Analysis Tier. Although we showed that this division of the streaming data architecture is not so clear-cut, we will imagine the stream processor sampling the incoming stream in this tier. This will help to introduce the sampling algorithm without any additional complexity coming from de-duplication, merging or generally preprocessing of the data. In our fingerprint-rate example, the incoming requests would first go through IP de-duplication and then appear in front of the stream processor that would materialize a representative sample. The current state of the sample is then used to answer a continuous or an ad-hoc query approximately, but quickly. We will use our IP sampling use-case to illustrate each algorithm.