chapter seven

7 Sampling from data streams

This chapter covers:

How to sample from an infinite landmark stream – Bernoulli Sampling, Reservoir Sampling and Biased Reservoir Sampling
How to incorporate recency by using sliding window and how to sample from it – Chain Sampling and Priority Sampling
Showcasing the difference between representative and biased sampling strategy on a landmark stream with sudden shift - Reservoir Sampling vs. Biased Reservoir Sampling

With section 6.3 under our belt, we are ready to fully appreciate sampling as a single task staged in the Analysis Tier. Although we showed that this division of the streaming data architecture is not so clear-cut, we will imagine the stream processor sampling the incoming stream in this tier. This will help to introduce the sampling algorithm without any additional complexity coming from de-duplication, merging or generally preprocessing of the data. In our fingerprint-rate example, the incoming requests would first go through IP de-duplication and then appear in front of the stream processor that would materialize a representative sample. The current state of the sample is then used to answer a continuous or an ad-hoc query approximately, but quickly. We will use our IP sampling use-case to illustrate each algorithm.

7.1 Sampling from a landmark stream

7.1.1 Bernoulli sampling

7.1.2 Reservoir sampling

7.1.3 Biased reservoir sampling

7.2 Sampling from a sliding window

7.2.1 Chain Sampling

7.2.2 Priority Sampling

7.3 Sampling algorithms comparison

7.3.1 Simulation set up, algorithms and data

7.4 Summary