chapter seven

7 Sampling from data streams

This chapter covers

Sampling from an infinite landmark stream
Incorporating recency with a sliding window and sampling from it
Showcasing the difference between a representative and a biased sampling strategy on a landmark stream with a sudden shift
Exploring R and Python packages and libraries for writing and executing tasks on data streams

We are now ready to treat sampling as a single task staged in the analysis tier. Although we have already shown that the division of the streaming data architecture into tiers is not so clear-cut, we will imagine the stream processor sampling the incoming stream in this tier. Doing so lets us introduce each sampling algorithm without the added complexity of deduplication, merging, or general preprocessing of the data. In our fingerprint-rate example, incoming requests first go through IP deduplication and then reach the stream processor, which materializes a representative sample. The current state of the sample is then used to answer a continuous or an ad hoc query approximately but quickly. We will use our IP sampling use case to illustrate each algorithm.
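To make the flow just described concrete, here is a minimal Python sketch of the three steps: deduplicate, sample, then answer a query from the sample. It is an illustration under stated assumptions, not the implementation developed in this chapter. The names deduplicate, ReservoirSampler, offer, and estimate_fraction are hypothetical, the IP stream is synthetic, and the query (the share of IPs whose third octet is below 20) is invented for the example. A plain reservoir sampler stands in for the representative sampler, and an exact in-memory set stands in for whatever deduplication structure the pipeline actually uses.

```python
import random
from typing import Callable, Iterable, Iterator, List, Optional


def deduplicate(ips: Iterable[str]) -> Iterator[str]:
    """Pass an IP downstream only the first time it appears.

    Assumption: an exact set is used for simplicity; its memory grows
    with the number of distinct IPs, so it only suits a small example.
    """
    seen = set()
    for ip in ips:
        if ip not in seen:
            seen.add(ip)
            yield ip


class ReservoirSampler:
    """Keeps a uniform random sample of at most k items seen so far."""

    def __init__(self, k: int, seed: Optional[int] = None) -> None:
        self.k = k
        self.sample: List[str] = []
        self.n_seen = 0
        self._rng = random.Random(seed)

    def offer(self, item: str) -> None:
        """Process one stream element."""
        self.n_seen += 1
        if len(self.sample) < self.k:
            # The reservoir is not full yet: always keep the item.
            self.sample.append(item)
        else:
            # Keep the item with probability k / n_seen by drawing a
            # slot in [0, n_seen) and replacing only if it is in range.
            j = self._rng.randrange(self.n_seen)
            if j < self.k:
                self.sample[j] = item

    def estimate_fraction(self, predicate: Callable[[str], bool]) -> float:
        """Approximate the share of stream items satisfying predicate."""
        if not self.sample:
            return 0.0
        hits = sum(1 for item in self.sample if predicate(item))
        return hits / len(self.sample)


# Synthetic stream (assumption): 10,000 requests from 200 distinct IPs.
stream = [f"192.168.{i % 40}.{i % 25}" for i in range(10_000)]

sampler = ReservoirSampler(k=20, seed=7)
for ip in deduplicate(stream):
    sampler.offer(ip)

# Ad hoc query answered from the current state of the sample only.
estimate = sampler.estimate_fraction(lambda ip: int(ip.split(".")[2]) < 20)
print(f"distinct IPs seen: {sampler.n_seen}, sample size: {len(sampler.sample)}")
print(f"estimated share with third octet < 20: {estimate:.2f}")
```

The query touches only the 20 sampled IPs, which is why the answer is fast but approximate. In this synthetic stream the exact share is 0.5, and the estimate varies around that value depending on the seed.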

7.1 Sampling from a landmark stream

7.1.1 Bernoulli sampling

7.1.2 Reservoir sampling

7.1.3 Biased reservoir sampling

7.2 Sampling from a sliding window

7.2.1 Chain sampling

7.2.2 Priority sampling

7.3 Sampling algorithms comparison

7.3.1 Simulation setup: Algorithms and data

Summary