This chapter covers
- Sampling from an infinite landmark stream
- Incorporating recency with a sliding window and sampling from it
- Comparing a representative and a biased sampling strategy on a landmark stream with a sudden shift
- Exploring R and Python packages and libraries for writing and executing tasks on data streams
We are now ready to treat sampling as a single task staged in the analysis tier. Although we have already shown that the division of the streaming data architecture into tiers is not so clear-cut, we will imagine that the stream processor samples the incoming stream in this tier. This lets us introduce each sampling algorithm without the added complexity of deduplication, merging, or general preprocessing of the data. In our fingerprint-rate example, incoming requests first go through IP deduplication and then reach the stream processor, which materializes a representative sample. The current state of the sample is then used to answer a continuous or an ad hoc query approximately but quickly. We will use our IP sampling use case to illustrate each algorithm.
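To make this flow concrete before the algorithms are introduced, here is a minimal Python sketch of the three steps described above: deduplicate, sample, then query the current sample. It assumes reservoir sampling (Algorithm R) as the sampler, which is one common way to keep a uniform fixed-size sample of an unbounded stream. The names `deduplicate`, `ReservoirSample`, `offer`, and `estimate_fraction`, the exact in-memory set used for deduplication, and the example addresses are all illustrative assumptions rather than part of the fingerprint-rate example itself.

```python
import random
from typing import Callable, Iterable, Iterator, List, Optional


def deduplicate(ips: Iterable[str]) -> Iterator[str]:
    """Yield each IP only the first time it appears.

    Assumption: an exact in-memory set stands in for the deduplication step.
    """
    seen = set()
    for ip in ips:
        if ip not in seen:
            seen.add(ip)
            yield ip


class ReservoirSample:
    """Uniform sample of at most k items from a stream (Algorithm R)."""

    def __init__(self, k: int, seed: Optional[int] = None) -> None:
        if k <= 0:
            raise ValueError("k must be positive")
        self.k = k
        self.n = 0  # number of items offered so far
        self.items: List[str] = []
        self._rng = random.Random(seed)

    def offer(self, item: str) -> None:
        """Process one stream element and update the sample in place."""
        self.n += 1
        if len(self.items) < self.k:
            # The reservoir is not full yet, so keep every item.
            self.items.append(item)
        else:
            # Keep the n-th item with probability k / n by drawing a slot
            # uniformly from [0, n) and replacing only if it falls in [0, k).
            j = self._rng.randrange(self.n)
            if j < self.k:
                self.items[j] = item

    def estimate_fraction(self, predicate: Callable[[str], bool]) -> float:
        """Approximate the share of stream items that satisfy the predicate."""
        if not self.items:
            return 0.0
        return sum(1 for x in self.items if predicate(x)) / len(self.items)


# Hypothetical request stream; the addresses are made up for illustration.
requests = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "192.168.1.7", "10.0.0.3"]

sample = ReservoirSample(k=3, seed=42)
for ip in deduplicate(requests):
    sample.offer(ip)

# Ad hoc query answered from the current state of the sample.
share = sample.estimate_fraction(lambda ip: ip.startswith("10."))
```

The query reads only the sample, never the full stream, which is why the answer is approximate but quick. A fixed `seed` is used here only to make repeated runs reproducible.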