chapter six

6 Streaming data: Bringing everything together

This chapter covers

Learning about the streaming data pipeline model and its distributed framework
Determining where streaming data applications and the data stream model meet
Identifying where algorithms and data structures fit in data streams
Setting up basic computing constraints and concepts inherent to data streams
Providing some probabilistic background for the next two chapters

Previous chapters introduced a number of algorithms and data structures for sketching an important characteristic of huge amounts of data, whether that data resides in a database or, as you saw in the application of HyperLogLog to network traffic surveillance, arrives and expires at a lightning rate. In this chapter, we will round up these algorithms.

6.1 Streaming data system: A meta example

6.1.1 Bloom-join

6.1.2 Deduplication

6.1.3 Load balancing and tracking the network traffic

6.2 Practical constraints and concepts in data streams

6.2.1 In real time

6.2.2 Small time and small space

6.2.3 Concept shifts and concept drifts

6.2.4 Sliding window model

6.3 Math bit: Sampling and estimation

6.3.1 Biased sampling strategy

6.3.2 Estimation from a representative sample

Summary