Chapter 5. Algorithms for data analysis
This chapter covers
- Querying a stream
- Thinking about time
- Understanding four powerful summarization techniques
Chapter 4 covered how data flows through many stream-processing frameworks, along with their delivery semantics and fault tolerance. In this chapter we’re going to depart from the architectural view and discuss the algorithmic side of stream processing, often called streaming analytics or stream mining. We will focus on the what and why of streaming analysis algorithms and occasionally dip our toes into the detailed how. Don’t worry if you’re looking for the detailed math or code behind the algorithms; ample resources will be provided so that you can continue your learning.
Before we begin, I’ll talk about how we query a stream. In general, there are two types of queries that you may want to execute in a streaming system:
- Ad-hoc queries— These are queries asked one time about a stream. For example: What is the maximum value seen so far in the stream? This style of query is the same kind you would execute against an RDBMS.
- Continuous queries— These are queries that are, in essence, asked about the stream at all times. For example: Every five minutes, emit the maximum value seen so far in the stream, and generate an alert if it exceeds a given threshold.
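To make the difference concrete, here is a minimal Python sketch of both query styles over a stream of (timestamp in seconds, value) pairs. The names `ad_hoc_max`, `continuous_max`, and `Report`, the 300-second interval, and the threshold of 100 are all assumptions made for illustration; they do not come from any particular framework. The sketch also assumes the continuous query is driven by event timestamps rather than a wall-clock timer, so a result for an interval is emitted only once a later event arrives.

```python
from typing import Iterable, Iterator, NamedTuple, Optional, Tuple

Event = Tuple[float, float]  # (timestamp in seconds, value); hypothetical shape


class Report(NamedTuple):
    emitted_at: float  # interval boundary (event time) the result belongs to
    max_so_far: float  # maximum value seen strictly before that boundary
    alert: bool        # True when max_so_far exceeds the threshold


def ad_hoc_max(events_seen: Iterable[Event]) -> Optional[float]:
    """Ad-hoc query: asked once, answered from the events seen so far."""
    values = [value for _, value in events_seen]
    return max(values) if values else None


def continuous_max(
    events: Iterable[Event],
    interval: float = 300.0,   # assumed: five minutes, as in the example above
    threshold: float = 100.0,  # assumed alert threshold
) -> Iterator[Report]:
    """Continuous query: re-evaluated at every interval boundary."""
    max_so_far: Optional[float] = None
    next_emit: Optional[float] = None
    for timestamp, value in events:
        if next_emit is None:
            # The first event starts the clock for the first interval.
            next_emit = timestamp + interval
        # Emit one result for each boundary that passed before this event.
        while timestamp >= next_emit:
            if max_so_far is not None:
                yield Report(next_emit, max_so_far, max_so_far > threshold)
            next_emit += interval
        max_so_far = value if max_so_far is None else max(max_so_far, value)


if __name__ == "__main__":
    stream = [(0, 40), (120, 75), (310, 130), (650, 90)]
    print(ad_hoc_max(stream[:2]))  # one-time question about the stream so far
    for report in continuous_max(stream):
        print(report)              # a standing question, answered repeatedly
```

The ad-hoc function runs once and returns, whereas the continuous generator keeps state (`max_so_far`) for the life of the stream and produces a new answer at each boundary. A production system would more likely drive the emission from a timer or watermark so that results still appear when no new events arrive.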