Chapter 6. Ingesting data with Spark Streaming


This chapter covers

  • Using discretized streams
  • Saving computation state over time
  • Using window operations
  • Reading from and writing to Kafka
  • Obtaining good performance

Real-time data ingestion is becoming increasingly important in today’s fast-paced, interconnected world. Much of the talk today is about the Internet of Things: a world of everyday devices that continually stream data to the internet and to each other, making our lives easier (in theory, at least). But even without those micro-devices flooding our networks with data, many companies need to receive data in real time, learn from it, and act on it immediately. After all, time is money, as they say.

It isn’t hard to think of professional fields that might (and do) profit from real-time data analysis: traffic monitoring, online advertising, stock-market trading, the unavoidable social networks, and so on. Many of these use cases need scalable, fault-tolerant systems for ingesting data, and Spark provides all of those features. Beyond enabling scalable analysis of high-throughput data, Spark is also a unifying platform: you can use the same APIs from streaming and batch programs, which lets you build both the speed and batch layers of the lambda architecture. (The name and design of the lambda architecture come from Nathan Marz; see his book Big Data [Manning, 2015].)
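To make the “unifying platform” point concrete, here is a minimal sketch of a Spark Streaming word count. The host, port, application name, and batch interval are placeholder choices for illustration; it assumes a plain-text server is listening on localhost:9999 (for example, one started with nc -lk 9999). Notice that the transformations (flatMap, map, reduceByKey) have the same shape you would write against a batch RDD.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the receiver, one for processing
    val conf = new SparkConf()
      .setAppName("StreamingWordCount")
      .setMaster("local[2]")

    // Group incoming data into 5-second mini-batches
    val ssc = new StreamingContext(conf, Seconds(5))

    // DStream of text lines arriving on the socket
    val lines = ssc.socketTextStream("localhost", 9999)

    // The same operations you would apply to a batch RDD
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    counts.print()       // print a sample of each batch's counts

    ssc.start()          // start receiving and processing data
    ssc.awaitTermination()
  }
}

The chapter develops this pattern in detail, including stateful computations, window operations, and Kafka as a source and sink.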

6.1. Writing Spark Streaming applications

6.2. Using external data sources

6.3. Performance of Spark Streaming jobs

6.4. Structured Streaming

6.5. Summary
