In this chapter
- an introduction to stream processing
- differentiating between stream processing systems and other systems
“If it weren’t for the rocks in its bed, the stream would have no song.”
—Carl Perkins
In this chapter, we will try to answer a few basic questions about streaming systems, starting with “what is stream processing?” and “what are these stream processing systems, or streaming systems, used for?” The objective is to cover some basic ideas that will be discussed in later chapters.
Stream processing has been one of the most popular technologies in recent years in the big data domain. Streaming systems are computer systems that process continuous event streams.
A key characteristic of stream processing is that events are processed as soon as (or almost as soon as) they are available, to minimize the latency between an event's entrance into the streaming system and the result produced from processing it. In most cases the latency ranges from a few milliseconds to a few seconds, which can be considered real time or near real time; hence, stream processing is also called real-time processing. From the usage point of view, stream processing is typically used for analyzing different types of events, so the terms real-time analytics, streaming analytics, and event processing may also be used to refer to stream processing systems in different scenarios. In this book we use the term stream processing, which is widely adopted in the industry.
Examples of events
- The mouse clicks on a computer
- The taps and swipes on a cell phone
- The trains arriving at and leaving a station
- The messages and emails sent out by a person
- The temperatures collected by sensors in a laboratory
- The interactions on a website (page views, user logins, clicks, and so on) from all users
- The logs generated by computer servers in a data center
- The transactions of all accounts in a bank
Note that, typically, there isn’t a predetermined ending time for the events processed in streaming systems. You can think of them as never-ending; hence, the events are often considered continuous and unbounded. Events are everywhere—literally. We are living in the information age. A lot of data is generated, collected, and processed all the time.
Let’s look at two examples:
- The first example is a temperature-monitoring system in a laboratory. Many sensors are installed in different locations to collect temperature data every second. The streaming system is built to process the collected data and display the real-time information in a dashboard. It can also trigger alerts when any anomaly is detected. Laboratory administrators use the system to monitor all the rooms and make sure the temperature is in the right range.

- The second example is the monitoring and analyzing systems that process user interactions, such as page views, user logins, or button clicks on a website. When you visit a website, it is common for a lot of events to be logged. These raw events often have many fields, so they are not efficient to digest directly, and some of the fields are not human-readable and need to be translated before they can be consumed. In this context, streaming systems are very helpful for converting the raw event data into more useful information, such as the number of requests, active users, views on each page, and suspicious user behaviors.

In the examples above, streaming systems process huge numbers of events to dig out the useful information hidden in the data in real time. Streaming systems are valuable because these events contain a lot of hidden information, and in many cases getting the results in real time is critical.
A streaming system refers to a system that extracts useful information from continuous streams of events. More specifically, as we mentioned at the beginning of this section, we would like streaming systems to process the events and generate results as soon as possible after the events are collected. This is desirable because it allows the results to be available with minimal delays and the proper reactions to be performed in time. Their real-time nature makes streaming systems very useful in many scenarios, such as the laboratory and the website, where low-latency results are desired.
In the laboratory, the monitoring system can trigger alerts, start backup devices automatically, and notify the administrators, when necessary. If failed equipment is not repaired or replaced in time and the temperature is not under control, the temperature-sensitive devices and samples could be affected or damaged. Some ongoing experiments may be interrupted as well. For a website, in addition to monitoring issues, charts and dashboards generated by streaming systems could be helpful for developers to understand how users engage with the website so they can improve their products accordingly.

After seeing some examples of events and streaming systems, you should now have some ideas about what streaming systems are. The next few pages will show you how streaming systems work at a very high level by comparing them with other types of systems.
Comparison of four typical computer systems
You’ll find that stream processing systems and other computer systems have many things in common. After all, a streaming system is still a computer system. Below are a few typical systems we chose to compare:
- Applications
- Backend services
- Batch processing systems
- Stream processing systems

An application is a computer program that users interact with directly. Programs installed on your computer and apps installed on your smartphone are applications. For example, the calculator, text editor, music and video players, messenger, web browser, and games installed on a computer or smartphone are all applications. They are everywhere! Users interact with computers via all kinds of applications.
Users use applications to perform tasks. You can create a note or a book in a text editor and save it in a file. If you have a video file, you can use a video player application to open and play it. You can use a web browser to search for information, watch videos, and shop on the internet.
Inside an application
Applications vary a lot. A command-line tool, a text editor, a calculator, a photo processor, a browser, and a video game look and feel significantly different from each other. Have you ever thought of them as the same type of software? Internally, they are even more different. A simple calculator can be implemented with a few lines of code, while a web browser or a game can have millions of lines in its code base.
Despite all the differences, the basic process in most applications is similar: there is a starting point (when the application is opened), an ending point (when the application is closed), and a loop (the main loop) of the following three steps, as sketched in the example after this list:
- Get user input
- Execute logic
- Show results
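As a rough sketch of this loop, here is a tiny command-line calculator in Java (the program and its logic are illustrative, not taken from any particular application):

```java
import java.util.Scanner;

public class CalculatorApp {
    public static void main(String[] args) {
        Scanner scanner = new Scanner(System.in);
        while (true) {                                    // the main loop
            System.out.print("Enter two numbers to add (or 'quit'): ");
            String line = scanner.nextLine();             // 1. get user input
            if (line.equals("quit")) {
                break;                                    // ending point: the user closes the app
            }
            // 2. execute logic (only the happy path is handled in this sketch)
            String[] parts = line.trim().split("\\s+");
            double result = Double.parseDouble(parts[0]) + Double.parseDouble(parts[1]);
            System.out.println("Result: " + result);      // 3. show results
        }
        scanner.close();
    }
}
```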

A backend service is a computer program that runs behind the scenes. Unlike an application, a backend service doesn't interact with users directly. Instead, it responds to requests and performs specific tasks accordingly. A service is normally a long-running process that waits for incoming requests all the time.
Let's look at a simple web service as an example. When a request is received, the program parses the request, performs tasks accordingly, and, finally, responds. After a request is handled, the program waits for the next request. A web service often doesn't work alone; it works together with other services to serve requests. Services can handle requests from each other, and each one is responsible for a specific task. The figure below shows a web service and a storage service working together to serve a page request.

Inside a backend service, there is a main loop, too, but it works differently, because the requests processed by a service are quite different from the user inputs in an application. An application is normally used by a single user, so checking the user input at the beginning of the main loop is usually sufficient. In a backend service, many requests can arrive at the same time, and they can arrive at any moment. To handle requests promptly, multithreading is an important technique for this use case. A thread is a subtask executed within a process; multiple threads can exist within the context of one process, share the process's resources such as memory, and be executed concurrently.
A typical service looks like the previous diagram. When a request is received, the request handler creates a new thread to perform the real logic, and it returns immediately without waiting for the results. The time-consuming calculation (the real logic) is then performed concurrently on its own thread. This way, the main loop runs very quickly, so the new incoming requests can be accepted as soon as possible.
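To make this concrete, here is a minimal sketch of such a service, assuming a plain TCP echo server (the port number and handler logic are illustrative). The main loop only accepts connections and hands each request to a new thread, so it stays fast:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

public class EchoService {
    public static void main(String[] args) throws IOException {
        ServerSocket server = new ServerSocket(8080);      // a long-running process
        while (true) {                                     // the main loop
            Socket client = server.accept();               // wait for the next request
            new Thread(() -> handle(client)).start();      // handle it on its own thread
        }
    }

    // The time-consuming work runs here, concurrently with the main loop.
    private static void handle(Socket client) {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(client.getInputStream()));
             PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
            String request = in.readLine();                // parse the request
            out.println("echo: " + request);               // respond
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```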

Both applications and backend services are designed to serve clients (human users or remote requests) as soon as possible. Batch processing systems are different. They are not designed to respond to any input. Instead, they are designed to execute tasks at scheduled times or when resources permit.
You can see real-life examples of batch processing systems fairly often. For example, in a post office, mail is collected, sorted, transported, and delivered at scheduled times because it is more efficient this way. It would be hard to imagine a system in which someone accepts your handwritten letter, runs out the door, and tries to deliver the letter to the recipient immediately. Well, it could work, but it would be super inefficient, and you would need a really good excuse to justify the effort.
Nowadays, huge amounts of data, such as articles, emails, user interactions, and the data collected from services and devices, are generated every second. It is critical and challenging to process the data and find useful information. Batch processing systems are designed for this use case.
In a typical batch processing system, the whole process is broken into multiple steps, or stages. The stages are connected by storage that holds intermediate data.

In our example, the incoming data is processed in batches (an example could be user interaction data for each hour on a website). When new data is available (the whole batch is received and ready to be processed), stage 1 is started to load the data and execute its logic. The results are persisted in the intermediate storage for the following stages to pick up and process. After all the data in the batch is processed by the stage, the stage is shut down and the next stage (stage 2 in the diagram above) is started to execute on the intermediate results generated by stage 1. The processing is completed after the batch is processed by all the stages.
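As a rough sketch of this flow, the following Java program runs two stages one after the other, with a file standing in for the intermediate storage (the file names and the per-stage logic are illustrative):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

public class BatchJob {
    public static void main(String[] args) throws IOException {
        // Each stage starts, processes the whole batch, and shuts down
        // before the next stage begins.
        stage1(Paths.get("raw_events.txt"), Paths.get("intermediate.txt"));
        stage2(Paths.get("intermediate.txt"), Paths.get("report.txt"));
    }

    // Stage 1: normalize each raw event and persist the results to intermediate storage.
    static void stage1(Path input, Path intermediate) throws IOException {
        List<String> normalized = Files.readAllLines(input).stream()
                .map(String::trim)
                .map(String::toLowerCase)
                .collect(Collectors.toList());
        Files.write(intermediate, normalized);
    }

    // Stage 2: pick up the intermediate results, count them, and write a report.
    static void stage2(Path intermediate, Path output) throws IOException {
        long count = Files.readAllLines(intermediate).size();
        Files.write(output, List.of("events processed: " + count));
    }
}
```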
The batch processing architecture is a very powerful tool in the big data world. However, batch processing systems have one major limitation: latency.
Batch processing systems require data to be collected and stored as batches at regular intervals, such as hourly or daily, before processing starts. Any event collected in a particular time window needs to wait until the end of the window to be processed. This could be unacceptable in some cases, such as the monitoring system in a laboratory, where an alert for an anomaly might not be triggered until the next batch runs an hour later. In these cases, it is more desirable to process data immediately after it is received and get the results in real time. Stream processing systems are designed for these real-time use cases. In a stream processing system, events are processed as soon as possible after they are received.
We have used the post office as our real-world example of a batch processing system. In this system, mail is collected, transported, and delivered a few times a day at scheduled times. A real-world example of a stream processing system could be an assembly line in a factory. The assembly line has multiple steps, too, and it keeps running to accept new parts. In each step, an operation is applied to one product after another. At the end of the assembly line, the final products come out one by one.
A typical stream processing system architecture looks similar to that of a batch processing system. The whole process is broken into multiple steps called components, and data keeps flowing from component to component until all the processing steps have completed.

The major difference between stream processing systems and batch processing systems is that the components are long-running processes. They keep running and accepting new data to process. Each event is processed by the next component immediately after the previous component finishes with it. Therefore, the final results are generated shortly after an event is received by the streaming system.
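Here is a minimal sketch of this idea, with two long-running components connected by in-memory queues (the component names and logic are illustrative; real streaming frameworks provide much more, such as distribution and failure handling):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class StreamingPipeline {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> rawEvents = new LinkedBlockingQueue<>();
        BlockingQueue<String> parsedEvents = new LinkedBlockingQueue<>();

        // Component 1: a long-running process that cleans up each incoming
        // event and passes it downstream as soon as it is done.
        Thread parser = new Thread(() -> {
            while (true) {
                try {
                    String event = rawEvents.take();               // wait for the next event
                    parsedEvents.put(event.trim().toLowerCase());
                } catch (InterruptedException e) {
                    return;
                }
            }
        });

        // Component 2: another long-running process that consumes each
        // parsed event immediately and updates a running count.
        Thread counter = new Thread(() -> {
            long count = 0;
            while (true) {
                try {
                    String event = parsedEvents.take();
                    count++;
                    System.out.println("event " + count + ": " + event);
                } catch (InterruptedException e) {
                    return;
                }
            }
        });

        parser.start();
        counter.start();

        // Simulate a continuous, unbounded source; the components keep
        // running and waiting for more events to arrive.
        rawEvents.put("  Page-View  ");
        rawEvents.put("  User-Login ");
    }
}
```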
Both batch and stream processing systems have a multi-stage architecture. This architecture has a few advantages that make it suitable for data processing use cases:
- More flexible—Developers can add stages to or remove stages from their jobs as they see fit.
- More scalable—Stages are connected, but each of them runs independently of the others. If one stage becomes the bottleneck of the whole process with the existing instances (instances 1 through 3 in the diagram below), it is easy to bring up more instances (instances 4 and 5) to increase the throughput, as shown in the sketch after this list.
- More maintainable—Complicated processes can be composed from simple operations, which are easier to implement and maintain.
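As a rough illustration of the scalability point above, the sketch below runs several identical instances of one stage against a shared input queue; bringing up more instances (for example, going from three to five) increases the stage's throughput (the counts and names are illustrative):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ScalableStage {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> input = new LinkedBlockingQueue<>();
        int instanceCount = 5;    // was 3; instances 4 and 5 were added for more throughput

        for (int i = 1; i <= instanceCount; i++) {
            final int instanceId = i;
            new Thread(() -> {
                while (true) {
                    try {
                        String event = input.take();   // each instance pulls its share of events
                        System.out.println("instance " + instanceId + " processed " + event);
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            }).start();
        }

        // Feed some events; the instances share the load.
        for (int i = 0; i < 10; i++) {
            input.put("event-" + i);
        }
    }
}
```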

Batch processing systems
In batch processing systems, stages run independently of each other, and instances in the same stage also run independently of each other. This means they don't all need to be running at the same time. All the instances in the system can be executed one by one or batch by batch, as long as the execution order is correct. As a result, you can build a batch processing system to process a huge (we really mean it) amount of data with very limited resources (though it will take more time to process with fewer resources). To compensate for the overhead of persisting intermediate data, it is normally more efficient to process events in bigger batches; hourly or daily batching windows are common. The events that happen at the beginning of a window have to wait until the end of the hour or the day for the window to close before they are processed. This is the cause of the high latency.
One major advantage is that failure handling is easy with batch processing systems. In case an issue happens, such as a computer crashing or failing to read or write data, the failing step can simply be rescheduled on another machine and rerun.
Stream processing systems
On the streaming side of things, all the steps are long-running processes. Events are transferred from one to another continuously. As a result, we can't simply stop a step and rerun it when something goes wrong, so failure handling becomes more complicated. However, events are processed as soon as possible, so we get real-time results.
Let's compare the systems introduced in this section to get a better idea of how different types of computer systems work.
| Application | Backend service | Batch processing system | Stream processing system |
| --- | --- | --- | --- |
| Process user inputs | Process requests | Process data | Process data |
| Interact with users directly | Interact with clients and other services directly; interact with users indirectly | Apply operations on data; the results can be consumed by users directly or indirectly | Apply operations on data; the results can be consumed by users directly or indirectly |
| Applications are started and stopped by users | Instances of a service are long-running processes | Instances in the system are scheduled to start and stop | Instances in the system are long-running processes |
| Single main loop | Single main loop with threads | Multi-stage process | Multi-stage process |

After looking at a few different systems, let's focus on stream processing systems. From the previous section, you have learned that a streaming system consists of multiple long-running component processes.

So, are streaming systems complicated to build? The answer depends on the system you want to build. What do you want to do? How big is the traffic? How many resources do you have? How will you manage these resources? How will you recover from a failure? How will you make sure the results are correct after the recovery? There are many questions to consider when building a stream processing system. So, the answer seems to be a yes?
Well, yes, streaming systems can be fairly complicated, but they are not that hard to build either. In the next chapters, we are going to learn how to build streaming systems and how they work internally. Are you ready?
In this chapter, we learned that stream processing is a data processing technology that processes continuous events to get real-time results. We also studied and compared typical architectures of four different types of computer systems to understand how stream processing systems differ from the others:
- Applications
- Backend services
- Batch processing systems
- Stream processing systems
Exercise: Can you think of more examples of applications, services, batch processing systems, and stream processing systems?