14 Data-processing pipelines


This chapter covers

  • Organizing data-processing pipelines via streaming data
  • Efficient input/output
  • Parsing data
  • Manipulating complex data structures

Many applications share a similar structure: they read data from a source, transform it into a form suitable for processing, process it, prepare the results, and present those results to the user. We call this structure a data-processing pipeline. The stockquotes project from chapter 3 was a simple example of such an application. In real-world applications, we need to take special care to keep performance and memory usage under control.
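To make this structure concrete, the following sketch expresses a pipeline as a composition of stage functions. This is a minimal illustration only: the types and stage names here (RawInput, Record, Result, readSource, and so on) are hypothetical placeholders for this sketch, not the actual stockquotes code from chapter 3.

{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import qualified Data.Text.IO as TIO

-- Hypothetical placeholder types, for illustration only
type RawInput = Text
data Record = Record          -- one parsed data item
newtype Result = Result Text  -- prepared output

-- Stage 1: read the data source
readSource :: FilePath -> IO RawInput
readSource = TIO.readFile

-- Stage 2: transform raw input into structured records (stubbed)
parseRecords :: RawInput -> [Record]
parseRecords _ = []

-- Stage 3: process the records (stubbed)
processRecords :: [Record] -> Result
processRecords _ = Result "report"

-- Stage 4: present the results to the user
present :: Result -> IO ()
present (Result t) = TIO.putStrLn t

-- The pipeline itself is just the composition of all the stages.
pipeline :: FilePath -> IO ()
pipeline path = readSource path >>= present . processRecords . parseRecords

Composing stages as plain functions works, but as we'll see shortly, it forces us to hold intermediate data in memory all at once; that is exactly the problem streaming solves.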

Figure 14.1 presents a pipeline that starts from a data source and ends with the resulting output. The term pipeline suggests a series of stages, and we do different things at each of them: every stage receives data in some form and produces data in some new form as a result.

Figure 14.1 General data-processing pipeline

In this chapter, we’ll discuss Haskell tools that allow us to implement data-processing pipelines. Although the particular stages can be quite application-specific, we still need a way to organize them. We’ll start by discussing the problems that arise in naive implementations and then look at streaming as a solution for organizing pipelines and their stages.
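As a preview of that approach, here is a tiny example built with the streaming package (covered in section 14.1.2). The stages themselves are trivial placeholders chosen for this sketch; the point is the shape of the pipeline: each stage consumes a stream and produces a new one, so the whole pipeline runs element by element instead of materializing intermediate collections.

import qualified Streaming.Prelude as S

-- A toy pipeline: source -> transform -> process -> present.
-- The numeric stages stand in for real parsing and processing.
main :: IO ()
main = S.print                  -- present the results
     $ S.map (* 2)              -- process the data
     $ S.filter even            -- transform/select the data
     $ S.each [1 .. 10 :: Int]  -- the data source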

14.1 Streaming data

14.1.1 General components and naive implementation

14.1.2 The streaming package

14.2 Approaching an implementation of pipeline stages

14.2.1 Reading and writing data efficiently

14.2.2 Parsing data with parser combinators

14.2.3 Accessing data with lenses

14.3 Example: Processing COVID-19 data

14.3.1 The task

14.3.2 Processing data

14.3.3 Organizing the pipeline

Summary