4 Scaling data pipelines with Kubernetes

 

This chapter covers

  • Understanding data pipelines
  • Transitioning data engineering from a notebook to a pipeline
  • Running data pipelines with Apache Airflow
  • Distributed data processing pipelines using Apache Spark

In the previous chapter, you learned how to create a scalable Jupyter notebook environment on Kubernetes. By combining JupyterHub with Keycloak, you empowered users to create personalized notebook servers tailored to their specific needs while ensuring secure access and data persistence.

In this chapter, we will explore how Kubernetes helps you scale data pipelines. As you learned in Chapter 1, the efficacy of an ML model depends on the quality of data used to train the model. To prepare data for model training, data scientists perform a series of data processing steps to collect, cleanse, and structure data. This process is commonly known as data engineering.
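To make these steps concrete, the following sketch shows the kind of collect, cleanse, and structure work a data scientist might perform in a notebook with pandas. The file and column names are purely illustrative assumptions, not part of the example we build later in this chapter.

import pandas as pd

# Collect: read raw records exported from a source system
# (the file and column names below are illustrative only)
raw = pd.read_csv("raw_orders.csv")

# Cleanse: remove duplicates and rows missing required fields
clean = raw.drop_duplicates().dropna(subset=["order_id", "customer_id", "amount"])

# Structure: normalize types and aggregate into features for model training
clean["order_date"] = pd.to_datetime(clean["order_date"])
features = (
    clean.groupby("customer_id")["amount"]
    .agg(total_spent="sum", order_count="count")
    .reset_index()
)

features.to_parquet("customer_features.parquet")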

Data pipelines are crucial in data engineering, serving as the backbone for collecting, processing, and delivering data for analysis and model training. They automate the ingestion of data from various sources, such as databases, APIs, streaming platforms, and files. They handle the extraction and integration of structured, semi-structured, and unstructured data from disparate sources into a centralized location for further processing.
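As a first, purely illustrative sketch of such a pipeline, the snippet below pulls records from two hypothetical sources, an HTTP API and a CSV export, integrates them, and loads the result into a local SQLite file standing in for a centralized store. The URLs, file names, and table name are assumptions made for illustration; the rest of this chapter shows how to run steps like these as tasks in an Airflow DAG.

import sqlite3

import pandas as pd
import requests


def extract_from_api(url: str) -> pd.DataFrame:
    """Ingest semi-structured JSON records from a REST API."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return pd.DataFrame(response.json())


def extract_from_csv(path: str) -> pd.DataFrame:
    """Ingest structured records from a flat-file export."""
    return pd.read_csv(path)


def integrate(frames: list) -> pd.DataFrame:
    """Combine records from disparate sources into one consistent table."""
    return pd.concat(frames, ignore_index=True).drop_duplicates()


def load(df: pd.DataFrame, db_path: str) -> None:
    """Deliver the integrated data to a central store for further processing."""
    with sqlite3.connect(db_path) as conn:
        df.to_sql("orders", conn, if_exists="replace", index=False)


# One end-to-end run; every name here is a placeholder
load(
    integrate([
        extract_from_api("https://example.com/api/orders"),
        extract_from_csv("legacy_orders.csv"),
    ]),
    "warehouse.db",
)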

4.1 Understanding the basics of data processing

4.2 Data processing in pipelines

4.3 Introducing Apache Airflow

4.4 Installing Airflow

4.4.1 Configuring Keycloak to authenticate Airflow users

4.4.2 Deploying Airflow using Helm

4.4.3 Exploring the Airflow architecture

4.4.4 Executors

4.4.5 Airflow Operators

4.5 Creating your first DAG

4.6 Loading DAGs in Airflow

4.7 Running your first DAG

4.8 Passing data between tasks

4.8.1 Passing large amounts of data between tasks

4.9 Transitioning from notebook to pipeline

4.9.1 Providing secrets to tasks

4.9.2 Handling dependencies

4.9.3 Adding a data processing step to the pipeline

4.10 Configuring the default properties of worker Pods

4.11 Managing resources

4.11.1 Airflow Pools

4.11.2 Limiting resources in Kubernetes

4.12 Using Apache Spark in pipelines

4.12.1 Running Spark jobs on Kubernetes

4.12.2 Creating Airflow DAGs for Spark jobs

4.13 Architecting for scale

4.14 Summary