4 Scaling data pipelines with Kubernetes

This chapter covers

  • Understanding data pipelines
  • Transitioning data engineering from a notebook to a pipeline
  • Running data pipelines with Apache Airflow
  • Distributed data processing pipelines using Apache Spark

In the previous chapter, we created a scalable notebook environment using Kubernetes. Through a series of steps, we deployed a production-grade Kubernetes cluster that seamlessly integrated with compute, storage, and networking services. The cluster was set up with autoscaling nodes, ensuring efficient resource utilization and accommodating fluctuating workloads with ease.

We created a notebook environment for data scientists and secured it using Keycloak, which provides a centralized authentication and user management solution. This combination empowers users to create personalized notebook servers tailored to their specific needs, while ensuring secure access and data persistence. As more users access the notebook environment, the cluster dynamically scales up to support the increasing demand, and when users log out, it gracefully scales down, optimizing resource utilization.

4.1 Understanding the basics of data processing

4.2 Data processing in pipelines

4.3 Introducing Apache Airflow

4.3.1 What are DAGs?

4.4 Installing Airflow

4.4.1 Configuring Keycloak to authenticate Airflow users

4.4.2 Deploying Airflow using Helm

4.4.3 Exploring Airflow architecture

4.4.4 Executors

4.4.5 Operators

4.5 Creating your first DAG

4.6 Loading DAGs in Airflow

4.7 Running your first DAG

4.8 Passing data between tasks

4.8.1 Passing large amounts of data between tasks

4.9 Transitioning from notebook to pipeline

4.9.1 Providing secrets to tasks

4.9.2 Handling dependencies

4.9.3 Adding a data processing step to the pipeline

4.10 Configuring the default properties of worker Pods

4.11 Managing resources

4.11.1 Airflow Pools

4.11.2 Capping resources in Kubernetes

4.12 Using Apache Spark in pipelines

4.12.1 Running Spark jobs on Kubernetes

4.12.2 Creating Airflow DAGs for Spark jobs

4.13 Architecting for scale

4.14 Summary