4 Scaling data pipelines with Kubernetes
This chapter covers
- Understanding data pipelines
- Transitioning data engineering from a notebook to a pipeline
- Running data pipelines with Apache Airflow
- Building distributed data processing pipelines with Apache Spark
In the previous chapter, you learned how to create a scalable Jupyter notebook environment on Kubernetes. By combining JupyterHub with Keycloak, we enabled users to create personalized notebook servers tailored to their needs while ensuring secure access and data persistence.
In this chapter, we will explore how Kubernetes helps you scale data pipelines. As you learned in Chapter 1, the efficacy of an ML model depends on the quality of data used to train the model. To prepare data for model training, data scientists perform a series of data processing steps to collect, cleanse, and structure data. This process is commonly known as data engineering.
Data pipelines are central to data engineering, serving as the backbone for collecting, processing, and delivering data for analysis and model training. They automate the ingestion of data from various sources, such as databases, APIs, streaming systems, and files. They handle the extraction and integration of structured, semi-structured, and unstructured data from these disparate sources into a centralized location for further processing.
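To make these stages concrete, here is a minimal sketch of a single extract-transform-load (ETL) step in plain Python. The source file raw_events.csv, the column names, the events table, and the local SQLite file standing in for a centralized store are all hypothetical placeholders, not part of any framework used later in the chapter.

```python
# A minimal ETL sketch: extract raw records, clean them, and load them
# into a central store. All file, column, and table names are hypothetical.
import sqlite3

import pandas as pd


def extract(path: str) -> pd.DataFrame:
    """Extract: read raw records from a CSV source."""
    return pd.read_csv(path)


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Transform: drop incomplete rows and normalize a text column."""
    df = df.dropna(subset=["user_id", "event_type"])
    df["event_type"] = df["event_type"].str.lower().str.strip()
    return df


def load(df: pd.DataFrame, db_path: str) -> None:
    """Load: append the cleaned records to a table in a SQLite file."""
    with sqlite3.connect(db_path) as conn:
        df.to_sql("events", conn, if_exists="append", index=False)


if __name__ == "__main__":
    raw = extract("raw_events.csv")   # hypothetical raw data source
    clean = transform(raw)
    load(clean, "warehouse.db")       # hypothetical centralized store
```

In a notebook, these steps often live in separate cells; expressing them as functions with explicit inputs and outputs is what makes it straightforward to move the same logic into the pipeline tools covered in this chapter, such as Apache Airflow and Apache Spark.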