4 Scaling data pipelines with Kubernetes
This chapter covers
- Understanding data pipelines
- Transitioning data engineering from a notebook to a pipeline
- Running data pipelines with Apache Airflow
- Building distributed data processing pipelines with Apache Spark
In the previous chapter, we created a scalable notebook environment using Kubernetes. Through a series of steps, we deployed a production-grade Kubernetes cluster that integrated seamlessly with compute, storage, and networking services. The cluster was set up with autoscaling nodes, ensuring efficient resource utilization and accommodating fluctuating workloads with ease.
We also created a notebook environment for data scientists and secured it with Keycloak, which provides centralized authentication and user management. This combination lets users create personalized notebook servers tailored to their specific needs while ensuring secure access and data persistence. As more users access the notebook environment, the cluster scales up dynamically to meet the increased demand; when users log out, it scales back down, optimizing resource utilization.
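As a quick refresher, the snippet below is a minimal sketch of what that integration can look like, assuming the notebook environment is JupyterHub deployed with the Zero to JupyterHub Helm chart and Keycloak acting as an OpenID Connect provider. The hostnames, realm name, and client credentials shown here are placeholders, not the values from our cluster:

  # A minimal sketch: JupyterHub Helm values that delegate login to Keycloak
  # over OpenID Connect. Hostnames, realm, and secrets are placeholders.
  hub:
    config:
      JupyterHub:
        authenticator_class: oauthenticator.generic.GenericOAuthenticator
      GenericOAuthenticator:
        client_id: jupyterhub                    # OIDC client registered in Keycloak
        client_secret: "<client-secret>"         # in practice, injected from a Kubernetes Secret
        oauth_callback_url: https://notebooks.example.com/hub/oauth_callback
        authorize_url: https://keycloak.example.com/realms/data/protocol/openid-connect/auth
        token_url: https://keycloak.example.com/realms/data/protocol/openid-connect/token
        userdata_url: https://keycloak.example.com/realms/data/protocol/openid-connect/userinfo
        username_claim: preferred_username       # map the Keycloak username onto JupyterHub users

With this foundation in place, we can now turn from interactive notebooks to the pipelines that run data workloads at scale.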