chapter seven

7 Data analysis and preparation

This chapter covers

Building and launching images for Kubeflow notebooks
Using Kubeflow notebooks for data analysis
Passing data in Kubeflow Pipelines
Writing Kubeflow components that pass data
Developing the data preparation pipeline for object detection

The landscape of machine learning (ML) is ever-evolving, with new developments surfacing every other week. During the era when deep learning took center stage, innovations such as new versions of You Only Look Once (YOLO) and ResNet became the talk of the town. Nowadays (at least at the time we wrote this), large language models (LLMs) and visual language models (VLMs) dominate the conversation, thanks to their performance and wide range of applications.

While new architectures and techniques constantly capture the limelight, their success often rests on arguably the least sexy but most important part of ML: data preparation. “Garbage in, garbage out” isn’t just a line that grumpy ML engineers mutter. Rather, it captures a fundamental truth: the quality and integrity of your input data ultimately shape the reliability and efficacy of your ML model and its results.

7.1 Data analysis

7.1.1 Launching a notebook server in Kubeflow

7.1.2 Workspace and data volumes

7.1.3 Configurations and affinity/tolerations

7.1.4 Customizing the menu

7.1.5 Creating a custom Kubeflow notebook image

7.2 Data passing

7.2.1 Scenario 1: Passing simple values to downstream components

7.2.2 Scenario 2: Passing paths for larger data

7.2.3 Overview of KFP v2 artifact types

7.3 Data preparation in action

7.3.1 Data preparation: Object detection

7.3.2 Data preparation: Movie recommender

Summary