chapter seven

7 Data analysis and preparation

 

This chapter covers

  • Building and launching images for Kubeflow notebooks
  • Using Kubeflow notebooks for data analysis
  • Data passing in Kubeflow Pipelines
  • Writing Kubeflow components that pass data
  • Developing the Data Preparation pipeline for Object Detection

The landscape of Machine Learning is ever-evolving, with new developments surfacing every other week. During the era when Deep Learning took center stage, innovations like new versions of YOLO (You Only Look Once) and ResNet became the talk of the town. Nowadays (at least at the time of writing), Large Language Models (LLMs) and Visual Language Models (VLMs) hold the spotlight thanks to their performance and wide range of applications.

While new architectures and techniques constantly capture the limelight, their success often lies with arguably the least sexy but most important part of Machine Learning: Data Preparation. "Garbage in, garbage out" isn't just a line that grumpy ML engineers mutter. Rather, it captures the fundamental truth that the quality and integrity of your input data ultimately shape the reliability and efficacy of your machine learning models and their results.

7.1 Data analysis

7.1.1 Launching a notebook server in Kubeflow

7.1.2 Workspace and data volumes

7.1.3 Configurations and affinity / tolerations

7.1.4 Customizing the menu

7.1.5 Creating a custom Kubeflow notebook image

7.2 Data Passing

7.2.1 Scenario 1: Passing Simple Values to Downstream Components

7.2.2 Scenario 2: Passing Paths for Larger Data

7.2.3 Overview of KFP v2 Artifact Types

7.3 Data Preparation in action

7.3.1 Data preparation: Object detection

7.3.2 Data preparation: Movie recommender

7.4 Summary