3 Introducing Metaflow

 

This chapter covers

  • Defining a workflow in Metaflow that accepts input data and produces useful outputs
  • Optimizing the performance of workflows with parallel computation on a single instance
  • Analyzing the results of workflows in notebooks
  • Developing a simple end-to-end application in Metaflow

Now that we have a development environment set up, you are probably eager to roll up your sleeves and start writing actual code. In this chapter, you will learn the basics of developing data science applications using Metaflow, a framework that shows how different layers of the infrastructure stack can work together seamlessly.

The development environment, which we discussed in the previous chapter, determines how the data scientist develops applications: by writing code in an editor, evaluating it in a terminal, and analyzing results in a notebook. On top of this toolchain, the data scientist uses Metaflow to determine what code gets written and why, which is the topic of this chapter. The next chapters will then cover the infrastructure that determines where and when the workflows are executed.

We will introduce Metaflow from the ground up. You will first learn the syntax and the core concepts that allow you to define simple workflows in Metaflow. After this, we will introduce branches, a straightforward way to embed concurrency in workflows, which often improves performance through parallel computation.
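To make the idea of a branch concrete before we get to Metaflow's own syntax, here is a minimal sketch in plain Python. It is not Metaflow code: the function names (`start`, `branch_sum`, `branch_max`, `join`, `run_workflow`) are hypothetical and chosen only for illustration, and the standard library's `concurrent.futures` stands in for whatever executes the branches. The point is the shape of the workflow: two steps that do not depend on each other can run side by side, and a join step merges their results.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical steps of a tiny workflow (illustrative names, not Metaflow API).
def start():
    """Produce the input data shared by both branches."""
    return list(range(1, 6))


def branch_sum(data):
    """First branch: compute the sum of the data."""
    return sum(data)


def branch_max(data):
    """Second branch: compute the maximum of the data."""
    return max(data)


def join(total, largest):
    """Merge the results of the two branches."""
    return {"sum": total, "max": largest}


def run_workflow():
    data = start()
    # The two branches are independent of each other,
    # so they can be submitted for concurrent execution.
    with ThreadPoolExecutor(max_workers=2) as pool:
        sum_future = pool.submit(branch_sum, data)
        max_future = pool.submit(branch_max, data)
        # The join waits for both branches before merging their results.
        return join(sum_future.result(), max_future.result())


if __name__ == "__main__":
    print(run_workflow())
```

Threads are used here only to show the structure. For CPU-bound Python code, an actual speedup typically requires separate processes or machines, which is the kind of detail a workflow framework can handle on your behalf.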

3.1 The basics of Metaflow

3.1.1 Installing Metaflow

3.1.2 Writing a basic workflow

3.1.3 Managing data flow in workflows

3.1.4 Parameters

3.2 Branching and merging

3.2.1 Valid DAG structures

3.2.2 Static branches

3.2.3 Dynamic branches

3.2.4 Controlling concurrency

3.3 Metaflow in action

3.3.1 Starting a new project

3.3.2 Accessing results with the Client API