chapter two

2 System Design and Data Pipelining

This chapter covers:

Architecting a graph schema and data pipeline to requirements
Working with various raw data sources and transforming them for training
Taking an example dataset from raw data through preprocessing
Creating datasets and data loaders with PyTorch Geometric

The general principles of designing machine learning systems and building data pipelines are covered extensively elsewhere. Developing ML systems for graph data, however, raises additional considerations, and this chapter explains some of them.

Section 2.2, on system design, discusses choosing a data model and schema. Section 2.3 walks through an example data pipeline from raw data to preprocessing.
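As a rough preview of what "from raw data to preprocessing" can mean for graph data, the following is a minimal sketch using only the Python standard library. The records, names, and helper functions (`extract_edges`, `build_node_index`, `to_coo`) are hypothetical and are not the chapter's dataset or code. The final step produces a pair of source and target index lists, which is the coordinate (COO) layout that PyTorch Geometric uses for its `edge_index` attribute.

```python
# Hypothetical raw records: each person and the people they list as connections.
# This toy data is an assumption for illustration, not the chapter's dataset.
raw_records = [
    {"name": "ana", "connections": ["ben", "cho"]},
    {"name": "ben", "connections": ["ana"]},
    {"name": "cho", "connections": ["ana", "dev"]},
    {"name": "dev", "connections": []},
]


def extract_edges(records):
    """Turn raw records into a sorted, deduplicated list of undirected edges."""
    edges = set()
    for record in records:
        for other in record["connections"]:
            if other != record["name"]:  # skip self-loops
                # Sorting the pair makes (a, b) and (b, a) the same edge.
                edges.add(tuple(sorted((record["name"], other))))
    return sorted(edges)


def build_node_index(records, edges):
    """Map every node name to a contiguous integer ID."""
    names = {record["name"] for record in records}
    for u, v in edges:
        names.update((u, v))  # include nodes that appear only as connections
    return {name: i for i, name in enumerate(sorted(names))}


def to_coo(edges, node_index):
    """Return [sources, targets], storing each undirected edge in both directions."""
    sources, targets = [], []
    for u, v in edges:
        sources += [node_index[u], node_index[v]]
        targets += [node_index[v], node_index[u]]
    return [sources, targets]


edges = extract_edges(raw_records)
node_index = build_node_index(raw_records, edges)
edge_index = to_coo(edges, node_index)

print(edges)
print(node_index)
print(edge_index)
```

Storing each undirected edge in both directions is one common convention for undirected graphs. A directed schema would keep only the original direction, which is the kind of choice section 2.2 discusses.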

We use our social graph dataset to illustrate these ideas. In section 2.1, we give our dataset a backstory and present the raw data from which it was created.

Code from this chapter is available in notebook form in the GitHub repository and in Colab. The data is available in the same locations.

2.1 Social Graph Example

We return to the dataset introduced earlier, the professional social network. We’ve already discovered some information about this dataset, summarized in table 2.1. But where did this data come from?

Figure 2.1. Visualization of the social network used in our example.


2.2 GNN System Planning

2.2.1 Project Objectives and Scope

2.2.2 Designing Graph Data Models and Schema

2.3 A Data Pipeline Example

2.3.1 Raw Data

2.3.2 ETL

2.3.3 Data Exploration and Visualization

2.3.4 Preprocessing: PyTorch Geometric

2.4 Summary

2.5 References