2 System Design and Data Pipelining
This chapter covers:
- Architecting a graph schema and data pipeline to requirements
- Working with various raw data sources and transforming them for training
- Taking an example dataset from raw data through preprocessing
- Creating datasets and data loaders with PyTorch Geometric
The general principles of designing machine learning systems and building data pipelines are covered extensively elsewhere. However, ML systems built for graph data require additional considerations, and this chapter explains some of them.
We use our social graph dataset to illustrate these ideas. In section 2.1, we give the dataset a backstory and present the raw data from which it was created.
Section 2.2, on system design, discusses choosing a data model and schema. Section 2.3 walks through an example data pipeline, from raw data through preprocessing.
Code from this chapter is available in notebook form in the GitHub repository and in Colab. The data used in this chapter is available in the same locations.
2.1 Social Graph Example
We return to the professional social network dataset introduced earlier. We’ve already discovered some information about this dataset, summarized in table 2.1. But where did this data come from?
Figure 2.1. Visualization of the social network used in our example.