10 Combining data sources into a unified dataset

 

This chapter covers

  • Loading and processing raw data files
  • Implementing a Python class to represent our data
  • Converting our data into a format usable by PyTorch
  • Visualizing the training and validation data

Now that we’ve discussed the high-level goals for part 2 and outlined how data will flow through our system, let’s get into the specifics of what we’re going to do in this chapter. It’s time to implement basic data-loading and data-processing routines for our raw data. Just about every significant project you work on will need something analogous to what we cover here.1 Figure 10.1 shows the high-level map of our project from chapter 9. We’ll focus on step 1, data loading, for the rest of this chapter.

1. To the rare researcher who has all of their data well prepared for them in advance: lucky you! The rest of us will be busy writing code for loading and parsing.

Figure 10.1 Our end-to-end lung cancer detection project, with a focus on this chapter’s topic: step 1, data loading
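As a preview of where this chapter’s data loading is headed, the sketch below shows the general shape of a PyTorch Dataset subclass: the two methods __len__ and __getitem__ are what PyTorch genuinely requires, while the class name, the candidate records, and the tensor shapes are hypothetical placeholders rather than the implementation we build later in the chapter.

import torch
from torch.utils.data import Dataset

class CandidateDataset(Dataset):  # hypothetical name, not the chapter's actual class
    """Skeleton of the two methods every PyTorch Dataset must provide."""

    def __init__(self, candidate_list):
        # candidate_list stands in for the parsed annotation/candidate
        # records we will build from the raw LUNA files in this chapter.
        self.candidate_list = candidate_list

    def __len__(self):
        # PyTorch uses this to know how many samples the dataset holds.
        return len(self.candidate_list)

    def __getitem__(self, ndx):
        # Each item pairs an input tensor with a label. The values here are
        # dummies purely to illustrate the expected (sample, label) return.
        candidate = self.candidate_list[ndx]
        sample = torch.zeros(1, 32, 48, 48)  # placeholder voxel chunk
        label = torch.tensor(candidate.get("isNodule", 0), dtype=torch.long)
        return sample, label

With a class like this in hand, a torch.utils.data.DataLoader can batch and shuffle the samples for training; the rest of the chapter is about producing the real candidate records and CT voxel data that such a dataset would return.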

10.1 Raw CT data files

10.2 Parsing LUNA’s annotation data

10.2.1 Training and validation sets

10.2.2 Unifying our annotation and candidate data

10.3 Loading individual CT scans

10.3.1 Hounsfield Units

10.4 Locating a nodule using the patient coordinate system

10.4.1 The patient coordinate system

10.4.2 CT scan shape and voxel sizes

10.4.3 Converting between millimeters and voxel addresses

10.4.4 Extracting a nodule from a CT scan

10.5 A straightforward dataset implementation

10.5.1 Caching candidate arrays with the getCtRawCandidate function
