chapter twelve

12 Combining data sources into a unified dataset

This chapter covers

Loading and processing raw data files
Implementing a Python class to represent our data
Converting our data into a format usable by PyTorch
Visualizing the training and validation data

Now that we’ve discussed the high-level goals for our project and outlined how the data will flow through our system, let’s get into the specifics of what we’re going to do in this chapter. It’s time to implement basic data-loading and data-processing routines for our raw data. The techniques we cover here are foundational and will be applicable to any major project you undertake.^[1] Figure 12.1 shows the high-level map of our project from chapter 11. We’ll focus on step 1, data loading, for the rest of this chapter.

Figure 12.1 Our end-to-end lung cancer detection project, with a focus on this chapter’s topic: step 1, data loading

Our goal is to be able to produce a training sample given our inputs of raw CT scan data and a list of annotations for those CTs. This might sound simple, but quite a bit needs to happen before we can load, process, and extract the data we’re interested in. Figure 12.2 shows what we’ll need to do to turn our raw data into a training sample. Luckily, we got a head start on understanding our data in the last chapter, but we have more work to do on that front as well.

Figure 12.2 The data transforms required to make a sample tuple. These sample tuples will be used as input to our model training routine.

12.1 Raw CT data files

12.2 Parsing LUNA’s annotation data

12.2.1 Training and validation sets

12.2.2 Unifying our annotation and candidate data

12.3 Loading individual CT scans

12.3.1 Hounsfield Units

12.4 Locating a nodule using the patient coordinate system

12.4.1 The patient coordinate system

12.4.2 CT scan shape and voxel sizes

12.4.3 Converting between millimeters and voxel addresses

12.4.4 Extracting a nodule from a CT scan

12.5 Straightforward dataset implementation

12.5.1 Caching candidate arrays with the getCtRawCandidate function

12.5.2 Constructing our dataset in LunaDataset.__init__

12.5.3 A training/validation split

12.5.4 Rendering the data
