10 Ready, Dataset, Go!

This chapter covers:

Loading and processing our raw data files; these are the annotations that describe the location of potentially malignant parts of a CT scan
Implementing a Python class to represent our data to the rest of our project; for us, this will be the Ct class
Converting our data into a format usable by PyTorch by implementing a Dataset subclass; the LunaDataset class will combine the CT and annotation data and convert it into tensors
Visualizing the data we will be using as training and validation data for the project

Now that we’ve covered the larger project for part 2, let’s get into specifics about what we’re going to do here in chapter 10. It’s time to implement basic data loading and processing routines for our raw data. Nearly every significant project you work on will need something analogous to what we cover here.^[95] Figure 10.1 shows the high-level map of our project from chapter 9. For the rest of this chapter, we’ll focus on step 1, data loading.

Figure 10.1. Our end-to-end lung cancer detection project, with a focus on this chapter’s topic: step 1, data loading.

10.1 Parsing LUNA’s annotation data

10.2 Loading individual CT scans

10.2.1 Hounsfield Units

10.3 Locating a nodule using the patient coordinate system

10.3.1 Extracting a nodule from a CT scan

10.4 A straightforward Dataset implementation

10.4.1 Caching nodule arrays with the `getCtRawNodule` function

10.4.2 Constructing our dataset in `LunaDataset.__init__`

10.4.3 A training/validation split

10.4.4 Rendering the data

10.5 Conclusion

10.6 Exercises

10.7 Summary