12 Combining data sources into a unified dataset

This chapter covers

  • Loading and processing raw data files
  • Implementing a Python class to represent our data
  • Converting our data into a format usable by PyTorch
  • Visualizing the training and validation data

Now that we’ve discussed the high-level goals for our project, as well as outlined how the data will flow through our system, let’s get into the specifics of what we’re going to do in this chapter. It’s time to implement basic data-loading and data-processing routines for our raw data. The techniques we cover here are foundational and will be applicable to any major project you undertake. To the rare researcher who has all their data well prepared for them in advance: lucky you! The rest of us will be busy writing code for loading and parsing. Figure 12.1 shows the high-level map of our project from chapter 11. We’ll focus on step 1, data loading, for the rest of this chapter.
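To preview where this chapter is headed, the class we build must speak the protocol PyTorch's data-loading machinery expects: an object with `__len__` and `__getitem__`. The sketch below uses plain Python with hypothetical stand-in samples so it is self-contained; in the chapter itself, `LunaDataset` will subclass `torch.utils.data.Dataset` and return real candidate arrays.

```python
# Minimal sketch of the dataset protocol PyTorch relies on: a class
# exposing __len__ and __getitem__. The (data, label) pairs here are
# hypothetical placeholders, not real CT candidate data.

class ToyCandidateDataset:
    def __init__(self, samples):
        # samples: a list of (candidate_data, label) tuples
        self.samples = list(samples)

    def __len__(self):
        # Number of candidates in the dataset
        return len(self.samples)

    def __getitem__(self, ndx):
        # Return one (candidate_data, label) sample by index
        return self.samples[ndx]

ds = ToyCandidateDataset([([0.1, 0.2], 0), ([0.3, 0.4], 1)])
print(len(ds))  # 2
print(ds[1])    # ([0.3, 0.4], 1)
```

Anything that implements these two methods can be indexed and iterated by a `torch.utils.data.DataLoader`, which is why the bulk of this chapter's work goes into making `__getitem__` produce clean, correctly labeled samples.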

Figure 12.1 Our end-to-end lung-cancer-detection project, with a focus on this chapter’s topic: step 1, data loading

12.1 Raw CT data files

12.2 Parsing LUNA’s annotation data

12.2.1 Training and validation sets

12.2.2 Unifying our annotation and candidate data

12.3 Loading individual CT scans

12.3.1 Hounsfield Units

12.4 Locating a nodule using the patient coordinate system

12.4.1 The patient coordinate system

12.4.2 CT scan shape and voxel sizes

12.4.3 Converting between millimeters and voxel addresses

12.4.4 Extracting a nodule from a CT scan

12.5 Straightforward dataset implementation

12.5.1 Caching candidate arrays with the getCtRawCandidate function

12.5.2 Constructing our dataset in LunaDataset.__init__

12.5.3 A training/validation split

12.5.4 Rendering the data

12.6 Conclusion

12.7 Exercises

Summary