Chapter 4. Generating features

 

This chapter covers

  • Extracting features from raw data
  • Transforming features to make them more useful
  • Selecting among the features you’ve created
  • Organizing feature-generation code

This chapter is the next step on our journey through the components, or phases, of a machine learning system, shown in figure 4.1. The chapter focuses on turning raw data into useful representations called features. The process of building systems that can generate features from data, sometimes called feature engineering, can be deceptively complex. Often, people begin with an intuitive understanding of what they want the features used in a system to be, with few plans for how those features will be produced. Without a solid plan, the process of feature engineering can easily get off track, as you saw in the Sniffable example from chapter 1.

Figure 4.1. Phases of machine learning

In this chapter, I’ll guide you through the three main types of operations in a feature pipeline: extraction, transformation, and selection. Not every system performs all three types of operations, but all feature engineering techniques can be thought of as falling into one of these three buckets. I’ll use type signatures to assign techniques to groups and give our exploration some structure, as shown in table 4.1.

Table 4.1. Phases of feature generation

Phase      Input         Output
Extract    RawData       Feature
Transform  Feature       Feature
Select     Set[Feature]  Set[Feature]
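
To make the type signatures in table 4.1 concrete, here is a minimal Scala sketch with one function per phase. The names RawData, Feature, extractLength, logTransform, and dropZeros are hypothetical, chosen only for illustration. They aren't part of Spark ML or any other library, and the features used later in the chapter may be modeled differently.

```scala
// A sketch under stated assumptions: every type and function here is
// hypothetical and exists only to illustrate the signatures in table 4.1.
object FeaturePhases {
  // Stand-in for unprocessed input, such as a line of text.
  final case class RawData(text: String)

  // Stand-in for a named numeric feature.
  final case class Feature(name: String, value: Double)

  // Extract: RawData => Feature
  // Derives a feature (the character count) from raw data.
  val extractLength: RawData => Feature =
    raw => Feature("length", raw.text.length.toDouble)

  // Transform: Feature => Feature
  // Produces a new feature from an existing one (log(1 + x) scaling).
  val logTransform: Feature => Feature =
    f => Feature(s"log1p(${f.name})", math.log1p(f.value))

  // Select: Set[Feature] => Set[Feature]
  // Keeps a subset of the features (here, those with nonzero values).
  val dropZeros: Set[Feature] => Set[Feature] =
    features => features.filter(_.value != 0.0)
}
```

Because each phase is an ordinary function, the phases compose directly: FeaturePhases.extractLength.andThen(FeaturePhases.logTransform) has the type RawData => Feature, so a chain of extraction and transformation still fits the extract row of the table.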

4.1. Spark ML

4.2. Extracting features

4.3. Transforming features

4.4. Selecting features

4.5. Structuring feature code

4.6. Applications

4.7. Reactivities

Summary