Chapter 4. Generating features

This chapter covers

Extracting features from raw data
Transforming features to make them more useful
Selecting among the features you’ve created
Organizing feature-generation code

This chapter is the next step on our journey through the components, or phases, of a machine learning system, shown in figure 4.1. The chapter focuses on turning raw data into useful representations called features. Building systems that generate features from data, sometimes called feature engineering, can be deceptively complex. People often begin with an intuitive sense of which features a system should use but little plan for how those features will be produced. Without a solid plan, feature engineering can easily get off track, as you saw in the Sniffable example from chapter 1.

Figure 4.1. Phases of machine learning

In this chapter, I’ll guide you through the three main types of operations in a feature pipeline: extraction, transformation, and selection. Not all systems do all the types of operations shown in this chapter, but all feature engineering techniques can be thought of as falling into one of these three buckets. I’ll use type signatures to assign techniques to groups and give our exploration some structure, as shown in table 4.1.

Table 4.1. Phases of feature generation

Phase	Input	Output
Extract	RawData	Feature
Transform	Feature	Feature
Select	Set[Feature]	Set[Feature]
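
The type signatures in table 4.1 can be read as ordinary function types. The following is a minimal sketch in Scala, not the chapter's actual implementation: the RawData and Feature case classes, their fields, and the three example functions are hypothetical names introduced only to illustrate the shape of each phase.

```scala
// Hypothetical types for illustration only; the field names are assumptions.
final case class RawData(text: String)
final case class Feature(name: String, value: Double)

object FeaturePhases {
  // Extract: RawData => Feature
  // Derives a single feature (the text length) from a piece of raw data.
  val extractLength: RawData => Feature =
    raw => Feature("length", raw.text.length.toDouble)

  // Transform: Feature => Feature
  // Produces a new feature from an existing one by applying log(1 + x).
  val logTransform: Feature => Feature =
    f => Feature(s"log_${f.name}", math.log1p(f.value))

  // Select: Set[Feature] => Set[Feature]
  // Keeps a subset of the features, here those with a nonzero value.
  val dropZeros: Set[Feature] => Set[Feature] =
    features => features.filter(_.value != 0.0)
}
```

Because each phase is a plain function, the phases compose: under these assumptions, `extractLength andThen logTransform` is itself a function from RawData to Feature, and `dropZeros` can then be applied to a set of such results.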

4.1. Spark ML

4.2. Extracting features

4.3. Transforming features

4.4. Selecting features

4.5. Structuring feature code

4.6. Applications

4.7. Reactivities

Summary