14 Building custom ML transformers and estimators


This chapter covers

  • Creating your own transformers using Params for parameterization
  • Creating your own estimators using the companion model approach
  • Integrating custom transformers and estimators in an ML Pipeline

In this chapter, we cover how to create and use custom transformers and estimators. While the ecosystem of transformers and estimators provided by PySpark covers many common use cases, and each version brings new ones to the table, sometimes you just need to go off trail and create your own. The alternative is to cut your pipeline in half and insert a data transformation function into the mix. As the sketch after this paragraph illustrates, this basically nullifies the advantages (portability, self-documentation) of the ML pipeline that we covered in chapters 12 and 13.
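To make the "pipeline cut in half" problem concrete, here is a minimal sketch of that anti-pattern. The fill_nulls() helper, the age column, and the toy DataFrames are illustrative assumptions, not code from this chapter:

import pyspark.sql.functions as F
from pyspark.ml import Pipeline
from pyspark.ml.feature import StandardScaler, VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw_df = spark.createDataFrame([(25.0,), (None,), (31.0,)], ["age"])
new_df = spark.createDataFrame([(None,), (40.0,)], ["age"])


def fill_nulls(df, column, value):
    """Ad hoc cleanup step that lives outside the pipeline."""
    return df.withColumn(column, F.coalesce(F.col(column), F.lit(value)))


assembler = VectorAssembler(inputCols=["age"], outputCol="features")
scaler = StandardScaler(inputCol="features", outputCol="features_scaled")

# The pipeline only knows about two of the three steps: the null filling
# happens off the record, so the fitted pipeline is not self-contained.
model = Pipeline(stages=[assembler, scaler]).fit(fill_nulls(raw_df, "age", 0.0))

# Anyone scoring new data must remember to reapply fill_nulls() first.
predictions = model.transform(fill_nulls(new_df, "age", 0.0))

Because fill_nulls() is invisible to the Pipeline object, saving the fitted model does not capture it, and every consumer of the model has to reapply it by hand. A custom transformer closes that gap.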

Because of how similar transformers and estimators are, we start with in-depth coverage of the transformer and its fundamental building block, the Param. We then move on to creating estimators, focusing on how they differ from transformers. Finally, we conclude with the integration of custom transformers and estimators into an ML pipeline, paying attention to serialization. As a preview, the sketch after this paragraph shows the overall shape of a custom transformer.
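Here is a minimal sketch of the anatomy the chapter develops in sections 14.1.2 through 14.1.5: a class-level Param, a keyword-only initialization function, getter/setter pairs, and a _transform() method. The ScalarNAFiller name and its filler Param are illustrative assumptions, not necessarily the chapter's exact code:

import pyspark.sql.functions as F
from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param import Param, Params, TypeConverters
from pyspark.ml.param.shared import HasInputCol, HasOutputCol


class ScalarNAFiller(Transformer, HasInputCol, HasOutputCol):
    """Replaces null values in inputCol with a scalar, writing to outputCol.

    Illustrative sketch; the chapter builds its own version step by step.
    """

    # A Param bundles a name, a docstring, and a type converter for one knob.
    filler = Param(
        Params._dummy(),
        "filler",
        "Scalar value used to replace nulls in inputCol.",
        typeConverter=TypeConverters.toFloat,
    )

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None, filler=None):
        super().__init__()
        kwargs = self._input_kwargs  # captured by the @keyword_only decorator
        self._set(**kwargs)

    # Getter/setter pairs follow the naming convention of built-in stages.
    def getFiller(self):
        return self.getOrDefault(self.filler)

    def setFiller(self, new_filler):
        return self._set(filler=new_filler)

    def _transform(self, dataset):
        # transform() on the parent class handles optional Param overrides,
        # then delegates the actual work to this method.
        return dataset.withColumn(
            self.getOutputCol(),
            F.coalesce(F.col(self.getInputCol()), F.lit(self.getFiller())),
        )


filler = ScalarNAFiller(inputCol="age", outputCol="age_filled", filler=0.0)
filled_df = filler.transform(raw_df)  # raw_df as in the earlier sketch

A stage like this drops into a Pipeline like any built-in transformer. To make it survive pipeline serialization (the subject of section 14.3), you would additionally mix in DefaultParamsReadable and DefaultParamsWritable from pyspark.ml.util.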

14.1 Creating your own transformer

14.1.1 Designing a transformer: Thinking in terms of Params and transformation

14.1.2 Creating the Params of a transformer

14.1.3 Getters and setters: Being a nice PySpark citizen

14.1.4 Creating a custom transformer’s initialization function

14.1.5 Creating our transformation function

14.1.6 Using our transformer

14.2 Creating your own estimator

14.2.1 Designing our estimator: From model to params

14.2.2 Implementing the companion model: Creating our own Mixin

14.2.3 Creating the ExtremeValueCapper estimator

14.2.4 Trying out our custom estimator

14.3 Using our transformer and estimator in an ML pipeline

14.3.1 Dealing with multiple inputCols

14.3.2 In practice: Inserting custom components into an ML pipeline

Summary

Conclusion: Have data, am happy!