This chapter covers
- Creating your own transformers, using Params for parameterization.
- Creating your own estimators, using the companion model approach.
- Integrating custom transformers and estimators into an ML pipeline.
In this chapter, we cover how to create and use custom transformers and estimators. While the ecosystem of transformers and estimators provided by PySpark covers many frequent use cases, and each version brings new ones to the table, sometimes you just need to go off-trail and create your own. The alternative would be to cut your pipeline in half and insert a data transformation function into the mix. This basically nullifies all the advantages (portability, self-documentation) of the ML pipeline we built in chapters 12 and 13.
Because of how similar transformers and estimators are, we start with an in-depth coverage of the transformer and its fundamental building block, the Param. We then move to creating estimators, focusing on how they differ from transformers. Finally, we wrap it up by integrating custom transformers and estimators into an ML pipeline, paying attention to serialization.
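As a preview of the pattern this chapter develops, here is a minimal sketch of a custom transformer. The class name `ScalarNAFiller` and its `filler` Param are illustrative placeholders, not part of the PySpark API: the transformer fills null values in an input column with a scalar and writes the result to an output column.

```python
from pyspark.ml import Transformer
from pyspark.ml.param import Param, Params, TypeConverters
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.sql import functions as F


class ScalarNAFiller(
    Transformer, HasInputCol, HasOutputCol, DefaultParamsReadable, DefaultParamsWritable
):
    """Hypothetical transformer: fills nulls in inputCol with a scalar, writes to outputCol."""

    # A custom Param: the scalar value used to replace nulls.
    filler = Param(
        Params._dummy(),
        "filler",
        "Value used to replace null entries in the input column.",
        typeConverter=TypeConverters.toFloat,
    )

    def __init__(self, inputCol=None, outputCol=None, filler=0.0):
        super().__init__()
        self._setDefault(filler=0.0)
        if inputCol is not None:
            self._set(inputCol=inputCol)
        if outputCol is not None:
            self._set(outputCol=outputCol)
        self._set(filler=filler)

    def setFiller(self, value):
        return self._set(filler=value)

    def getFiller(self):
        return self.getOrDefault(self.filler)

    def _transform(self, dataset):
        # Replace nulls in the input column and expose the result as the output column.
        return dataset.withColumn(
            self.getOutputCol(),
            F.coalesce(F.col(self.getInputCol()), F.lit(self.getFiller())),
        )
```

Because this class honors the transformer contract (Params for configuration plus a `_transform()` method), an instance of it can be dropped into a `Pipeline` alongside stock stages, and the `DefaultParamsReadable`/`DefaultParamsWritable` mixins give it basic save/load support, which is the kind of integration and serialization the last section of this chapter covers.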