This chapter covers
- Creating your own transformers, using Params for parameterization.
- Creating your own estimators, using the companion model approach.
- Integrating custom transformers and estimators into an ML pipeline.
In this chapter, we cover how to create and use custom transformers and estimators. While the ecosystem of transformers and estimators provided by PySpark covers many frequent use cases, and each version brings new ones to the table, sometimes you just need to go off-trail and create your own. The alternative would be to cut your pipeline in half and insert a data transformation function into the mix. This basically nullifies all the advantages (portability, self-documentation) of the ML pipeline we built in chapters 12 and 13.
Because of how similar transformers and estimators are, we start with an in-depth coverage of the transformer and its fundamental building block, the Param. We then move to creating estimators, focusing on how they differ from transformers. Finally, we wrap it up by integrating custom transformers and estimators into an ML pipeline, paying attention to serialization.
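As a preview of the pattern this chapter develops, here is a minimal sketch of a custom transformer. The class name `ScalarNAFiller` and its `filler` Param are illustrative placeholders, not part of the PySpark API: the transformer fills null values in an input column with a scalar and writes the result to an output column.

```python
from pyspark.ml import Transformer
from pyspark.ml.param import Param, Params, TypeConverters
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.sql import functions as F


class ScalarNAFiller(
    Transformer, HasInputCol, HasOutputCol, DefaultParamsReadable, DefaultParamsWritable
):
    """Hypothetical transformer: fills nulls in inputCol with a scalar, writes to outputCol."""

    # A custom Param: the scalar value used to replace nulls.
    filler = Param(
        Params._dummy(),
        "filler",
        "Value used to replace null entries in the input column.",
        typeConverter=TypeConverters.toFloat,
    )

    def __init__(self, inputCol=None, outputCol=None, filler=0.0):
        super().__init__()
        self._setDefault(filler=0.0)
        if inputCol is not None:
            self._set(inputCol=inputCol)
        if outputCol is not None:
            self._set(outputCol=outputCol)
        self._set(filler=filler)

    def setFiller(self, value):
        return self._set(filler=value)

    def getFiller(self):
        return self.getOrDefault(self.filler)

    def _transform(self, dataset):
        # Replace nulls in the input column and expose the result as the output column.
        return dataset.withColumn(
            self.getOutputCol(),
            F.coalesce(F.col(self.getInputCol()), F.lit(self.getFiller())),
        )
```

Because this class honors the transformer contract (Params for configuration plus a `_transform()` method), an instance of it can be dropped into a `Pipeline` alongside stock stages, and the `DefaultParamsReadable`/`DefaultParamsWritable` mixins give it basic save/load support, which is the kind of integration and serialization the last section of this chapter covers.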