In this chapter, we cover how to create and use custom transformers and estimators. While the ecosystem of transformers and estimators provided by PySpark covers many common use cases, and each new version brings more to the table, sometimes you need to go off the beaten path and create your own. The alternative is to cut your pipeline in half and insert a plain data transformation function into the mix, which largely nullifies the advantages (portability, self-documentation) of the ML pipeline that we covered in chapters 12 and 13.
Because transformers and estimators are so similar, we start with in-depth coverage of the transformer and its fundamental building block, the Param. We then move on to creating estimators, focusing on how they differ from transformers. Finally, we conclude with integrating custom transformers and estimators into an ML pipeline, paying particular attention to serialization.