Chapter 14

This chapter covers

  • Extending Apache Spark with user-defined functions (UDFs).
  • Registering a UDF.
  • Calling a UDF with the dataframe API and Spark SQL.
  • Using UDFs for data quality within Spark.
  • Understanding the constraints linked to UDFs.

Whether you have patiently read the first thirteen chapters of this book or hopped from chapter to chapter using a helicopter reading approach, you are by now convinced that Spark is great, but… is it extensible? You may be asking, “How can I bring my existing libraries into the mix? Do I have to use the dataframe API and Spark SQL to implement every transformation I want?”

From the title of this chapter, you can guess that the answer to the first question is yes: Spark is extensible. The rest of the chapter answers the other questions by teaching you how to accomplish those tasks with user-defined functions (UDFs).

You’ll first look, in section 14.1, at the architecture involving UDFs and the impact UDFs have on your deployment.

Then, in section 14.2, you’ll dive into using a UDF to solve a real problem: finding out when the libraries in South Dublin (Ireland) are open. You will register, call, and implement the UDF. This section also contains a reminder about being a good plumber, as I am sure you are.
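
To give you a feel for the register-then-call pattern before we get there, here is a minimal sketch. It is not the chapter’s actual example: the UDF name (isOpen), the column name (opening_hours), the naive implementation, and the dataset path are placeholders I made up for illustration.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.api.java.UDF1;
    import org.apache.spark.sql.types.DataTypes;

    import static org.apache.spark.sql.functions.callUDF;
    import static org.apache.spark.sql.functions.col;

    public class IsOpenSketch {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("UDF sketch")
            .master("local[*]")
            .getOrCreate();

        // Register a UDF under the name "isOpen". The lambda is the
        // implementation; this naive version only checks that the schedule
        // string mentions Monday. A real one would parse the opening hours.
        spark.udf().register(
            "isOpen",
            (UDF1<String, Boolean>) hours ->
                hours != null && hours.contains("Mon"),
            DataTypes.BooleanType);

        Dataset<Row> df = spark.read().format("csv")
            .option("header", true)
            .load("data/libraries.csv"); // hypothetical dataset

        // Calling the UDF through the dataframe API...
        df.filter(callUDF("isOpen", col("opening_hours"))).show();

        // ...and through Spark SQL, after exposing the dataframe as a view.
        df.createOrReplaceTempView("libraries");
        spark.sql("SELECT * FROM libraries WHERE isOpen(opening_hours)").show();
      }
    }

Note that the function is registered once, under a name, and that single registration is what makes it callable from both the dataframe API and Spark SQL.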

UDFs are an excellent choice for performing data quality checks, whether you build them yourself or rely on external resources. In section 14.3, I teach you how to use UDFs for data quality.
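
As a taste of that use, here is a hedged sketch of a data-quality UDF, reusing the session, imports, and dataframe from the sketch above. The rule itself (a rating must fall between 0 and 5) and the rating column are made up for illustration.

    // A hypothetical quality rule: a rating is valid only within 0..5.
    spark.udf().register(
        "isValidRating",
        (UDF1<Integer, Boolean>) rating ->
            rating != null && rating >= 0 && rating <= 5,
        DataTypes.BooleanType);

    // Keep only the rows that pass the check; the cast guards against the
    // CSV reader inferring the column as a string. Invalid rows could just
    // as easily be routed to a quarantine dataframe for inspection.
    Dataset<Row> clean =
        df.filter(callUDF("isValidRating", col("rating").cast("int")));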