14. Extending transformations with user-defined functions


This chapter covers

  • Extending Spark with user-defined functions
  • Registering a UDF
  • Calling a UDF with the dataframe API and Spark SQL
  • Using UDFs for data quality within Spark
  • Understanding the constraints linked to UDFs

Whether you have patiently read the first 13 chapters of this book or hopped from chapter to chapter using a helicopter reading approach, you are by now convinced that Spark is great. But is Spark extensible? You may be asking, “How can I bring my existing libraries into the mix? Do I have to implement all the transformations I want using only the dataframe API and Spark SQL?”

From the title of this chapter, you can guess that the answer to the first question is yes: Spark is extensible. The rest of the chapter answers the other questions by teaching you how to use user-defined functions (UDFs) to accomplish those tasks. Here is how the chapter unfolds.

You’ll first see how Spark is extensible by looking at the architecture behind UDFs and at the impact UDFs have on your deployment.
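Before diving in, here is a minimal sketch of the whole round trip: registering a UDF and then calling it from both the dataframe API and Spark SQL. It is a preview under simple assumptions, not the example developed later in the chapter; the UDF name plusOne, the numbers view, the toy dataset, and the local master are all placeholders for this sketch.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

public class UdfPreviewApp {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("UDF preview")
        .master("local[*]")   // local master, for this sketch only
        .getOrCreate();

    // Step 1: register the UDF under a name that Spark SQL can see.
    // "plusOne" is a placeholder name for this preview.
    spark.udf().register(
        "plusOne",
        (UDF1<Long, Long>) x -> x + 1,
        DataTypes.LongType);

    // A tiny dataframe with a single "id" column holding 0..4.
    Dataset<Row> df = spark.range(5).toDF("id");

    // Step 2a: call the UDF through the dataframe API.
    df.withColumn("next", callUDF("plusOne", col("id"))).show();

    // Step 2b: call the same UDF through Spark SQL.
    df.createOrReplaceTempView("numbers");
    spark.sql("SELECT id, plusOne(id) AS next FROM numbers").show();

    spark.stop();
  }
}

Note that the registration step is what bridges the two worlds: once a UDF is registered by name, the dataframe API and Spark SQL both resolve that name to the same function. The rest of the chapter unpacks each of these steps in turn.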

14.1 Extending Apache Spark

14.2 Registering and calling a UDF

14.2.1 Registering the UDF with Spark

14.2.2 Using the UDF with the dataframe API

14.2.3 Manipulating UDFs with SQL

14.2.4 Implementing the UDF

14.2.5 Writing the service itself

14.3 Using UDFs to ensure a high level of data quality

14.4 Considering UDFs’ constraints

Summary