14. Extending transformations with user-defined functions


This chapter covers

  • Extending Spark with user-defined functions
  • Registering a UDF
  • Calling a UDF with the dataframe API and Spark SQL
  • Using UDFs for data quality within Spark
  • Understanding the constraints linked to UDFs

Whether you have patiently read the first 13 chapters of this book or hopped from chapter to chapter using a helicopter reading approach, you are by now convinced that Spark is great. But is Spark extensible? You may be asking, “How can I bring my existing libraries into the mix? Do I have to implement all the transformations I want using only the dataframe API and Spark SQL?”

From the title of this chapter, you can guess that the answer to the first question is yes: Spark is extensible. The rest of the chapter answers the other questions by teaching you how to use user-defined functions (UDFs) to accomplish those tasks. Here is how the chapter unfolds.

You’ll first see how Spark is extensible by looking at the architecture behind UDFs and at the impact UDFs have on your deployment.
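Before diving in, here is a minimal sketch of the whole round trip: registering a UDF and then calling it from both the dataframe API and Spark SQL. It is a preview under simple assumptions, not the example developed later in the chapter; the UDF name plusOne, the numbers view, the toy dataset, and the local master are all placeholders for this sketch.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

public class UdfPreviewApp {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("UDF preview")
        .master("local[*]")   // local master, for this sketch only
        .getOrCreate();

    // Step 1: register the UDF under a name that Spark SQL can see.
    // "plusOne" is a placeholder name for this preview.
    spark.udf().register(
        "plusOne",
        (UDF1<Long, Long>) x -> x + 1,
        DataTypes.LongType);

    // A tiny dataframe with a single "id" column holding 0..4.
    Dataset<Row> df = spark.range(5).toDF("id");

    // Step 2a: call the UDF through the dataframe API.
    df.withColumn("next", callUDF("plusOne", col("id"))).show();

    // Step 2b: call the same UDF through Spark SQL.
    df.createOrReplaceTempView("numbers");
    spark.sql("SELECT id, plusOne(id) AS next FROM numbers").show();

    spark.stop();
  }
}

Note that the registration step is what bridges the two worlds: once a UDF is registered by name, the dataframe API and Spark SQL both resolve that name to the same function. The rest of the chapter unpacks each of these steps in turn.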

14.1 Extending Apache Spark

14.2 Registering and calling a UDF

14.2.1 Registering the UDF with Spark

14.2.2 Using the UDF with the dataframe API

14.2.3 Manipulating UDFs with SQL

14.2.4 Implementing the UDF

14.2.5 Writing the service itself

14.3 Using UDFs to ensure a high level of data quality

14.4 Considering UDFs’ constraints

Summary