Chapter 14

This chapter covers

  • Extending Apache Spark with user-defined functions (UDFs).
  • Registering a UDF.
  • Calling a UDF with the dataframe API and Spark SQL.
  • Using UDFs for data quality within Spark.
  • Understanding the constraints linked to UDFs.

Whether you have patiently read the first thirteen chapters of this book or hopped from chapter to chapter using a helicopter reading approach, you are by now convinced that Spark is great, but… is it extensible? You may be asking, “How can I bring my existing libraries into the mix? Do I have to use the dataframe API and Spark SQL to implement every transformation I want?”

From the title of this chapter, you can guess that the answer to the first question is yes: Spark is extensible. The rest of the chapter answers the other questions by teaching you how to accomplish those tasks with user-defined functions (UDFs).

You’ll first look, in section 14.1, at the architecture involving UDFs and the impact UDFs have on your deployment.

Then, in section 14.2, you’ll dive into using a UDF to solve a real problem: finding out when the libraries in South Dublin (Ireland) are open. You will register, call, and implement the UDF. This section also contains a reminder about being a good plumber, as I am sure you are.
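
To give you a feel for the register-then-call pattern before we get there, here is a minimal sketch. It is not the chapter’s actual example: the UDF name (isOpen), the column name (opening_hours), the naive implementation, and the dataset path are placeholders I made up for illustration.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.api.java.UDF1;
    import org.apache.spark.sql.types.DataTypes;

    import static org.apache.spark.sql.functions.callUDF;
    import static org.apache.spark.sql.functions.col;

    public class IsOpenSketch {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("UDF sketch")
            .master("local[*]")
            .getOrCreate();

        // Register a UDF under the name "isOpen". The lambda is the
        // implementation; this naive version only checks that the schedule
        // string mentions Monday. A real one would parse the opening hours.
        spark.udf().register(
            "isOpen",
            (UDF1<String, Boolean>) hours ->
                hours != null && hours.contains("Mon"),
            DataTypes.BooleanType);

        Dataset<Row> df = spark.read().format("csv")
            .option("header", true)
            .load("data/libraries.csv"); // hypothetical dataset

        // Calling the UDF through the dataframe API...
        df.filter(callUDF("isOpen", col("opening_hours"))).show();

        // ...and through Spark SQL, after exposing the dataframe as a view.
        df.createOrReplaceTempView("libraries");
        spark.sql("SELECT * FROM libraries WHERE isOpen(opening_hours)").show();
      }
    }

Note that the function is registered once, under a name, and that single registration is what makes it callable from both the dataframe API and Spark SQL.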

UDFs are an excellent choice for performing data quality checks, whether you build them yourself or rely on external resources. In section 14.3, I teach you how to use UDFs for data quality.
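
As a taste of that use, here is a hedged sketch of a data-quality UDF, reusing the session, imports, and dataframe from the sketch above. The rule itself (a rating must fall between 0 and 5) and the rating column are made up for illustration.

    // A hypothetical quality rule: a rating is valid only within 0..5.
    spark.udf().register(
        "isValidRating",
        (UDF1<Integer, Boolean>) rating ->
            rating != null && rating >= 0 && rating <= 5,
        DataTypes.BooleanType);

    // Keep only the rows that pass the check; the cast guards against the
    // CSV reader inferring the column as a string. Invalid rows could just
    // as easily be routed to a quarantine dataframe for inspection.
    Dataset<Row> clean =
        df.filter(callUDF("isValidRating", col("rating").cast("int")));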