8. Ingestion from databases


This chapter covers

  • Ingesting data from relational databases
  • Understanding the role of dialects in communication between Spark and databases
  • Building advanced queries in Spark to run against the database prior to ingestion
  • Understanding advanced communication with databases
  • Ingesting from Elasticsearch

In the big data and enterprise context, relational databases are often the source of the data on which you will perform analytics. It therefore makes sense to understand how to extract data from those databases, whether by reading whole tables or by issuing SQL SELECT statements.
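To give you a feel for how little code this takes, here is a minimal sketch of a whole-table read over JDBC. It assumes a MySQL database named sakila running locally with a table named actor; the URL, table name, and credentials are placeholders, and the MySQL JDBC driver must be on the classpath:

import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SimpleJdbcIngestionApp {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("Ingestion from a relational database")
        .master("local[*]")
        .getOrCreate();

    // Connection properties; user and password are placeholders
    Properties props = new Properties();
    props.put("user", "root");
    props.put("password", "password");

    // Ingest the whole actor table in one operation
    Dataset<Row> df = spark.read().jdbc(
        "jdbc:mysql://localhost:3306/sakila",  // hypothetical URL
        "actor",                               // hypothetical table
        props);

    df.show(5);
    spark.stop();
  }
}

From Spark's point of view, the result is an ordinary dataframe; everything you know about transforming dataframes applies to it.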

In this chapter, you’ll learn several ways to ingest data from those relational databases: either ingesting the full table at once or asking the database to perform some operations before the ingestion. Those operations, such as filtering, joining, or aggregating data, run at the database level to minimize data transfer.
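One common way to push such work to the database is to pass an aliased SQL subquery in place of the table name. The following sketch reuses the SparkSession and connection properties from the previous example; the query, table, and column names are placeholders:

// The subquery must be aliased so it can stand in for a table name
String pushdownQuery =
    "(SELECT title, rental_rate FROM film"
    + " WHERE rental_rate > 1) AS film_subquery";

// The database evaluates the WHERE clause; Spark receives only
// the filtered rows
Dataset<Row> filtered = spark.read().jdbc(
    "jdbc:mysql://localhost:3306/sakila",
    pushdownQuery,
    props);

filtered.show(5);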

You will see in this chapter which databases are supported by Spark. When you work with a database that Spark does not support, a custom dialect is required. A dialect is a way to inform Spark of how to communicate with the database. Spark comes with a few dialects and, in most cases, you won’t even need to think about them. However, for those special situations, you’ll learn how to build one.
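To give you an idea of what building one involves, here is a minimal sketch of a custom dialect for a hypothetical database whose JDBC URLs start with jdbc:acme:. The class name, URL prefix, and quoting rule are all assumptions; a real dialect typically also overrides type-mapping methods such as getCatalystType:

import org.apache.spark.sql.jdbc.JdbcDialect;
import org.apache.spark.sql.jdbc.JdbcDialects;

public class AcmeDialect extends JdbcDialect {
  @Override
  public boolean canHandle(String url) {
    // Tells Spark this dialect applies to Acme connection URLs
    return url.startsWith("jdbc:acme:");
  }

  @Override
  public String quoteIdentifier(String colName) {
    // Assumes Acme quotes identifiers with double quotes
    return "\"" + colName + "\"";
  }
}

The dialect must be registered once, before any read, by calling JdbcDialects.registerDialect(new AcmeDialect()); from then on, Spark uses it for every connection whose URL matches canHandle().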

Figure 8.1 This chapter focuses on ingestion from databases, whether the database is directly supported by Spark or requires a custom dialect.

8.1 Ingestion from relational databases

8.1.1 Database connection checklist

8.1.2 Understanding the data used in the examples

8.1.3 Desired output

8.1.4 Code

8.1.5 Alternative code

8.2 The role of the dialect

8.2.1 What is a dialect, anyway?

8.2.2 JDBC dialects provided with Spark

8.2.3 Building your own dialect

8.3 Advanced queries and ingestion

8.3.1 Filtering by using a WHERE clause

8.3.2 Joining data in the database

8.3.3 Performing ingestion and partitioning

8.3.4 Summary of advanced features
