This is an excerpt from Manning's book Spark in Action, Second Edition.

In one of my projects, the team developed a join between two dataframes in which the second dataframe became a nested document, stored as a column of the first dataframe. When we needed to add a third dataframe, the team considered developing a method that would take three dataframes (a master and two subdocuments), and so on for each additional dataframe. Because the operation was fairly heavy, the team wanted to minimize the number of steps. Rather than developing a method that took three dataframes as parameters, the team applied the two-dataframe method several times: each step was simply added to the DAG, and at the end, Catalyst optimized the whole plan, making the code “lighter,” more readable, and cheaper to maintain.
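
To make that pattern concrete, here is a minimal sketch of the chaining approach. The helper name (nestJoin), the dataframe and column names, and the use of a struct column for nesting are assumptions for illustration, not the project's actual code; the point is that each call merely extends the logical plan, which Catalyst optimizes as a whole when an action finally runs.

import static org.apache.spark.sql.functions.struct;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class NestedJoinSketch {

  // Hypothetical helper: left-joins `sub` onto `master` via the shared
  // column `key`, nesting all of the subdocument's columns in a single
  // struct column named `asCol`.
  static Dataset<Row> nestJoin(
      Dataset<Row> master, Dataset<Row> sub, String key, String asCol) {
    Dataset<Row> nested =
        sub.select(sub.col(key), struct(sub.col("*")).alias(asCol));
    return master
        .join(nested, master.col(key).equalTo(nested.col(key)), "left")
        .drop(nested.col(key));
  }

  static Dataset<Row> buildDocument(
      Dataset<Row> masterDf, Dataset<Row> subDf1, Dataset<Row> subDf2) {
    // Two successive calls instead of one three-dataframe method: each
    // call only adds steps to the DAG, and Catalyst optimizes the combined
    // plan when an action (show, write, count, ...) eventually executes.
    return nestJoin(
        nestJoin(masterDf, subDf1, "id", "sub1"), subDf2, "id", "sub2");
  }
}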

Figure 8.5 Spark ingests the data stored in the MySQL database after the database server performs the joins between the three tables.
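
The figure describes a join that runs on the database side: because Spark's JDBC data source accepts a subquery in its dbtable option, MySQL can perform the joins between the three tables and Spark ingests only the joined result. The following is a minimal sketch of that approach; the connection URL, credentials, and table and column names are made-up assumptions for illustration.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DatabaseSideJoinSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("MySQL-side join").master("local[*]").getOrCreate();

    // The three-table join is expressed in SQL and executed by MySQL;
    // Spark only reads the result. Wrapping the query in parentheses and
    // aliasing it lets it stand in for a table name.
    String joined = "(SELECT c.name, o.total, p.label "
        + "FROM customer c "
        + "JOIN orders o ON o.customer_id = c.id "
        + "JOIN product p ON p.id = o.product_id) AS joined";

    Dataset<Row> df = spark.read()
        .format("jdbc")
        .option("url", "jdbc:mysql://localhost:3306/shop")
        .option("dbtable", joined)
        .option("user", "spark_user")   // assumed credentials
        .option("password", "***")
        .load();

    df.show();
    spark.stop();
  }
}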

One of the best ideas of relational databases is the join: exploiting the relations between tables. The idea of building relations and joining data is not really new (the relational model that underpins it was introduced in 1970), but it has kept evolving. Joins are an integral part of the Spark API, just as they are in any relational database, and their support enables relations between dataframes.

The following listing shows the code to perform the join.

Listing 12.14 Joining the higher education dataset with the mapping dataset

Dataset<Row> institPerCountyDf = higherEdDf.join(
    countyZipDf,                                             // right dataframe
    higherEdDf.col("zip").equalTo(countyZipDf.col("zip")),   // join condition
    "inner");                                                // join type

That’s it! I admit there was a lot of preparation to get to this point, but the join itself remains pretty simple. The join() method has several forms (see appendix M, which contains a full reference on the join operation, as well as http://mng.bz/rP9Z). There are also quite a few join types, summarized in table 12.9. Lab #940 in this chapter’s repository executes all possible joins on a couple of dataframes.
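
As a quick illustration of two of those other forms, reusing the dataframes from listing 12.14 (a sketch; the variable names on the left are mine):

// Shorter form: an inner equi-join on a column that exists, under the
// same name, in both dataframes; the result contains zip only once.
Dataset<Row> viaUsingColumn = higherEdDf.join(countyZipDf, "zip");

// A Cartesian product has its own method and takes no join condition.
Dataset<Row> cartesian = higherEdDf.crossJoin(countyZipDf);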

Table 12.9 Join types in Spark

Inner (no alias)
Default type of join. Selects all rows from the left dataset and the right dataset where the join condition is met.

Outer (aliases: full, fullouter, full_outer)
Selects data from both datasets based on the join condition, and adds null when data is missing from the left or the right.

Left (aliases: leftouter, left_outer)
Selects all rows from the left dataset, as well as all rows from the right dataset for which the join condition is met.

Right (aliases: rightouter, right_outer)
Selects all rows from the right dataset, as well as all rows from the left dataset for which the join condition is met.

Left-semi (alias: left_semi)
Selects rows from only the left dataset for which the join condition is met.

Left-anti (alias: left_anti)
Selects rows from only the left dataset for which the join condition is not met.

Cross (no alias)
Performs a Cartesian join of both datasets. As a reminder, a Cartesian join (also sometimes called a Cartesian product) joins every row of one table to every row of the other table. For example, if table institution with 100 rows is joined with table subject with 1,000 rows, the cross join returns 100,000 rows.
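
To see those behaviors side by side, the following self-contained sketch (in the spirit of lab #940, though not its actual code) runs each join type on two tiny dataframes:

import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class AllJoinTypesSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("All join types").master("local[*]").getOrCreate();

    StructType schema = DataTypes.createStructType(Arrays.asList(
        DataTypes.createStructField("id", DataTypes.IntegerType, false),
        DataTypes.createStructField("value", DataTypes.StringType, false)));

    Dataset<Row> left = spark.createDataFrame(Arrays.asList(
        RowFactory.create(1, "a"),
        RowFactory.create(2, "b")), schema);
    Dataset<Row> right = spark.createDataFrame(Arrays.asList(
        RowFactory.create(2, "x"),
        RowFactory.create(3, "y")), schema);

    // The condition-based join types from table 12.9.
    for (String type : new String[] {
        "inner", "outer", "left", "right", "left_semi", "left_anti"}) {
      System.out.println("Join type: " + type);
      left.join(right, left.col("id").equalTo(right.col("id")), type).show();
    }

    // The cross join takes no condition: 2 rows x 2 rows = 4 rows here.
    System.out.println("Join type: cross");
    left.crossJoin(right).show();

    spark.stop();
  }
}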
