
This is an excerpt from Manning's book Spark in Action, Second Edition.
In one of my projects, the team needed, and developed, a join between two dataframes in which the second dataframe ended up as a nested document, stored as a column of the first dataframe. When we needed to add a third dataframe, the team considered developing a method that would take three dataframes (a master and two subdocuments), and so on. Because the operation was fairly heavy, the team wanted to minimize the number of steps. Rather than developing a method taking three dataframes as parameters, the team applied the two-dataframe method several times: each step was simply added to the DAG. At the end, Catalyst optimized the whole plan, making the code lighter, more readable, and cheaper to maintain.
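This pattern is straightforward to express with the public dataframe API. What follows is a minimal sketch, not the book's actual code: the nestedJoin() helper, the id/masterId column names, and the JSON file paths are all assumptions made up for illustration. The helper joins a child dataframe onto a master and nests the child's columns into a single struct column; it is then simply applied twice.

import java.util.Arrays;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.struct;

public class NestedJoinSketch {

  // Hypothetical helper: joins child onto master and nests all of the
  // child's columns into one struct column named nestedColName.
  // Assumes master and child have no column names in common.
  static Dataset<Row> nestedJoin(Dataset<Row> master, Dataset<Row> child,
                                 Column joinCondition, String nestedColName) {
    Column[] childCols = Arrays.stream(child.columns())
        .map(child::col)
        .toArray(Column[]::new);
    return master
        .join(child, joinCondition, "left")
        .withColumn(nestedColName, struct(childCols))
        .drop(child.columns());   // keep only the master's columns + the struct
  }

  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("nested-join-sketch").master("local[*]").getOrCreate();
    Dataset<Row> masterDf = spark.read().json("master.json");   // hypothetical files
    Dataset<Row> childADf = spark.read().json("childA.json");
    Dataset<Row> childBDf = spark.read().json("childB.json");

    // The same two-dataframe method applied twice: each call only adds
    // steps to the DAG; Catalyst optimizes the whole plan when an action runs.
    Dataset<Row> step1 = nestedJoin(masterDf, childADf,
        masterDf.col("id").equalTo(childADf.col("masterId")), "childA");
    Dataset<Row> result = nestedJoin(step1, childBDf,
        step1.col("id").equalTo(childBDf.col("masterId")), "childB");

    result.show();
    spark.stop();
  }
}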
One of the best ideas in relational databases is the join: joins exploit the relations between tables. The idea of building relations and joining the data is not really new (it dates back to the early 1970s) but has evolved. Joins are an integral part of the Spark API, as you would expect from any relational database, and their support enables relations between dataframes.
Listing 12.14 Joining the higher education dataset with the mapping dataset
Dataset<Row> institPerCountyDf = higherEdDf.join(             #1
    countyZipDf,                                              #2
    higherEdDf.col("zip").equalTo(countyZipDf.col("zip")),    #3
    "inner");                                                 #4

That's it! I admit there was a lot of preparation to come to this point, but the join itself remains pretty simple. The join() method has several forms (see appendix M and http://mng.bz/rP9Z). There are also quite a few join types; they are summarized in table 12.9. Lab #940 in this chapter's repository executes all possible joins on a couple of dataframes. A full reference on the join operation is available in appendix M.

Table 12.9 Join types in Spark
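To make the other forms and the join types concrete, here is a short sketch, not from the book, that reuses higherEdDf and countyZipDf from listing 12.14; the join-type strings are the standard values accepted by join().

// Two other common forms of join():
Dataset<Row> usingColumn = higherEdDf.join(countyZipDf, "zip");   // equi-join on a column present in both dataframes
Dataset<Row> defaultInner = higherEdDf.join(countyZipDf,
    higherEdDf.col("zip").equalTo(countyZipDf.col("zip")));       // no join type given: defaults to inner

// In the spirit of lab #940: the same join executed with several join types.
String[] joinTypes = {"inner", "left", "right", "full", "left_semi", "left_anti"};
for (String joinType : joinTypes) {
  higherEdDf
      .join(countyZipDf,
          higherEdDf.col("zip").equalTo(countyZipDf.col("zip")),
          joinType)
      .show(5);
}

Note that left_semi and left_anti return only columns from the left dataframe, which is why the output schema differs across iterations.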