
This is an excerpt from Manning's book Data Analysis with Python and PySpark MEAP V07.
The predicates of a PySpark join are rules between columns of the left and right data frames. A join is performed record-wise: each record of the left data frame is compared (via the predicates) to each record of the right data frame. If the predicates return `True`, the two records are a match; if they return `False`, they are not.

The best way to illustrate a predicate is to create a simple example and explore the results. For our two data frames, we will build the predicate `logs["LogServiceID"] == log_identifier["LogServiceID"]`. In plain English, we can translate it as follows: match the records of `logs` and `log_identifier` that have the same value in their `LogServiceID` column.
I’ve taken a small sample of the data in both data frames and illustrated the result of applying the predicate in figure 5.1. There are two important points to highlight:
- If one record in the left table resolves the predicate with more than one record in the right table (or vice versa), this record will be duplicated in the joined table.
- If one record in the left or in the right table does not resolve the predicate with any record in the other table, it will not be present in the resulting table, unless the join method specifies a protocol for failed predicates.
Figure 5.1. Our predicate is applied to a sample of our two tables. 3590 on the left table resolves the predicate twice, while 3417 on the left and 3883 on the right do not resolve it at all.
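To make figure 5.1 concrete, here is a minimal sketch (mine, not from the book) that rebuilds the sample with only the `LogServiceID` column and applies our predicate with the default join method:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy versions of the two tables: just the LogServiceID column,
# with the same sample values as figure 5.1.
logs = spark.createDataFrame([[3590], [3417]], ["LogServiceID"])
log_identifier = spark.createDataFrame(
    [[3590], [3590], [3883]], ["LogServiceID"]
)

logs.join(
    log_identifier,
    on=logs["LogServiceID"] == log_identifier["LogServiceID"],
).show()
# 3590 on the left resolves the predicate with two records on the right,
# so it appears twice; 3417 and 3883 match nothing, so the default
# (inner) join drops them.
# +------------+------------+
# |LogServiceID|LogServiceID|
# +------------+------------+
# |        3590|        3590|
# |        3590|        3590|
# +------------+------------+
```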
If you have multiple `and` predicates, such as `(left["col1"] == right["colA"]) & (left["col2"] > right["colB"]) & (left["col3"] != right["colC"])`, you can put them into a list, such as `[left["col1"] == right["colA"], left["col2"] > right["colB"], left["col3"] != right["colC"]]`. This makes your intent more explicit and avoids counting parentheses for long chains of conditions.
Finally, if you are performing an "equi-join," where you are testing for equality between identically named columns, you can simply pass the name of the column, as a string or a list of strings, as the predicate. In our case, it means that our predicate can simply be `"LogServiceID"`. This is what I put in listing 5.4.

Listing 5.4. A bare-bones recipe for a join in PySpark, with the left and right tables filled in, as well as the predicate.

```python
logs.join(
    log_identifier,
    on="LogServiceID",
    how=[METHOD],
)
```

The join method influences how you structure predicates, so listing 5.9 revisits the whole join operation once we are done with the ingredient-by-ingredient approach. The last parameter is `how`, which completes our join operation.
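As a sketch of the finished recipe, here is listing 5.4 with the `[METHOD]` placeholder filled in. Using `"inner"` is my assumption for illustration (it is also PySpark's default); the chapter covers the other methods next:

```python
# Equi-join on LogServiceID; "inner" keeps only matched records.
joined = logs.join(
    log_identifier,
    on="LogServiceID",
    how="inner",
)
```

Note that because the predicate here is a string, the resulting data frame keeps a single `LogServiceID` column rather than one from each side.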