
This is an excerpt from Manning's book Data Analysis with Python and PySpark MEAP V07.

The predicates of a PySpark join are rules applied between columns of the left and right data frames. A join is performed record-wise: each record of the left data frame is compared (via the predicates) to each record of the right data frame. If the predicates return True, the records match; if they return False, they don't.

The best way to illustrate a predicate is to create a simple example and explore the results. For our two data frames, we will build the predicate logs["LogServiceID"] == log_identifier["LogServiceID"]. In plain English: match the records of logs and log_identifier that have the same value in their LogServiceID column.

I’ve taken a small sample of the data in both data frames and illustrated the result of applying the predicate in figure 5.1. There are two important points to highlight:

  • If one record in the left table resolves the predicate with more than one record in the right table (or vice versa), this record will be duplicated in the joined table.
  • If one record in the left or in the right table does not resolve the predicate with any record in the other table, it will not be present in the resulting table, unless the join method specifies a protocol for failed predicates.
Figure 5.1. Our predicate applied to a sample of our two tables: 3590 on the left table resolves the predicate twice, while 3417 on the left and 3883 on the right do not resolve it.

If you have multiple and predicates (such as (left["col1"] == right["colA"]) & (left["col2"] > right["colB"]) & (left["col3"] != right["colC"])), you can put them into a list, such as [left["col1"] == right["colA"], left["col2"] > right["colB"], left["col3"] != right["colC"]]. This makes your intent more explicit and avoids counting parentheses for long chains of conditions.

Finally, if you are performing an "equi-join", where you test for equality between identically named columns, you can just specify the name of the column as a string (or a list of strings) as the predicate. In our case, this means our predicate can simply be "LogServiceID". This is what I put in listing 5.4.

Listing 5.4. A bare-bones recipe for a join in PySpark, with the left and right tables filled in, as well as the predicate.

logs.join(
    log_identifier,
    on="LogServiceID",
    how=[METHOD]
)

The join method influences how you structure predicates, so listing 5.9 revisits the whole join operation once we are done with the ingredient-by-ingredient approach. The last parameter is the how, which completes our join operation.
