concept data frame in category pyspark

This is an excerpt from Manning's book Data Analysis with Python and PySpark MEAP V07.

The data frame is a stricter version of the RDD: conceptually, you can think of it like a table, where each cell can contain one value. The data frame makes heavy use of the concept of columns: you perform operations on columns instead of on records, as you do with the RDD. Figure 2.2 provides a visual summary of the two structures.

Figure 2.2. An RDD vs. a data frame. In the RDD, each record is processed independently. With the data frame, we work with its columns, performing functions on them.
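To make the contrast concrete, here is a minimal sketch (the SparkSession bootstrapping and the toy mass column are assumptions for illustration) that computes the same sum both ways: column-wise on the data frame, and record by record on the underlying RDD.

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A toy data frame: a single column of three records.
df = spark.createDataFrame([(1.0,), (2.5,), (4.0,)], ["mass"])

# Data frame style: apply a function to the column as a whole.
df.select(F.sum("mass")).show()

# RDD style: process each record independently, then reduce.
total = df.rdd.map(lambda row: row["mass"]).reduce(lambda a, b: a + b)
print(total)  # 7.5

Both versions return 7.5; the difference is where you think. The data frame version reasons about the mass column as a whole, while the RDD version reasons about one record at a time.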

When it comes to manipulating tabular data, SQL is the reigning king. For multiple decades now, it has been the workhorse language for relational databases, and even today, learning to tame it is a worthwhile exercise. Spark acknowledges the power of SQL head-on: you can use a mature SQL API to transform data frames. On top of that, you can seamlessly blend SQL code within your Spark or PySpark program, making it easier than ever to migrate old SQL ETL jobs without reinventing the wheel.

Listing 7.1. Trying (and failing) to query a data frame SQL-style
from pyspark.sql.utils import AnalysisException  # raised when Spark SQL analysis fails

try:
    spark.sql(
        "select period, count(*) from elements where phase='liq' group by period"
    ).show(5)
except AnalysisException as e:
    print(e)

# 'Table or view not found: elements; line 1 pos 29'

Here, PySpark doesn’t make the link between the Python variable elements, which points to the data frame, and a potential table elements that Spark SQL could query. To allow a data frame to be queried via SQL, we need to register it as a table. I illustrate the process in figure 7.1: when we assign a data frame to a variable, Python points to the data frame, but Spark SQL has no visibility over the variables Python assigns.

When you want to create a table to query with Spark SQL, use the createOrReplaceTempView() method. This method takes a single string parameter: the name of the table you want to use. The transformation looks at the data frame referenced by the Python variable on which the method was applied and creates a Spark SQL reference to that same data frame. We see an example of this in the bottom half of figure 7.1, and a minimal code sketch follows.
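As a minimal sketch (the toy elements data frame below is an assumption standing in for the book's periodic-table data set), registering the data frame under the name elements makes the query from listing 7.1 succeed.

# Hypothetical stand-in for the elements data frame used in listing 7.1.
elements = spark.createDataFrame(
    [("Hydrogen", 1, "gas"), ("Bromine", 4, "liq"), ("Mercury", 6, "liq")],
    ["name", "period", "phase"],
)

# Register the data frame so Spark SQL can see it under the name "elements".
elements.createOrReplaceTempView("elements")

spark.sql(
    "select period, count(*) from elements where phase='liq' group by period"
).show(5)

The view lives for the duration of the SparkSession; if you need to remove it, spark.catalog.dropTempView("elements") does the job.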
