
This is an excerpt from Manning's book Data Analysis with Python and PySpark MEAP V07.
The data frame is a stricter version of the RDD: conceptually, you can think of it as a table, where each cell contains one value. The data frame makes heavy use of the concept of columns: you perform operations on columns instead of on records, as you do with the RDD. Figure 2.2 provides a visual summary of the two structures.
Figure 2.2. An RDD vs. a data frame. In the RDD, each record is processed independently. With the data frame, we work with its columns, performing functions on them.
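As a concrete illustration of the column-first model, here is a minimal sketch; the data frame and its column names are made up for this example, and only `SparkSession` and `pyspark.sql.functions` come from PySpark itself:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# A tiny, made-up data frame: one record per element.
df = spark.createDataFrame(
    [("hydrogen", 1), ("helium", 2)], ["name", "atomic_number"]
)

# With an RDD, we would map a function over each record; with a data
# frame, the operation targets a whole column at once.
df.select(F.upper(df.name).alias("name_upper")).show()
```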
When it comes to manipulating tabular data, SQL is the reigning king. For multiple decades now, it has been the workhorse language for relational databases, and even today, learning how to tame it is a worthwhile exercise. Spark acknowledges the power of SQL head-on: you can use a mature SQL API to transform data frames. On top of that, you can also seamlessly blend SQL code within your Spark or PySpark program, making it easier than ever to migrate those old SQL ETL jobs without reinventing the wheel.
Listing 7.1. Trying (and failing) at querying a data frame SQL-style
```python
from pyspark.sql.utils import AnalysisException

try:
    spark.sql(
        "select period, count(*) from elements where phase='liq' group by period"
    ).show(5)
except AnalysisException as e:
    print(e)  # 'Table or view not found: elements; line 1 pos 29'
```

Here, PySpark doesn't make the link between the Python variable `elements`, which points to the data frame, and a potential table `elements` that can be queried by Spark SQL. To allow a data frame to be queried via SQL, we need to register it as a table. I illustrate the process in figure 7.1: when we assign a data frame to a variable, Python points to the data frame, but Spark SQL has no visibility over the variables Python assigns. When you want to create a table to query with Spark SQL, you can use the `createOrReplaceTempView()` method. This method takes a single string parameter: the name of the table you want to use. This transformation will look at the data frame referenced by the Python variable on which the method was applied and will create a Spark SQL reference to the same data frame. We see an example of this in the bottom half of the figure.
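To see the method in action, here is a minimal sketch, assuming the `elements` data frame from listing 7.1 is already loaded and has `period` and `phase` columns:

```python
# Register the data frame as a temporary view named "elements".
# The string argument is the table name Spark SQL will resolve.
elements.createOrReplaceTempView("elements")

# The same query from listing 7.1 now succeeds, since Spark SQL
# can map the name "elements" to our data frame.
spark.sql(
    "select period, count(*) from elements where phase='liq' group by period"
).show(5)
```

Note that the view name passed to `createOrReplaceTempView()` does not have to match the Python variable name; it simply becomes the name Spark SQL knows the data frame by.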