3. The majestic role of the dataframe

This chapter covers

  • Using the dataframe
  • The essential (majestic) role of the dataframe in Spark
  • Understanding data immutability
  • Quickly debugging a dataframe’s schema
  • Understanding the lower-level storage in RDDs

In this chapter, you will learn how to use the dataframe. You’ll see why it is so important in a Spark application: it holds typed data, described by a schema, and it offers a powerful API.

As you saw in previous chapters, Spark is a marvelous distributed analytics engine. Wikipedia defines an operating system (OS) as “system software that manages computer hardware [and] software resources, and provides common services for computer programs.” In chapter 1, I even qualified Spark as an operating system, because it offers all the services needed to build applications and manage resources. To use Spark programmatically, you need to understand some of its key APIs. To perform analytics and data operations, Spark needs storage, both logical (at the application level) and physical (at the hardware level).

At the logical level, the favorite storage container is the dataframe, a data structure similar to a table in the relational database world. In this chapter, you will dig into the structure of the dataframe and learn how to use the dataframe via its API.
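To make the table analogy concrete, here is a minimal plain-Java sketch of the idea: named, typed columns (a schema) plus rows whose values must match them. It is a toy model only, not Spark's API or its implementation. The names `DataframeSketch`, `Column`, `ToyDataframe`, and the sample data are hypothetical, invented for this illustration, and the code assumes Java 16 or later for records.

```java
import java.util.ArrayList;
import java.util.List;

public class DataframeSketch {

    // One column of the schema: a name plus the Java type of its values.
    public record Column(String name, Class<?> type) {}

    // A toy table: an ordered schema and rows whose values must match it.
    public record ToyDataframe(List<Column> schema, List<List<Object>> rows) {

        public ToyDataframe {
            schema = List.copyOf(schema); // defensive, unmodifiable copy
            List<List<Object>> checked = new ArrayList<>();
            for (List<Object> row : rows) {
                if (row.size() != schema.size()) {
                    throw new IllegalArgumentException("Row width differs from schema: " + row);
                }
                for (int i = 0; i < row.size(); i++) {
                    Column column = schema.get(i);
                    if (!column.type().isInstance(row.get(i))) {
                        throw new IllegalArgumentException(
                            "Value " + row.get(i) + " is not a " + column.type().getSimpleName());
                    }
                }
                checked.add(List.copyOf(row));
            }
            rows = List.copyOf(checked);
        }

        // Describes the schema, one "name: type" line per column.
        public String describeSchema() {
            StringBuilder sb = new StringBuilder();
            for (Column column : schema) {
                sb.append(column.name()).append(": ")
                  .append(column.type().getSimpleName()).append('\n');
            }
            return sb.toString();
        }
    }

    // Hypothetical sample data, made up for this sketch.
    public static ToyDataframe sample() {
        List<Column> schema = List.of(
            new Column("title", String.class),
            new Column("pages", Integer.class));
        List<Object> first = List.of("Book A", 300);
        List<Object> second = List.of("Book B", 450);
        List<List<Object>> rows = List.of(first, second);
        return new ToyDataframe(schema, rows);
    }

    public static void main(String[] args) {
        ToyDataframe df = sample();
        System.out.print(df.describeSchema());
        System.out.println(df.rows().size() + " rows");
    }
}
```

The constructor rejects any row that does not fit the schema, which is the sense in which the data is typed. A real Spark dataframe adds distribution, a far richer type system, and a query API on top of this basic shape.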

3.1 The essential role of the dataframe in Spark

3.1.1 Organization of a dataframe

3.1.2 Immutability is not a swear word

3.2 Using dataframes through examples

3.2.1 A dataframe after a simple CSV ingestion

3.2.2 Data is stored in partitions

3.2.3 Digging in the schema

3.2.4 A dataframe after a JSON ingestion

3.2.5 Combining two dataframes

3.3 The dataframe is a Dataset<Row>

3.3.1 Reusing your POJOs

3.3.2 Creating a dataset of strings

3.3.3 Converting back and forth

3.4 Dataframe’s ancestor: the RDD