This chapter covers
In this chapter, you will learn how to use the dataframe. You’ll also learn why the dataframe is so important in a Spark application: it contains typed data through a schema and offers a powerful API.
As you saw in previous chapters, Spark is a marvelous distributed analytics engine. Wikipedia defines an operating system (OS) as “system software that manages computer hardware [and] software resources, and provides common services for computer programs.” In chapter 1, I even qualified Spark as an operating system, because it offers all the services needed to build applications and manage resources. To use Spark programmatically, you need to understand some of its key APIs. To perform analytics and data operations, Spark needs storage, both logical (at the application level) and physical (at the hardware level).
At the logical level, the favorite storage container is the dataframe, a data structure similar to a table in the relational database world. In this chapter, you will dig into the structure of the dataframe and learn how to use the dataframe via its API.
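To make the idea of "typed data through a schema" a little more concrete before digging in, here is a toy sketch in plain Python. It is not Spark's API and does not use Spark at all: `Field`, `ToyDataFrame`, and the sample `books` data are hypothetical names invented for this illustration. The sketch only shows the table-like shape described above: named, typed columns (the schema) plus rows that must conform to them, with a couple of small operations on top.

```python
from typing import Any, Callable, Dict, List, Tuple


class Field:
    """One column of the schema: a name and the Python type its values must have."""

    def __init__(self, name: str, dtype: type):
        self.name = name
        self.dtype = dtype


class ToyDataFrame:
    """Hypothetical stand-in for a dataframe: a schema plus rows that match it."""

    def __init__(self, schema: List[Field], rows: List[Tuple[Any, ...]]):
        self.schema = list(schema)
        for row in rows:
            self._check(row)
        self.rows = list(rows)

    def _check(self, row: Tuple[Any, ...]) -> None:
        # Every row must have one value per column, of the declared type.
        if len(row) != len(self.schema):
            raise ValueError("row %r does not match the schema width" % (row,))
        for field, value in zip(self.schema, row):
            if not isinstance(value, field.dtype):
                raise TypeError(
                    "column %s expects %s, got %r"
                    % (field.name, field.dtype.__name__, value)
                )

    def column_names(self) -> List[str]:
        return [field.name for field in self.schema]

    def select(self, *names: str) -> "ToyDataFrame":
        # Keep only the requested columns, in the requested order.
        all_names = self.column_names()
        indexes = [all_names.index(name) for name in names]
        schema = [self.schema[i] for i in indexes]
        rows = [tuple(row[i] for i in indexes) for row in self.rows]
        return ToyDataFrame(schema, rows)

    def filter(self, predicate: Callable[[Dict[str, Any]], bool]) -> "ToyDataFrame":
        # Keep only the rows for which the predicate returns True.
        names = self.column_names()
        kept = [row for row in self.rows if predicate(dict(zip(names, row)))]
        return ToyDataFrame(self.schema, kept)


# Made-up sample data, for illustration only.
books = ToyDataFrame(
    [Field("title", str), Field("pages", int)],
    [("Book A", 520), ("Book B", 310)],
)
long_titles = books.filter(lambda row: row["pages"] > 400).select("title")
```

A real dataframe does far more than this, in particular distributing its data across a cluster, but the contract is the same in spirit: the schema tells you what each column holds, and the API lets you transform the data without handling rows one by one.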