concept GraphFrames in category graphx

This is an excerpt from Manning's book Spark GraphX in Action.
There is some relief, though. In chapter 10 you’ll see GraphFrames, a library on GitHub that provides a subset of Neo4j’s Cypher language, together with SQL from Spark SQL, to allow for fast and convenient querying of graphs.
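To give a feel for that Cypher-like querying, here is a minimal sketch using the find() motif API combined with a SQL filter expression. It assumes Spark 2.x or later with the graphframes package on the classpath; the session settings, the sample people, and the column names name and relationship are hypothetical, chosen only for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

// Assumption: a local SparkSession; in the Spark shell one is already provided as `spark`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("motif-query-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data: three people and one status update.
val vertices = Seq(
  (1L, "Ann"),
  (2L, "Bill"),
  (3L, "Charles"),
  (4L, "Went to gym this morning")
).toDF("id", "name")

val edges = Seq(
  (1L, 2L, "is-friends-with"),
  (2L, 3L, "is-friends-with"),
  (3L, 4L, "wrote-status")
).toDF("src", "dst", "relationship")

val g = GraphFrame(vertices, edges)

// Cypher-like motif: every pair of vertices a, b joined by an edge e,
// narrowed with an ordinary Spark SQL filter expression.
val friendships = g.find("(a)-[e]->(b)")
  .filter("e.relationship = 'is-friends-with'")

friendships.select("a.name", "b.name").show()
```

The motif string supplies the graph pattern, and the result is an ordinary DataFrame, so any further narrowing or projection is plain Spark SQL.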
GraphFrames makes use of the Spark SQL component of Spark and its DataFrames API. DataFrames offer much better performance than the RDDs that GraphX uses because of two optimization layers that Spark SQL provides, known as Catalyst and Tungsten. Catalyst is the original AMPLab name of Spark SQL, but now refers to the database-style query plan optimizer part of Spark SQL. Tungsten is another, newer layer introduced in Spark 1.4 that speeds up memory access by doing direct, C++-style memory access that bypasses the JVM’s managed heap, using the direct memory API known as sun.misc.Unsafe.
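One way to watch Catalyst at work is to ask a DataFrame query to print its plans. The following is a small sketch, assuming Spark 2.x or later; the session settings and the sample people data are hypothetical and only serve to give the optimizer something to plan.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local SparkSession; in the Spark shell one is already provided as `spark`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("catalyst-explain-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data.
val people = Seq(
  (1L, "Ann", 34),
  (2L, "Bill", 36),
  (3L, "Charles", 30)
).toDF("id", "name", "age")

// A simple filter-and-project query for Catalyst to optimize.
val query = people.filter($"age" > 30).select("name")

// Prints the parsed, analyzed, and optimized logical plans and the physical plan.
query.explain(true)
```

Because a GraphFrame is built from DataFrames, queries against its vertices and edges go through this same planning step.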
For a deeper dive into Spark SQL, see Spark in Action by Petar Zečević and Marko Bonaći (Manning, 2016). For those familiar with Python, GraphFrames exposes a Python API right from the beginning, but as with using Python Spark SQL, knowing SQL is still required.
In this version of GraphFrames, for MapReduce-style operations there’s an AggregateMessagesBuilder class, which serves a similar purpose to GraphX’s aggregateMessages(), but there’s no Pregel API. GraphFrames’s strength is in querying graphs rather than in the massively parallel algorithms that are GraphX’s forte, but it would require benchmarking to determine which is faster for a given application. GraphX has the optimization of maintaining routing tables internally between vertices and edges so that it can form triplets quickly, whereas GraphFrames has the Catalyst and Tungsten performance layers that GraphX lacks.
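As an illustration of message aggregation in GraphFrames, the sketch below sums the ages of each vertex’s neighbors. It uses the aggregateMessages entry point and the AggregateMessages helper object from released GraphFrames versions; the class name in the version described here (AggregateMessagesBuilder) may differ. Spark 2.x or later is assumed, and the sample data and the age column are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum
import org.graphframes.GraphFrame
import org.graphframes.lib.AggregateMessages

// Assumption: a local SparkSession; in the Spark shell one is already provided as `spark`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("aggregate-messages-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data with an extra "age" vertex property.
val vertices = Seq(
  (1L, "Ann", 34),
  (2L, "Bill", 36),
  (3L, "Charles", 30)
).toDF("id", "name", "age")

val edges = Seq((1L, 2L), (2L, 3L)).toDF("src", "dst")

val g = GraphFrame(vertices, edges)

// AM.src, AM.dst, and AM.msg are column helpers for the edge's source vertex,
// its destination vertex, and the message being aggregated.
val AM = AggregateMessages

val neighborAgeSums = g.aggregateMessages
  .sendToSrc(AM.dst("age"))          // each edge sends the destination's age to its source
  .sendToDst(AM.src("age"))          // and the source's age to its destination
  .agg(sum(AM.msg).as("summedAges")) // messages arriving at a vertex are summed

neighborAgeSums.show()
```

The send and aggregate steps are expressed as DataFrame column expressions rather than Scala closures, which is what lets Spark SQL plan and optimize them.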
As of Spark 1.6, GraphFrames is out on GitHub. In later versions, GraphFrames may be available on spark-packages.org (see appendix C) or as part of the Apache Spark distribution itself. To download and build the precise version used in this book, execute the following commands (for more information about Git, see Git in Practice by Mike McQuaid [Manning, 2014]):
The fundamental graph type in GraphFrames is the GraphFrame. A GraphFrame contains two DataFrames from Spark SQL (see figure 10.4), where vertices is expected to have a data column called id and edges is expected to have data columns called src and dst. Additional user columns for vertex and edge properties can be added.
Figure 10.4. Whereas the fundamental graph type in GraphX is Graph, in GraphFrames it’s GraphFrame. The parameterized type system isn’t used in GraphFrames—rather there’s a convention (enforced at runtime) where columns in the DataFrames are expected to have particular names.
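The column convention can be sketched as follows. This is a minimal example, assuming Spark 2.x or later with the graphframes package on the classpath; the session settings, the sample data, and the property columns name and relationship are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

// Assumption: a local SparkSession; in the Spark shell one is already provided as `spark`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("graphframe-construction-sketch")
  .getOrCreate()
import spark.implicits._

// Vertices need an "id" column; "name" is an additional user property column.
val vertices = Seq(
  (1L, "Ann"),
  (2L, "Bill"),
  (3L, "Charles")
).toDF("id", "name")

// Edges need "src" and "dst" columns; "relationship" is an additional user property column.
val edges = Seq(
  (1L, 2L, "is-friends-with"),
  (2L, 3L, "is-friends-with")
).toDF("src", "dst", "relationship")

// Column names are checked when the GraphFrame is constructed, not by the compiler.
val g = GraphFrame(vertices, edges)

g.vertices.show()
g.edges.show()
```

Since the convention is checked only at runtime, renaming id to something else still compiles; the mistake should surface only when the GraphFrame is constructed.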
The GraphFrames API provides functions to convert GraphFrames to and from GraphX Graphs. For example, a myGraph defined in the Spark Shell as in listing 4.1 can be converted to a GraphFrame and back again.
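A minimal sketch of that round trip, using GraphFrame.fromGraphX and toGraphX, is shown below. The myGraph built here is a small stand-in with String vertex and edge attributes and is not necessarily identical to listing 4.1; Spark 2.x or later and the session settings are likewise assumptions.

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.{Row, SparkSession}
import org.graphframes.GraphFrame

// Assumption: a local SparkSession; in the Spark shell `spark` and `sc` already exist.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("graphx-conversion-sketch")
  .getOrCreate()
val sc = spark.sparkContext

// Stand-in GraphX graph with String vertex and edge attributes (hypothetical data).
val myVertices = sc.makeRDD(Seq((1L, "Ann"), (2L, "Bill"), (3L, "Charles")))
val myEdges = sc.makeRDD(Seq(
  Edge(1L, 2L, "is-friends-with"),
  Edge(2L, 3L, "is-friends-with")
))
val myGraph = Graph(myVertices, myEdges)

// GraphX -> GraphFrames: vertex and edge attributes become DataFrame columns.
val gf: GraphFrame = GraphFrame.fromGraphX(myGraph)
gf.vertices.printSchema()
gf.edges.printSchema()

// GraphFrames -> GraphX: the attributes come back as untyped Rows.
val backToGraphX: Graph[Row, Row] = gf.toGraphX
println(backToGraphX.vertices.count())
```

The return trip yields a Graph[Row, Row] rather than the original Graph[String, String], so code that consumes the converted graph has to read attributes out of each Row.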