Chapter 2. GraphX quick start


This chapter covers

  • Finding graph data to play with
  • First steps with GraphX using the Spark Shell
  • Invoking the PageRank algorithm

The Spark Shell is the easiest way to start using Spark quickly and a great way to explore graph datasets. No compilation is necessary, which means you can focus on running commands and seeing their output. Even though the Spark Shell uses Scala as its programming language, there’s no need to worry if you haven’t used Scala before. This chapter will guide you every step of the way.

This chapter walks you through the steps of working with GraphX without delving into the details. You’ll download some sample graph data consisting of bibliographic citations. Using the Spark Shell, you’ll quickly determine which paper has been cited most frequently. More interestingly, you’ll invoke the PageRank algorithm built into GraphX to find the “most influential” paper in the citation graph. In subsequent chapters, we’ll see what’s going on under the covers.
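As a rough preview of those two queries, here is a minimal sketch meant to be pasted into the Spark Shell. It doesn't use the downloaded citation data; the five-edge graph is a hypothetical toy example made up for illustration, and the names citations, graph, mostCited, ranks, and mostInfluential are our own. It assumes only the sc variable that the Spark Shell provides.

```scala
// Paste into the Spark Shell, which already provides `sc` (the SparkContext).
import org.apache.spark.graphx._

// Hypothetical toy citation data (not the chapter's dataset):
// an edge (a, b) means "paper a cites paper b".
val citations = sc.parallelize(Seq(
  (1L, 3L), (2L, 3L), (4L, 3L), (3L, 5L), (4L, 5L)
))

// Build a graph from the (source, destination) pairs.
// Every vertex gets the default attribute 1.
val graph = Graph.fromEdgeTuples(citations, 1)

// Most frequently cited paper: the vertex with the largest in-degree.
// Each element of inDegrees is a (vertexId, inDegree) pair.
val mostCited = graph.inDegrees.reduce((a, b) => if (a._2 >= b._2) a else b)
// mostCited is (3, 3): paper 3 is cited by papers 1, 2, and 4.

// "Most influential" paper: the vertex with the highest PageRank.
// 0.001 is the convergence tolerance passed to pageRank.
val ranks = graph.pageRank(0.001).vertices
val mostInfluential = ranks.reduce((a, b) => if (a._2 >= b._2) a else b)
```

The paper with the highest PageRank need not be the one with the most citations, because PageRank also weighs how highly ranked the citing papers are.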

2.1. Getting set up and getting data

Although normally you would write a Spark program in Scala (or Java or Python), compile it, and submit it to a Spark cluster, Spark also offers the Spark Shell, which is an interactive shell where you can quickly test out ideas.
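To give a feel for that interactivity, here is a minimal sketch of a shell session once Spark is installed and the shell is running. The lines are typed one at a time at the scala> prompt, and each result is echoed back immediately. The values nums and evens are hypothetical names chosen for this illustration; sc is the SparkContext that the shell creates at startup.

```scala
// Typed line by line at the scala> prompt of the Spark Shell.
// `sc` is the SparkContext the shell creates for you at startup.
val nums = sc.parallelize(1 to 10)    // distribute a small local range across the cluster
val evens = nums.filter(_ % 2 == 0)   // transformations are lazy; nothing runs yet
evens.count()                         // an action triggers the computation and returns 5
```

There is no build file, no compile step, and no job submission here, which is what makes the shell convenient for trying out ideas.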

The first thing to do is install Spark. If you haven’t done so already, see Appendix A.

Now, assuming you have Spark installed, type


2.2. Interactive GraphX querying using the Spark Shell

2.3. PageRank example

2.4. Summary