About this Book

With Spark GraphX in Action we hope to bring down to earth the sometimes esoteric topic of graphs, while explaining how to use them from the in-memory distributed computing framework that has gained the most mindshare, Apache Spark.

Who should read this book

We assume the reader has no previous knowledge of Spark, Scala, or graphs, but we move so quickly through the material that previous exposure to at least one of these would be helpful. We attempt to be particularly gentle with our use of Scala. We provide a brief introduction to Scala in chapter 3 and Scala tips throughout the book whenever a new Scala concept is introduced (these are listed in appendix D). In fact, we have recommended this book as a concise introduction to Scala, pointing to chapter 3, the Scala tips, and appendix D.

In addition, we completely avoid the mathematical proofs that are common in college courses in graph theory. Our focus is on graph algorithms and applications, and sometimes we pull in graph structure terminology as needed.

This book targets version 1.6 of Spark and GraphX.

The intended reader has substantial development experience in a programming language such as Java. However, graphs lend themselves so naturally to illustration that non-developers will also be able to glean ideas about what graphs can be used for.

How this book is organized

This book is divided into three parts. Part 1 consists of three chapters that cover the prerequisites to using Spark GraphX. The four chapters in part 2 cover standard and expected ways to use GraphX, and the three chapters in part 3 are on advanced topics.

We could also have divided the book into two parts, with the first five chapters covering the prerequisites and the basic GraphX API, and the last five chapters covering ways to apply GraphX.

Here’s a run-down of the ten chapters:

  • Chapter 1 sets the stage with what Big Data, Spark, and graphs are, and how Spark GraphX fits into a processing data flow. Chapter 1 is a mini-book unto itself—not in length, but in its breadth of overview.
  • Chapter 2 is a very brief, hands-on demonstration of using GraphX—no experience required.
  • Chapter 3 covers the prerequisites of Spark, Scala, and graphs.
  • Chapter 4 discusses how to do basic Spark GraphX operations and presents the two main methods of implementing custom GraphX algorithms: Map/Reduce and Pregel.
  • Chapter 5 illustrates how to use the numerous algorithms built into GraphX.
  • Chapter 6 is where something outside the API is finally covered. Here we take some of the classic mid-20th century graph algorithms and show how they can be implemented in GraphX.
  • Chapter 7 is a lengthy and ambitious chapter on machine learning. Normally this would require a book unto itself, but here we cover machine learning without assuming any prior knowledge or experience and quickly ramp up to advanced examples of supervised, unsupervised, and semi-supervised learning.
  • Chapter 8 shows how some operations can be done in GraphX that one might assume would come built into a graph-processing package: reading RDF files, merging graphs, finding graph isomorphisms, and computing the global clustering coefficient.
  • Chapter 9 shows how to monitor performance and see what your GraphX application is doing. It then shows how to do performance tuning through techniques like caching, checkpointing, and serializer tuning.
  • Chapter 10 describes how to use languages other than Scala with GraphX (but strongly advises against it) and also discusses how to use tools that complement GraphX. It demonstrates Apache Zeppelin notebook software with GraphX to provide visualization of graphs inline with an interactive notebook shell. The third-party tool Spark JobServer can be used to convert GraphX from a mere batch graph processing system to an online database of sorts. Finally, GraphFrames is a library on GitHub (developed by some of the developers of GraphX) that uses Spark SQL DataFrames rather than RDDs to provide a convenient and high-performing way to query graphs.

We also include four appendixes in the book. Appendix A addresses installing Spark, and appendix B gives a brief overview of the Gephi visualization software. In appendix C you’ll find a number of online resources for additional information about GraphX and where to go to keep up with the latest developments. Finally, appendix D lists the Scala tips given throughout the book.

Anyone new to Spark, Scala, or graphs should progress through the first five chapters linearly. After that, you can pick and choose topics from the last five chapters.

Anyone who is an expert in Spark, Scala, and graphs but new to GraphX can skip chapter 3 and probably also chapter 5.

About the code

The source code for this book is available for download from manning.com at https://www.manning.com/books/spark-graphx-in-action.

For the most part, the code presented in this book and available for download is intended to be used with the interactive Spark shell. Thus, the .scala extension is technically a misnomer, as these files can’t be compiled with the scalac compiler.
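
To make the distinction concrete, here is a minimal sketch of what shell-style code tends to look like. It is not taken from the book's listings: the vertex names and the edge label are made up for illustration. It relies on sc, the SparkContext that the Spark shell defines for you, and it consists of bare top-level statements, which is why scalac would reject the file as it stands.

```scala
// Paste into the interactive Spark shell (spark-shell), where `sc` already exists.
// The data below is hypothetical and only serves to illustrate the style.
import org.apache.spark.graphx._

// Vertices are (id, attribute) pairs; ids must be Long values.
val vertices = sc.makeRDD(Array((1L, "Ann"), (2L, "Bill")))

// Each Edge carries a source id, a destination id, and an attribute.
val edges = sc.makeRDD(Array(Edge(1L, 2L, "knows")))

// Build the graph and print each edge together with its two endpoints.
val graph = Graph(vertices, edges)
graph.triplets.collect.foreach(println)
```

There is no enclosing object and no main method here; the shell evaluates each statement as it is entered.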

Some examples are meant to be conventionally compiled and executed, and these are always accompanied by a pom.xml for Maven or by a .sbt for sbt (Simple Build Tool).
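
As a rough illustration of the sbt case, a build definition for a compiled GraphX program might look something like the sketch below. The project name is hypothetical, and the exact patch versions (Spark 1.6.1, Scala 2.10.6) are assumptions chosen to match the 1.6 line this book targets; the build files that accompany the downloadable code should be treated as authoritative.

```scala
// build.sbt: a minimal sketch, not the book's actual build file.
// "graphx-example" is a placeholder project name.
name := "graphx-example"

version := "0.1"

// Assumed: a Scala 2.10.x release, for which Spark 1.6 artifacts are published.
scalaVersion := "2.10.6"

// "provided" keeps Spark out of the packaged jar, on the assumption that
// the cluster (or spark-submit) supplies the Spark classes at run time.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"   % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-graphx" % "1.6.1" % "provided"
)
```

With a file like this in the project root, running sbt package would typically produce a jar that can be handed to spark-submit.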

This book contains many examples of source code, both in numbered listings and inline with normal text. In both cases, source code is formatted in a fixed-width font like this to separate it from ordinary text.

In many cases, the original source code has been reformatted; we’ve added line breaks and reworked indentation to accommodate the available page space in the book. In rare cases, even this was not enough, and listings may include line-continuation markers (➥). Additionally, comments in the source code have often been removed from the listings when the code is described in the text. Code annotations accompany many of the listings, highlighting important concepts.

Author Online

Purchase of Spark GraphX in Action includes free access to a private web forum run by Manning Publications where you can make comments about the book, ask technical questions, and receive help from the authors and from other users. To access the forum and subscribe to it, point your web browser to www.manning.com/books/spark-graphx-in-action. This page provides information on how to get on the forum once you are registered, what kind of help is available, and the rules of conduct on the forum.

Manning’s commitment to our readers is to provide a venue where a meaningful dialog between individual readers and between readers and the authors can take place. It is not a commitment to any specific amount of participation on the part of the authors, whose contribution to the Author Online forum remains voluntary (and unpaid). We suggest you try asking the authors some challenging questions lest their interest stray!

The Author Online forum and the archives of previous discussions will be accessible from the publisher’s website as long as the book is in print.

About the authors

MICHAEL MALAK has been writing software since before computers could be purchased in stores preassembled. He has been developing in Spark for two Fortune 200 companies since early 2013 and often gives presentations, especially in the Denver/Boulder region of Colorado where he lives. You can find his personal technical blog at http://technicaltidbit.com.

ROBIN EAST has worked as a consultant to large organizations for more than 15 years, delivering Big Data and content intelligence solutions in the fields of finance, government, healthcare, and utilities. He is a data scientist at Worldpay, helping them deliver their vision of putting data at the heart of everything they do. You can find his other writings on Spark, GraphX, and machine learning at https://mlspeed.wordpress.com.
