
Part 3. Over the arc

Part 3 covers the missing pieces and documentation. Chapter 8 presents algorithms you might expect to find in the GraphX API but that, as of Spark 1.6, aren’t there. From reading graph data in the standard RDF format to merging graphs, these algorithms plug some of those holes.

Chapter 8 also covers how to use IndexedRDD, which is like the HashMap of RDDs: it supports efficient lookups by key. We go through an example showing how it can improve performance.

Finally, you’ll see an example of identifying likely missing data in Wikipedia using ideas from graph isomorphism, that is, finding pieces of graphs that are similar to each other.

Chapter 9 is all about putting GraphX into production, along with debugging and performance tuning. It steps you through tools like DAG Visualization and the History Server, and provides a concrete set of techniques, such as caching, checkpointing, and serializer tuning, to improve the performance of your Spark GraphX application.