Chapter 8. The missing algorithms

This chapter covers

Reading RDF files
Merging graphs
Filtering out isolated vertices
Using IndexedRDD for performance gains
Taking a simplistic approach to finding graph isomorphisms
Computing the global clustering coefficient

You’ve seen examples of reading graph data from edge list files in earlier chapters. RDF (Resource Description Framework) is another important file format, used by many existing graph datasets. This chapter shows you how to read RDF files and then apply that knowledge to the YAGO3 dataset.

Aside from the classic graph algorithms from chapter 6, there are other, slightly more modern algorithms that you come to expect in a graph database or graph processing system. Some of these are missing: they’re not implemented yet, or at least not commonly available in either the official Apache Spark distribution (as of Spark 1.6) or on spark-packages.org.

In this chapter, you’ll see how to implement some of these algorithms. You’ll also see how to use IndexedRDD for performance gains. IndexedRDD was originally written by one of the main GraphX code contributors but was never merged into the Apache Spark distribution.

This chapter covers

8.1. Missing basic graph operations

8.2. Reading RDF graph files

8.3. Poor man’s graph isomorphism: finding missing Wikipedia infobox items

8.4. Global clustering coefficient: compare connectedness

8.5. Summary