Chapter 5. Built-in algorithms


This chapter covers

  • Algorithms that come with the GraphX API
  • Detecting clusters within graphs: PageRank, Shortest Paths, Connected Components, Label Propagation
  • Measuring connectedness of a graph or subgraph with Triangle Count
  • Measuring the connectedness of a subset of users in a social network graph and finding isolated populations

In chapter 4 you learned about the foundational GraphX APIs that enable you to write your own custom algorithms. But there’s no need for you to reinvent the wheel in cases where the GraphX API already provides an implemented standard algorithm. In this chapter, we describe some of those basic algorithms and discuss which situations they can be used in:

  • PageRank
  • Personalized PageRank
  • Triangle Count
  • Shortest Paths
  • Connected Components
  • Strongly Connected Components
  • Label Propagation

We wait until chapter 7 to cover SVDPlusPlus, one of the more useful and advanced built-in algorithms.

5.1. Seek out authoritative nodes: PageRank

Chapter 2 covered an example invoking the PageRank algorithm, which was originally invented to rank web pages for a search engine, but there we used it to find influential papers in a citation network. Generally speaking, PageRank can be used to find the “important” nodes in almost any graph. Here we go more in depth into PageRank: what it does under the covers and parameters and different ways of invoking it. Note that PageRank is patented by Stanford University and trademarked by Google.

5.2. Measuring connectedness: Triangle Count

5.3. Find the fewest hops: ShortestPaths

5.4. Finding isolated populations: Connected Components

5.5. Reciprocated love only, please: Strongly Connected Components

5.6. Community detection: LabelPropagation

5.7. Summary