List of Figures

 

Chapter 1. Two important technologies: Spark and graphs

Figure 1.1. Big Data is data that is too big to fit on a single machine. Hadoop and Spark are technologies that distribute Big Data across a cluster of nodes. Spark is typically faster than Hadoop alone because it distributes data across the cluster's RAM rather than its disks.

Figure 1.2. Three data blocks distributed with replication factor 2 across a Hadoop Distributed File System (HDFS)

Figure 1.3. MapReduce is the processing paradigm used by both Hadoop and Spark. Shown is a MapReduce operation to count the number of times “error” appears in a server log. The Map is (normally) a one-to-one operation that produces one transformed data item for each source data item. The Reduce is a many-to-one operation that summarizes the Map outputs.

Figure 1.4. Spark provides RDDs that can be viewed as distributed in-memory arrays.

Figure 1.5. If Charles shares his status with friends of friends, determining who could see his status would be cumbersome if you had only tables or arrays to work with.

Figure 1.6. The links between web pages can be represented as a graph. The structure of the graph provides information about the relative authority, or ranking, of each page.

Figure 1.7. Different types of data that can be represented by graphs