chapter seven

Chapter 7. Utilizing data structures and algorithms at scale

This chapter covers

Representing and using data structures such as graphs, HyperLogLog, and Bloom filters in MapReduce
Applying algorithms such as PageRank and semi-joins to large amounts of data
Learning how social network companies recommend making connections with people outside your network

In this chapter we’ll look at how you can implement algorithms in MapReduce to work with internet-scale data. We’ll focus on nontrivial data, which is commonly represented using graphs.

We’ll also look at how you can use graphs to model connections between entities, such as relationships in a social network. We’ll run through a number of useful algorithms that can be performed over graphs, such as shortest path and friends-of-friends (FoF), to help expand the interconnectedness of a network, and PageRank, which looks at how to determine the popularity of web pages.

You’ll learn how to use Bloom filters, whose unique space-saving properties make them handy for solving distributed system problems in P2P (peer-to-peer) and distributed databases. We’ll also create Bloom filters in MapReduce and then look at their usefulness for filtering.

You’ll also learn about another approximate data structure called HyperLogLog, that provides approximate unique counts, which are invaluable in aggregation pipelines.

Chapter 7. Utilizing data structures and algorithms at scale

This chapter covers

7.1. Modeling data and solving problems with graphs

7.2. Bloom filters

7.3. HyperLogLog

7.4. Chapter summary