8 Leveraging data locality and memory of your machines
This chapter covers
- Data locality in big data processing
- Optimizing join strategies with Apache Spark
- Reducing data shuffling
- Memory vs. disk usage in big data processing
With both streaming and batch processing in big data applications, we often need to combine data from multiple sources to extract insights and business value. The data locality pattern lets us move the computation to the data instead of moving the data to the computation. Our data can live in a database or a filesystem, and as long as it fits on the disk or in the memory of a single machine, the situation is simple: processing can be local and fast. But in big data applications, it is not feasible to store large amounts of data on one machine. We need to employ techniques such as partitioning to split the data across multiple machines. Once the data is spread over multiple physical hosts, reachable only via the network, it becomes harder to gain insights from it. Joining data in such a scenario is not a trivial task and requires careful planning.
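To make this concrete, here is a minimal sketch of the kind of join we will work with throughout the chapter. It assumes two hypothetical Parquet datasets (a large `/data/clicks` table and a small `/data/users` lookup table) and uses Spark's `broadcast` hint to ship the small dataset to the executors holding the large one, rather than shuffling both sides across the network; later sections examine these strategies in depth.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object JoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("join-sketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical inputs: a large, partitioned fact dataset
    // and a lookup table small enough to fit in executor memory.
    val clicks = spark.read.parquet("/data/clicks")
    val users  = spark.read.parquet("/data/users")

    // A plain join may shuffle both sides across the network.
    val shuffled = clicks.join(users, "userId")

    // Broadcasting the small side copies it to every executor,
    // so the large side is joined where its partitions already live.
    val broadcasted = clicks.join(broadcast(users), "userId")

    broadcasted.explain() // inspect which join strategy Spark chose
    spark.stop()
  }
}
```

Note the `explain()` call at the end: inspecting the physical plan is how we will verify, throughout the chapter, whether a given join actually avoids a shuffle.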
We will follow the process of joining data in such a big data scenario. Before we delve into that, though, let's start the chapter by understanding one of the main concepts in big data processing: data locality.