8 Leveraging data locality and memory of your machines

 

This chapter covers

  • Leveraging data locality in your big data processing
  • Optimizing joining strategies with Apache Spark and reducing shuffling
  • Memory versus disk usage and how to use both resources in big data processing

In big data applications, both streaming and batch, we often need to combine data from multiple sources to gain insights and business value. The data locality pattern allows us to move the computation to the data, whether that data lives in a database or on a file system. The situation is simple as long as our data fits on the disk or in the memory of a single machine: processing can be local and fast. In real-world big data applications, however, it is rarely feasible to store all the data on one machine. We need to employ techniques such as partitioning to split the data across multiple machines. Once the data is on multiple physical hosts, it gets harder to extract insights from it, because it is distributed across locations that are reachable only via the network. Joining data in such a scenario is not a trivial task and needs careful planning. In this chapter, we will follow the process of joining data in such a big data scenario. Before we delve into it, let's start by understanding one of the main concepts in big data: data locality.

8.1       What is data locality?

8.1.1   Moving computations to data

8.1.2   Scaling processing using data locality

8.2       Splitting data that does not fit on one machine using data partitioning

8.2.1   Offline big data partitioning

8.2.2   Partitioning vs sharding

8.2.3   Partitioning algorithms

8.3       Join big data sets from multiple partitions

8.3.1   Joining data within the same physical machine

8.3.2   Joining that requires data movement

8.3.3   Joining with leveraging broadcast

8.4       Data processing leveraging memory vs disk

8.4.1   Disk-based processing

8.4.2   Why do we need Map-Reduce?

8.4.3   Calculating access times

8.4.4   RAM-based processing

8.5       Implement joins using Apache Spark

8.5.1   Implement join without broadcast

8.5.2   Implement join with broadcast

8.6       Summary