8 Leveraging data locality and memory of your machines

 

This chapter covers

  • Data locality in big data processing
  • Optimizing join strategies with Apache Spark
  • How to reduce shuffling
  • Memory vs. disk usage in big data processing

In both streaming and batch processing, big data applications often need to combine data from multiple sources to produce insights and business value. The data locality pattern lets us move computation to the data. Our data can live in a database or on a filesystem, and the situation stays simple as long as the data fits on the disk or in the memory of a single machine: processing can be local and fast. In big data applications, however, it is not feasible to store large amounts of data on one machine. We need techniques such as partitioning to split the data across multiple machines. Once the data lives on multiple physical hosts, it becomes harder to gain insights from it, because the data is spread across locations that are reachable only over the network. Joining data in such a scenario is not a trivial task and needs careful planning.
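To make the idea of partitioning more concrete, here is a minimal sketch in plain Python rather than in a distributed engine. It assumes a hypothetical cluster of three partitions, hypothetical `users` and `clicks` data sets, and a CRC32-based hash chosen only for illustration. When both data sets are partitioned by the same key with the same function, matching records land in the same partition, so each partition can be joined locally without moving data over the network.

```python
from collections import defaultdict
import zlib

NUM_PARTITIONS = 3  # assumption: a hypothetical cluster of three machines


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition using a stable hash (CRC32, chosen for illustration)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_records(records, num_partitions: int = NUM_PARTITIONS):
    """Group (key, value) records by the partition that owns their key."""
    partitions = defaultdict(list)
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


def local_join(left, right):
    """Inner-join two lists of (key, value) pairs that live on the same partition."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    return [(key, lv, rv) for key, lv in left for rv in right_by_key[key]]


# Hypothetical sample data, keyed by user ID.
users = [("u1", "Alice"), ("u2", "Bob"), ("u3", "Carol")]
clicks = [("u1", "/home"), ("u3", "/cart"), ("u1", "/checkout")]

# Both data sets use the same partitioning function, so equal keys are co-located.
user_parts = partition_records(users)
click_parts = partition_records(clicks)

# Each partition is joined independently; no record crosses a partition boundary.
joined = []
for p in range(NUM_PARTITIONS):
    joined.extend(local_join(user_parts[p], click_parts[p]))

print(sorted(joined))
```

If the two data sets were partitioned by different keys or different functions, the records for one user could sit on different machines, and a join would first have to move data across the network.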

We will follow the process of joining data in such a big data scenario. Before we delve into that, though, let's start the chapter by understanding one of the main concepts in big data: data locality.

8.1 What is data locality?

 
 
 

8.1.1 Moving computations to data

 
 
 

8.1.2 Scaling processing using data locality

 
 
 
 

8.2 Data partitioning and splitting data

 
 
 

8.2.1 Offline big data partitioning

 
 
 

8.2.2 Partitioning vs. sharding

 
 
 

8.2.3 Partitioning algorithms

 
 

8.3 Join big data sets from multiple partitions

 
 

8.3.1 Joining data within the same physical machine

 
 
 

8.3.2 Joining that requires data movement

 
 
 

8.3.3 Optimizing join leveraging broadcasting

 
 
 

8.4 Data processing: Memory vs. disk

 
 
 

8.4.1 Using disk-based processing

 
 
 
 

8.4.2 Why do we need MapReduce?

 
 
 
 

8.4.3 Calculating access times

 
 
 
 

8.4.4 RAM-based processing

 

8.5 Implement joins using Apache Spark

 
 
 
 

8.5.1 Implementing a join without broadcast

 
 
 

8.5.2 Implementing a join with broadcast

 

8.6 Summary

 
 