8 Leveraging data locality and memory of your machines
This chapter covers
- Data locality in big data processing
- Optimizing join strategies with Apache Spark
- Reducing data shuffling
- Memory vs. disk usage in big data processing
With both streaming and batch processing in big data applications, we often need to combine data from multiple sources to extract insights and business value. The data locality pattern lets us move the computation to the data instead of moving the data to the computation. Our data can live in a database or a filesystem, and as long as it fits on the disk or in the memory of a single machine, the situation is simple: processing can be local and fast. But in big data applications, it is not feasible to store large amounts of data on one machine. We need to employ techniques such as partitioning to split the data across multiple machines. Once the data is spread over multiple physical hosts, reachable only via the network, it becomes harder to gain insights from it. Joining data in such a scenario is not a trivial task and requires careful planning.
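To make this concrete, here is a minimal sketch of the kind of join we will work with throughout the chapter. It assumes two hypothetical Parquet datasets (a large `/data/clicks` table and a small `/data/users` lookup table) and uses Spark's `broadcast` hint to ship the small dataset to the executors holding the large one, rather than shuffling both sides across the network; later sections examine these strategies in depth.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object JoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("join-sketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical inputs: a large, partitioned fact dataset
    // and a lookup table small enough to fit in executor memory.
    val clicks = spark.read.parquet("/data/clicks")
    val users  = spark.read.parquet("/data/users")

    // A plain join may shuffle both sides across the network.
    val shuffled = clicks.join(users, "userId")

    // Broadcasting the small side copies it to every executor,
    // so the large side is joined where its partitions already live.
    val broadcasted = clicks.join(broadcast(users), "userId")

    broadcasted.explain() // inspect which join strategy Spark chose
    spark.stop()
  }
}
```

Note the `explain()` call at the end: inspecting the physical plan is how we will verify, throughout the chapter, whether a given join actually avoids a shuffle.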
We will follow the process of joining data in such a big data scenario. Before we delve into that, though, let's start the chapter by understanding one of the main concepts in big data processing: data locality.