8 Leveraging data locality and memory of your machines
This chapter covers
- Leveraging data locality in your big data processing
- Optimizing join strategies with Apache Spark and reducing shuffling
- Memory versus disk usage, and how to use both resources in big data processing
In big data applications, both streaming and batch, we often need to combine data from multiple sources to extract insights and business value. The data locality pattern lets us move the computation to the data, which may live in a database or a file system. As long as the data fits on the disk or in the memory of a single machine, the situation is simple: processing can be local and fast.

In real-world big data applications, however, it is rarely feasible to store all the data on one machine. We need to employ techniques such as partitioning to split the data across multiple machines. Once the data is spread over multiple physical hosts, reachable only via the network, extracting insights from it becomes harder. Joining data in such a scenario is not a trivial task and needs careful planning. In this chapter, we will follow the process of joining data in such a big data scenario. Before we delve into it, let's start by understanding one of the main concepts of big data processing: data locality.
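To make the idea of partitioning concrete, here is a minimal sketch (in plain Python, not Spark) of hash partitioning: each record is assigned to a machine by hashing its key, so every node can compute where a record lives without a central lookup. The host names and record data are hypothetical, chosen only for illustration.

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a record key to a partition deterministically, using a
    stable hash so the assignment is reproducible across runs."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

# Hypothetical cluster of three machines and a few sample records.
machines = ["host-0", "host-1", "host-2"]
records = [("user-1", "click"), ("user-2", "view"), ("user-1", "purchase")]

placement = {}
for key, value in records:
    machine = machines[partition_for(key, len(machines))]
    placement.setdefault(machine, []).append((key, value))

# All records sharing a key land on the same machine, so a join on that
# key can run locally there, without moving data over the network.
```

Note that both records for `user-1` end up on the same host. This co-location by key is exactly what makes local joins possible, and, as we will see later in the chapter, what Spark tries to achieve (or avoid breaking) when it plans joins and shuffles.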