
Part 4. Going further

You’re reaching the last part of this book. However, your journey is only about to start, or, if you have already started it, it will become even more exciting. That’s why the next three chapters will bring knowledge and answers to a lot of your questions but will also trigger more questions and guide you to more sources of knowledge. This is also the time when you can start putting everything together in an integrated way, such as by building complete pipelines.

There is no doubt: Apache Spark is fast. However, performance is driven not only by the engine, but also by how you use the engine. Chapter 16 focuses primarily on two optimization techniques called caching and checkpointing. After seeing an example using theoretical data to explain caching, you will take a deep dive with real-life data and analytics. I will conclude by giving more hints and resources on further optimizing Spark.

Up to now, with the exception of chapter 2, you have been ingesting, processing, and simply showing onscreen the result of transformations and actions. Isn’t it about time we do something with this data, like exporting it to files? Chapter 17 focuses on those operations and explains the impact of partitions on this project. Be careful, as chapter 17 may also include a not-so-subtle reference to the Hitchhiker’s Guide to the Galaxy, a must for any computer book, no? Chapter 17 will also help you explore using cloud services with Spark.