Chapter 1. Introduction to Apache Spark


This chapter covers

  • What Spark brings to the table
  • Spark components
  • Spark program flow
  • Spark ecosystem
  • Downloading and starting the spark-in-action virtual machine

Apache Spark is usually defined as a fast, general-purpose, distributed computing platform. Yes, it sounds a bit like marketing speak at first glance, but we could hardly come up with a more appropriate label to put on the Spark box.

Apache Spark really did bring a revolution to the big data space. Spark makes efficient use of memory and can execute equivalent jobs 10 to 100 times faster than Hadoop’s MapReduce. On top of that, Spark’s creators managed to abstract away the fact that you’re dealing with a cluster of machines, and instead present you with a set of collections-based APIs. Working with Spark’s collections feels like working with local Scala, Java, or Python collections, but Spark’s collections reference data distributed across many nodes. Operations on these collections are translated into complicated parallel programs without the user necessarily being aware of it, which is a truly powerful concept.
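To give you a feel for this collections-style API, here is a minimal Scala sketch that filters the lines of a text file, much as you would filter a local collection. It assumes only that the spark-core library is on your classpath; the object name, application name, and input file path are illustrative, not part of any setup described in this book.

import org.apache.spark.{SparkConf, SparkContext}

object LineCount {
  def main(args: Array[String]): Unit = {
    // Run locally on all available cores; on a real cluster,
    // only the master URL would change.
    val conf = new SparkConf().setAppName("LineCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // lines looks like an ordinary collection of strings, but it is an RDD:
    // its elements may live in partitions spread across many nodes.
    val lines = sc.textFile("README.md") // illustrative path

    // Familiar collection operations; Spark turns them into a parallel job.
    val sparkLines = lines.filter(line => line.contains("Spark"))
    println(s"Lines mentioning Spark: ${sparkLines.count()}")

    sc.stop()
  }
}

Notice that filter and count read exactly like their counterparts on a local Scala collection; the only Spark-specific code is the handful of lines that create the SparkContext.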

1.1. What is Spark?

1.2. Spark components

1.3. Spark program flow

1.4. Spark ecosystem

1.5. Setting up the spark-in-action VM

1.6. Summary
