Appendix A. Installing Spark


This appendix covers

  • The quickest ways to get started in Spark
  • Using virtual machines (VMs) to run Spark
  • Using Amazon Web Services / Elastic MapReduce to run Spark

Using Spark typically means first having (1) Hadoop installed and (2) a cluster of machines to run it on. The simplest scenario is that you’re doing GraphX work for your job and your employer already has a Hadoop/Spark cluster set up that you can use. If that’s not the case, this appendix is for you. It describes several options in which you don’t necessarily need either Hadoop or a cluster of machines.

The three options described in this appendix are as follows:

1.  On a local virtual machine—Cloudera QuickStart VM (with Hadoop and Spark preinstalled and ready to use).

2.  On your Linux or OS X laptop, desktop, or VM—Hadoop is not necessary.

3.  In the cloud—Amazon Web Services.

A few developers prefer to do all development on virtual machines, and this appendix reflects that not-too-common bias. (In this context, we mean VMs hosted on one’s laptop using VMware Player or VirtualBox, not VMs in the cloud.) Multiple VMs make it easy to work on multiple projects, each with its own environment: versions of Java, versions of Scala, OS versions, and so on. VMs are also easy to hand over to colleagues and team members. As a final benefit, VMs allow one to copy and paste to and from the host OS, where the email client, familiar tools, and data files reside.

A.1. On a local virtual machine: CDH QuickStart VM

A.2. Onto your laptop and Hadoopless: Linux or OS X

A.3. In the cloud: Amazon Web Services