Last update: 2019-09-25
Python 2 is on its way out, with a scheduled end of life on January 1st, 2020. The transition will happen while this book is being written. At the moment, most OSes are in the transitional period between Python 2 and 3, which is why I spend a little time discussing how to install Python 3. I expect this guide to become simpler as time goes on.
I am currently targeting the most recent version of each OS:
- Windows 10
- macOS Mojave
- Ubuntu 18.04 LTS
As OS versions get updated, I'll continue updating this appendix.
This appendix covers the installation of standalone Spark and PySpark on your own computer, whether it's running Windows, macOS, or Linux.
Having a local PySpark cluster means that you’ll be able to experiment with the syntax, using smaller data sets. You don’t have to acquire multiple computers or spend any money on managed PySpark on the cloud until you’re ready to scale your programs.
Spark is a complex piece of software, and most guides out there over-complicate the installation process. We'll take a much simpler approach by installing the bare minimum to start, and building from there. Our goals are as follows:
- Install Java (Spark is written in Scala, which runs on the Java Virtual Machine, or JVM).
- Install Spark
- Install Python 3 and IPython
- Launch a PySpark shell using IPython
We will use the command line as much as possible, to make the installation steps easily reproducible. We are covering the following options:
- Windows 7 and 10 (plain installation)
- Windows 10 with Windows Subsystem for Linux (WSL)
- macOS Mojave, using Homebrew
- Linux (Ubuntu), using apt
Note
You’ll need admin rights on the machine you’re trying to install Spark on. I find that fiddling to make it work on a locked-down work computer is often not worth it. If you can’t install it using the instructions in this Appendix, have a look at low-cost/no-cost cloud options in Appendix B.
Depending on your OS and installation strategy, there are some OS-specific steps we need to follow. Table A.1 lists them: please have a look to see if your OS is listed.
Table A.1. Preliminary steps to accomplish before installing PySpark on your personal computer
| OS | Preliminary steps | Section |
| --- | --- | --- |
| Windows (plain) | Install 7-zip | |
| Windows (WSL) | Install WSL | |
| macOS | Install Homebrew | |
Spark is available as a GZIP archive (.tgz) file on the Apache website. By default, Windows doesn't provide a native way to extract those files. The most popular option is 7-zip[5]. Simply go to the website, download the program, and follow the installation instructions.
Windows Subsystem for Linux (WSL) is similar to a Linux virtual machine running on your Windows OS. It's easy to install and configure, and it integrates seamlessly with Windows. When using Windows, I use Spark via the WSL because I find it simpler to install.
The first step is to make sure the WSL flag is enabled in your Windows installation. Go to aka.ms/wslinstall and follow the instructions on the website. You'll be asked to reboot afterward.
Once your computer is rebooted, search for "Ubuntu" in the Windows Store. This will install Ubuntu as a WSL distribution. You can now follow the Linux/Ubuntu instructions to install PySpark!
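Once Ubuntu is installed, it can be useful to refresh its package index before following the Linux instructions, so that the apt commands later in this appendix find up-to-date packages. This step is optional; it's a small sketch of standard Ubuntu housekeeping, nothing specific to Spark.
sudo apt-get update    # refresh the list of available packages
sudo apt-get upgrade   # upgrade the packages that shipped with the fresh install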
Homebrew is a package manager for macOS. It provides a simple command line interface to install many popular software packages and keep them up to date. While you could follow the manual "download and install" steps described for Windows with little change, Homebrew simplifies our installation process to a few commands.
To install Homebrew, go to brew.sh and follow the installation instructions. You'll be able to interact with Homebrew through the brew command.
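If you want to confirm that Homebrew is ready to go, a quick sanity check looks like this (the version number you see will differ):
$ brew --version   # prints the installed Homebrew version
$ brew doctor      # reports potential problems with your setup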
Spark works best with Java 8. You might already have Java on your computer. On Windows, look for the "Java" and "JRE" keywords in the list of installed programs. You can also (on any OS) open a terminal and type the following.
java -version
Look for something like "version 1.8.0_XYZ". Newer versions of Java (which report themselves as 11 or 12 rather than 1.11 or 1.12) may work as well: the official documentation on the Spark website provides a compatibility matrix of supported Scala/JVM versions, and it evolves rather quickly.
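For reference, here is the general shape of the output on a machine with Java 8 installed; the exact build numbers (and the vendor line) will differ on your computer.
$ java -version
java version "1.8.0_222"
Java(TM) SE Runtime Environment (build 1.8.0_222-b10)
Java HotSpot(TM) 64-Bit Server VM (build 25.222-b10, mixed mode)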
The easiest way to install Java on Windows is to go to www.java.com and follow the download and installation instructions. Make sure to read each installer step to avoid installing bundled software you don't need!
With Homebrew installed, open a terminal and type the following.
$ brew cask install homebrew/cask-versions/adoptopenjdk8
The installer will prompt for your password during the installation process.
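Once the installer finishes, you can double-check that Java 8 is visible to the system. The second command is macOS-specific and prints where the Java 8 home directory lives; the exact path will vary from machine to machine.
$ java -version                   # should report a 1.8.0_XYZ version
$ /usr/libexec/java_home -v 1.8   # prints the path to the Java 8 installation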
Most GNU/Linux distributions provide a package manager. OpenJDK version 8 is available through the software repository.
`sudo apt-get install openjdk-8-jre`
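You can verify the installation the same way as before. If your distribution already had a newer Java installed, update-alternatives lets you choose which version the java command points to; this is a minimal sketch, and your list of alternatives will differ.
java -version                             # should report a 1.8.0_XYZ version
sudo update-alternatives --config java    # pick Java 8 if another version is the default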
Now that Java is installed, we can look at installing Spark.
Spark is available through the Apache project website (spark.apache.org). The instructions are pretty much identical for every OS except macOS, since Homebrew provides Spark as a package.
Go to the Apache website and download the latest Spark release. You shouldn't have to change the default options, but figure A.1 displays the ones I see when I navigate to the download page. Make sure to download the signatures and checksums if you want to validate the download (step 4 on the page).
Figure A.1. The options to download Spark

Tip
On WSL (and sometimes on Linux), you don't have a graphical user interface available. The easiest way to download Spark is to go to the download page on a machine with a browser, follow the download link, copy the URL of the nearest mirror, and paste it after the wget command.
wget [YOUR_PASTED_DOWNLOAD_URL]
If you want to learn to use the command line on Linux (and macOS) proficiently, a good free reference is The Linux Command Line by William Shotts[6]. It is also available in print and as an e-book (No Starch Press, 2019).
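If you chose to validate your download (step 4 on the download page), you can compare the checksum of the archive against the .sha512 file published on the Apache website. This is a small sketch: replace spark-[...].tgz with the name of the file you actually downloaded, and check that the two values match.
shasum -a 512 spark-[...].tgz      # compute the SHA-512 of the downloaded archive
cat spark-[...].tgz.sha512         # the checksum published by Apache, for comparison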
Once you have downloaded the file, extract it (using 7-zip on Windows). If you are using the command line, the following command will do the trick. Make sure you replace spark-[...].gz with the name of the file you just downloaded.
tar -xvzf spark-[...].gz
This will extract the contents of the archive into a directory. You can rename or move the directory to your liking; I'll keep it where it is for the remainder of this appendix.
If you are on GNU/Linux or WSL, you can skip to the next section. For a plain Windows installation, you'll also need to download a file called winutils.exe and set a few environment variables to prevent some cryptic Hadoop errors. Go to the github.com/cdarlint/winutils repository and download the winutils.exe file from the hadoop-2.7.X/bin directory, where X is the highest number available. At the time of writing, it is 2.7.7. Keep the repository's README.md handy.
Place winutils.exe in the bin directory of your Spark installation. Then set the environment variables listed in the winutils repository's README.md. To do so, open the Start menu and search for "Edit the system environment variables." Click the "Environment Variables" button (see figure A.2) and add them there.
Note
For the PATH variable, you might already have some values in there. If this is the case, double-click the variable and append %HADOOP_HOME%\bin to the list.
Figure A.2. Setting environment variables for Hadoop on Windows

Homebrew strikes again! Input the following command in a terminal.
$ brew install apache-spark
On Windows, Python doesn't come installed by default. macOS provides Python 2. Ubuntu provides no python command, preferring explicit python2 and python3 commands. We'll provide a surefire way to get Python 3 for Windows and macOS, and a guide to set up Python 3 with IPython on Ubuntu.
The easiest way to get Python 3 is to use the Anaconda distribution. Go to www.anaconda.com/distribution and follow the installation instructions, making sure you get the 64-bit graphical installer for Python 3.X for your OS.
Once Anaconda is installed, we can activate the Python 3 environment by typing the following in a command line.
$ conda activate base
This will prepend a little (base) to your shell prompt, meaning that you're now working within the base Anaconda Python environment.
Tip
On Windows, you'll need to use the "Anaconda Powershell Prompt" that you can find in your Start menu. By default, (base) will be selected for you.
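To make sure you're really getting Python 3 and IPython from Anaconda, you can check which interpreters the base environment resolves to. The exact version numbers depend on the Anaconda release you installed.
(base) $ python --version    # should report Python 3.X
(base) $ ipython --version   # IPython ships with the Anaconda distribution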
On Ubuntu, Python 3 is already provided; you just have to install IPython. Input the following command in a terminal.
sudo apt-get install ipython3
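As a quick check, both interpreters should now be available from the terminal:
python3 --version    # Ubuntu 18.04 ships Python 3.6 by default
ipython3 --version   # the IPython shell we'll use as the PySpark driver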
Launch the Anaconda Powershell Prompt from the Start menu and navigate to the bin directory of your Spark installation. (You can use the cd command, which stands for "change directory," to move around.)
Tip
If you aren't comfortable with the command line and PowerShell, a good resource is Learn Windows PowerShell in a Month of Lunches by Don Jones and Jeffery D. Hicks (Manning, 2016), which is how I personally learned to use it.
In order to use IPython as a front end for PySpark, you have to set the proper environment variable. Use the following code block in your Anaconda Powershell Prompt to set the variable and launch PySpark.
Set-Item Env:PYSPARK_DRIVER_PYTHON ipython
.\pyspark.cmd
Assuming that the terminal points to the bin/ folder of where Spark was unzipped, you just have to use the following command.
`PYSPARK_DRIVER_PYTHON=ipython3 PYSPARK_PYTHON=python3 pyspark`
PYSPARK_DRIVER_PYTHON will give you the shell flavor to use on the driver we interact with.
Note
If you launch PySpark and see Python version 2.7 instead of 3.X, it means you haven't run conda activate base before launching PySpark. exit the PySpark shell and have a look at section A.4.1, "Windows, macOS."
Since Ubuntu doesn't provide a plain python command anymore, we have to specify two environment variables to use PySpark with IPython and Python 3. Assuming that the terminal points to the bin/ folder of where Spark was unzipped, you just have to use the following command.
PYSPARK_DRIVER_PYTHON=ipython3 PYSPARK_PYTHON=python3 pyspark
PYSPARK_PYTHON tells Spark to use a specific version of Python on the cluster (even if we're on a local machine), while PYSPARK_DRIVER_PYTHON gives you the shell flavor to use on the driver we interact with.
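If you don't want to type those two environment variables every time, one option is to export them from your shell profile. The sketch below assumes a bash shell and that you moved the unzipped Spark directory to a folder called spark in your home directory; adjust the path to match where you actually placed it.
# Hypothetical additions to ~/.bashrc; the Spark path below is an assumption.
export PYSPARK_DRIVER_PYTHON=ipython3
export PYSPARK_PYTHON=python3
export PATH="$HOME/spark/bin:$PATH"   # lets you type pyspark from any directory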