Chapter 3. Submitting and scaling your first PySpark program


This chapter covers:

  • Summarizing data using groupby and a simple aggregate function
  • Ordering results for display
  • Writing data from a data frame
  • Using spark-submit to launch your program in batch mode
  • Simplifying your PySpark code using method chaining
  • Scaling your program almost for free!

Chapter 2 dealt with all the data preparation work for our word frequency program: we read the input data, tokenized each word, and cleaned our records to keep only lowercase words. If we bring back our outline, only Steps 4 and 5 remain (a quick sketch of the code so far follows the outline).

  1. [DONE] Read: Read the input data (we’re assuming a plain text file)
  2. [DONE] Token: Tokenize each word
  3. [DONE] Clean: Remove any punctuation and/or tokens that aren’t words
  4. Count: Count the frequency of each word present in the text
  5. Answer: Return the top 10 (or 20, 50, 100)
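
As a point of reference, here is a minimal sketch of where chapter 2 left us. The input path and the intermediate variable names (book, lines, words, words_nonull) are illustrative assumptions; your own column names and cleaning rules may differ.

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    # Step 1: read the input data (a plain text file, one record per line).
    # The path is a hypothetical example.
    book = spark.read.text("./data/gutenberg_books/1342-0.txt")

    # Step 2: tokenize each line into words.
    lines = book.select(F.split(F.col("value"), " ").alias("line"))
    words = lines.select(F.explode(F.col("line")).alias("word"))

    # Step 3: lowercase each token, strip punctuation, drop empty tokens.
    words_lower = words.select(F.lower(F.col("word")).alias("word_lower"))
    words_clean = words_lower.select(
        F.regexp_extract(F.col("word_lower"), "[a-z]+", 0).alias("word")
    )
    words_nonull = words_clean.where(F.col("word") != "")

The sketches in the rest of this chapter assume these imports, the spark session, and the words_nonull data frame.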

After tackling those last two steps, we’ll package our code into a single file so we can submit it to Spark without launching a shell. We’ll then review the completed program and simplify it by removing intermediate variables through method chaining. We’ll finish by scaling our program to accommodate more data sources.

3.1  Grouping records: counting word frequencies
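
In a nutshell, this step groups the cleaned words and counts the records in each group. A minimal sketch, assuming the words_nonull data frame from the recap above:

    # Group the records by word: this returns a GroupedData object,
    # not a data frame.
    groups = words_nonull.groupby(F.col("word"))

    # Applying an aggregate function such as count() brings us back
    # to a data frame we can display or write.
    results = groups.count()
    results.show()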

3.2  Ordering the results on the screen using orderBy
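
To display the most frequent words first, we can sort the results in descending order of the count column. A small sketch, assuming the results data frame from the previous section:

    # Show the 10 most frequent words.
    results.orderBy(F.col("count").desc()).show(10)

    # An equivalent spelling using the ascending keyword:
    results.orderBy("count", ascending=False).show(10)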

3.3  Writing data from a data frame
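
Once computed, the results can be written to disk through the data frame’s write attribute. A hedged sketch, assuming a CSV output and a hypothetical output path:

    # coalesce(1) merges the data into a single partition so Spark
    # produces a single CSV file inside the output directory.
    results.coalesce(1).write.csv("./simple_count_single_partition.csv")

By default, Spark writes one file per partition inside the target directory; without coalesce(1), you would get several part files.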

3.4  Putting it all together: counting word frequencies

3.4.1  Simplifying your dependencies with PySpark’s import conventions
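
A common PySpark convention is to import the functions module under a short qualified name rather than star-importing it, so that names such as split() and count() don’t shadow Python built-ins. A typical preamble looks like this:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F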

3.4.2  Simplifying our program via method chaining
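
Because each transformation returns a new data frame, the intermediate variables (lines, words, words_clean, and so on) can be dropped and the steps chained into a single expression. A sketch of what the chained pipeline can look like, reusing the assumed input path from the recap:

    results = (
        spark.read.text("./data/gutenberg_books/1342-0.txt")
        .select(F.split(F.col("value"), " ").alias("line"))
        .select(F.explode(F.col("line")).alias("word"))
        .select(F.lower(F.col("word")).alias("word"))
        .select(F.regexp_extract(F.col("word"), "[a-z]+", 0).alias("word"))
        .where(F.col("word") != "")
        .groupby("word")
        .count()
    )

Wrapping the chain in parentheses lets us put each method on its own line without backslashes, which keeps the pipeline readable.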

3.5  Your first non-interactive program: using spark-submit

3.5.1  Creating your own SparkSession
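
Outside the pyspark shell, no spark variable is created for us, so a batch program has to build its own entry point. A minimal sketch using the builder pattern (the application name is an illustrative choice):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("Counting word occurrences from a book")
        .getOrCreate()
    )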

3.6  Using spark-submit to launch your program in batch mode
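
With the program saved as a self-contained script, we can hand it to the spark-submit launcher from the command line. The file name below is an assumption; --master "local[*]" simply asks Spark to run locally using all available cores:

    spark-submit --master "local[*]" ./word_count_submit.py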

3.7  What didn’t happen in this chapter

3.8  Scaling up our word frequency program
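
Because nothing in our code hard-codes the size of the data, scaling up mostly means pointing the reader at more input. spark.read.text() accepts glob patterns, so a hedged sketch of the only change needed (the directory name is an assumption) is:

    # Same pipeline as before; only the input path changes to read
    # every text file in the directory instead of a single book.
    book = spark.read.text("./data/gutenberg_books/*.txt")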

3.9  Summary

3.10  Exercises