Chapter 3
This chapter covers
- Summarizing data using groupby and a simple aggregate function
- Ordering results for display
- Writing data from a data frame
- Using spark-submit to launch your program in batch mode
- Simplifying the writing of your PySpark program using method chaining
- Scaling your program to process multiple files at once by changing a single line of code
Chapter 2 dealt with all the data preparation work for our word frequency program: we read the input data, tokenized each word, and cleaned our records to keep only lower-case words. If we bring out our outline, we have only Steps 4 and 5 left to complete; a quick sketch of both follows the outline.
- [DONE] Read: Read the input data (we’re assuming a plain text file)
- [DONE] Token: Tokenize each word
- [DONE] Clean: Remove any punctuation and/or tokens that aren’t words. Lower-case each word.
- Count: Count the frequency of each word present in the text
- Answer: Return the top 10 (or 20, 50, 100)
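To make the two remaining steps concrete, here is a minimal sketch of how they might look in PySpark. The data frame name `words_nonull` and its `word` column are assumptions standing in for the cleaned data frame chapter 2 produced (the tiny inline data frame here is just a stand-in so the snippet runs on its own); the groupby/count/orderBy pattern is the one this chapter develops.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("Word count sketch").getOrCreate()

# Hypothetical stand-in for the cleaned data frame from chapter 2:
# one string column, `word`, containing lower-case tokens.
words_nonull = spark.createDataFrame(
    [["simple"], ["example"], ["simple"], ["simple"]], ["word"]
)

# Step 4: count the frequency of each word.
results = words_nonull.groupby("word").count()

# Step 5: order by descending frequency and return the top 10.
results.orderBy(F.col("count").desc()).show(10)
```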
After tackling those last two steps, we'll package our code in a single file so we can submit it to Spark without launching a REPL. We'll then review the completed program and simplify it by removing intermediate variables through method chaining. We'll finish by scaling the program to accommodate more data sources.
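As a preview of where the chapter ends up, here is a sketch of what such a single-file, method-chained program could look like. The file name `word_count_submit.py` and the input path are placeholders rather than the book's exact code; the chain mirrors the tokenize-and-clean steps from chapter 2 followed by the count-and-order steps sketched above.

```python
# word_count_submit.py -- a sketch of the chapter's end state, assuming an
# illustrative input path. One chained expression replaces the intermediate
# variables used while exploring in the REPL.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("Word count").getOrCreate()

results = (
    spark.read.text("./data/pride-and-prejudice.txt")
    # A glob pattern such as "./data/*.txt" on the line above is the
    # single-line change that scales the program to a directory of files.
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word"))
    .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word"))
    .where(F.col("word") != "")
    .groupby("word")
    .count()
    .orderBy(F.col("count").desc())
)

results.show(10)

# Launched in batch mode from a shell, no REPL required:
#   spark-submit word_count_submit.py
```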