Chapter 3
This chapter covers
- Summarizing data using groupby and a simple aggregate function
- Ordering results for display
- Writing data from a data frame
- Using spark-submit to launch your program in batch mode
- Simplifying the writing of your PySpark program using method chaining
- Scaling your program to process multiple files at once by changing a single line of code
Chapter 2 dealt with all the data preparation work for our word frequency program: we read the input data, tokenized each word, and cleaned our records to keep only lower-case words. If we bring out our outline, we have only Steps 4 and 5 left to complete; a quick sketch of both follows the outline.
- [DONE] Read: Read the input data (we’re assuming a plain text file)
- [DONE] Token: Tokenize each word
- [DONE] Clean: Remove any punctuation and/or tokens that aren’t words. Lower-case each word.
- Count: Count the frequency of each word present in the text
- Answer: Return the top 10 (or 20, 50, 100)
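To make the two remaining steps concrete, here is a minimal sketch of how they might look in PySpark. The data frame name `words_nonull` and its `word` column are assumptions standing in for the cleaned data frame chapter 2 produced (the tiny inline data frame here is just a stand-in so the snippet runs on its own); the groupby/count/orderBy pattern is the one this chapter develops.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("Word count sketch").getOrCreate()

# Hypothetical stand-in for the cleaned data frame from chapter 2:
# one string column, `word`, containing lower-case tokens.
words_nonull = spark.createDataFrame(
    [["simple"], ["example"], ["simple"], ["simple"]], ["word"]
)

# Step 4: count the frequency of each word.
results = words_nonull.groupby("word").count()

# Step 5: order by descending frequency and return the top 10.
results.orderBy(F.col("count").desc()).show(10)
```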
After tackling those last two steps, we'll package our code in a single file so we can submit it to Spark without launching a REPL. We'll then review the completed program and simplify it by removing intermediate variables through method chaining. We'll finish by scaling the program to accommodate more data sources.
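As a preview of where the chapter ends up, here is a sketch of what such a single-file, method-chained program could look like. The file name `word_count_submit.py` and the input path are placeholders rather than the book's exact code; the chain mirrors the tokenize-and-clean steps from chapter 2 followed by the count-and-order steps sketched above.

```python
# word_count_submit.py -- a sketch of the chapter's end state, assuming an
# illustrative input path. One chained expression replaces the intermediate
# variables used while exploring in the REPL.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("Word count").getOrCreate()

results = (
    spark.read.text("./data/pride-and-prejudice.txt")
    # A glob pattern such as "./data/*.txt" on the line above is the
    # single-line change that scales the program to a directory of files.
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word"))
    .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word"))
    .where(F.col("word") != "")
    .groupby("word")
    .count()
    .orderBy(F.col("count").desc())
)

results.show(10)

# Launched in batch mode from a shell, no REPL required:
#   spark-submit word_count_submit.py
```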