Chapter 3
This chapter covers:
- Summarizing data using groupby and a simple aggregate function
- Ordering results for display
- Writing data from a data frame
- Using spark-submit to launch your program in batch mode
- Simplifying the writing of your PySpark program using method chaining
- Scaling your program almost for free!
Chapter 2 dealt with all the data preparation work for our word frequency program: we read the input data, tokenized each word, and cleaned our records to keep only lower-case words. If we bring out our outline, we only have steps 4 and 5 left to complete; a quick sketch of both follows the outline.
- [DONE] Read: Read the input data (we’re assuming a plain text file)
- [DONE] Token: Tokenize each word
- [DONE] Clean: Remove any punctuation and/or tokens that aren’t words
- Count: Count the frequency of each word present in the text
- Answer: Return the top 10 (or 20, 50, 100)
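To make those two steps concrete before we dive in, here is a minimal sketch of what they might look like in PySpark. The data frame words_clean and its single word column are illustrative stand-ins for the cleaned data frame our program has built so far, not the chapter’s actual code:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word_count_sketch").getOrCreate()

# Illustrative stand-in for the cleaned data frame from chapter 2:
# one lower-case word per record.
words_clean = spark.createDataFrame(
    [["a"], ["simple"], ["sketch"], ["a"]], ["word"]
)

# Step 4 (Count): group identical words, then count each group.
results = words_clean.groupby("word").count()

# Step 5 (Answer): order by descending frequency and show the top 10.
results.orderBy("count", ascending=False).show(10)
```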
After tackling those last two steps, we’ll look at packaging our code in a single file so we can submit it to Spark without having to launch a shell. We’ll then review the completed program and simplify it by removing intermediate variables through method chaining. We’ll finish by scaling our program to accommodate more data sources.
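As a small taste of what’s ahead: once the program lives in a single file, say word_count.py (a hypothetical name), launching it in batch mode is as simple as spark-submit word_count.py. And with method chaining, the counting and ordering steps from the sketch above can collapse into a single expression:

```python
# A sketch of the method-chained style this chapter works toward:
# each transformation feeds directly into the next, so the
# intermediate `results` variable disappears. `words_clean` is the
# illustrative data frame from the previous sketch.
top_words = (
    words_clean.groupby("word")
    .count()
    .orderBy("count", ascending=False)
    .limit(10)
)

# In batch mode there is no shell to display results, so we would
# typically write them to disk instead (the path is illustrative).
top_words.write.csv("./top_words.csv")
```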