Chapter 2 dealt with all the data preparation work for our word frequency program. We read the input data, tokenized each word, and cleaned our records to only keep lowercase words. If we bring out our outline, we only have steps 4 and 5 to complete:
- [DONE] Read: Read the input data (we’re assuming a plain text file).
- [DONE] Token: Tokenize each word.
- [DONE] Clean: Remove any punctuation and/or tokens that aren’t words. Lowercase each word.
- Count: Count the frequency of each word present in the text.
- Answer: Return the top 10 (or 20, 50, 100).
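The two remaining steps boil down to a group-and-count followed by a sort. As a minimal sketch of that logic in plain Python (using `collections.Counter` rather than Spark’s DataFrame API, with a made-up word list standing in for our cleaned records), steps 4 and 5 look like this:

```python
from collections import Counter

# Stand-in for the output of the cleaning step: a list of lowercase words.
words = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]

# Step 4 (Count): tally the frequency of each word.
frequencies = Counter(words)

# Step 5 (Answer): return the N most frequent words, highest count first.
top_two = frequencies.most_common(2)
print(top_two)  # [('the', 3), ('cat', 2)]
```

In the chapter itself, the same two steps are expressed against a Spark data frame, which lets the counting scale beyond what fits in one machine’s memory.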
After tackling those last two steps, we look at packaging our code in a single file so we can submit it to Spark without having to launch a REPL. We then review our completed program and simplify it by removing intermediate variables. We finish by scaling our program to accommodate more data sources.