We have downloaded thousands of job postings by using this book’s table of contents for case studies 1 through 4 as a search query (see the problem statement for details). Besides the downloaded postings, we also have two text files at our disposal: resume.txt and table_of_contents.txt. The first file contains a resume draft, and the second contains the truncated table of contents that was used to query for the job listings. Our goal is to extract common data science skills from the downloaded job postings. Then we’ll compare these skills to our resume to determine which skills are missing. We will do so as follows:
- Parse all text from the downloaded HTML files.
- Explore the parsed output to learn how job skills are described in online postings. We’ll pay particular attention to whether certain HTML tags are more strongly associated with skill descriptions.
- Attempt to filter out any irrelevant job postings from our dataset.
- Cluster job skills based on text similarity.
- Visualize the clusters using word clouds.
- Adjust clustering parameters, if necessary, to improve the visualized output.
- Compare the clustered skills to our resume to uncover missing skills.
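
To make the plan more concrete, here is a minimal sketch of the first and last steps only: parsing text out of one posting and checking it against a resume. It uses Python's standard-library `html.parser` and rests on several assumptions that the analysis itself still has to test: that skills tend to appear inside `<li>` bullet tags, and that a naive word-set difference can stand in for the clustering-based comparison. The names `PostingParser`, `tokenize`, `missing_terms`, and `STOP_WORDS`, as well as the sample HTML and resume strings, are hypothetical and chosen only for illustration.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stop-word list (an assumption; a real analysis needs a fuller one).
STOP_WORDS = {"a", "and", "for", "in", "of", "the", "to", "with"}


class PostingParser(HTMLParser):
    """Collect the <title> text and the text of each top-level <li> bullet."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.bullets = []
        self._in_title = False
        self._li_depth = 0
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "li":
            self._li_depth += 1
            if self._li_depth == 1:
                self._buffer = []  # start collecting a new bullet

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "li" and self._li_depth > 0:
            self._li_depth -= 1
            if self._li_depth == 0:
                # Join the collected fragments and normalize whitespace.
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.bullets.append(text)

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._li_depth > 0:
            self._buffer.append(data)


def tokenize(text):
    """Return the set of lowercase words in text, minus stop words."""
    return set(re.findall(r"[a-z][a-z+#]*", text.lower())) - STOP_WORDS


def missing_terms(bullets, resume_text):
    """Return words that occur in the bullets but not in the resume."""
    resume_words = tokenize(resume_text)
    missing = set()
    for bullet in bullets:
        missing |= tokenize(bullet) - resume_words
    return missing


# Hypothetical stand-ins for one downloaded posting and for resume.txt.
sample_html = """
<html><head><title>Data Scientist - Example Corp</title></head>
<body><h2>Requirements</h2>
<ul>
  <li>Experience with <b>Python</b> and SQL</li>
  <li>Knowledge of clustering algorithms</li>
</ul></body></html>
"""
sample_resume = "Experience with Python and statistics"

parser = PostingParser()
parser.feed(sample_html)
missing = sorted(missing_terms(parser.bullets, sample_resume))

print(parser.title)
print(parser.bullets)
print(missing)
```

A word-level difference like this also surfaces generic terms such as "knowledge", which is one reason the plan above groups similar skill descriptions into clusters and inspects them visually before drawing conclusions about the resume.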