17 Case study 4 solution

This section covers

  • Parsing text from HTML
  • Computing text similarities
  • Clustering and exploring large text datasets

We have downloaded thousands of job postings by using this book’s table of contents for case studies 1 through 4 as a search query (see the problem statement for details). Besides the downloaded postings, we also have two text files: resume.txt and table_of_contents.txt. The first file contains a resume draft, and the second contains the truncated table of contents used to query for job listings. Our goal is to extract common data science skills from the downloaded job postings and then compare these skills to our resume to determine which ones are missing. We will do so as follows:

  1. Parse all text from the downloaded HTML files.
  2. Explore the parsed output to learn how job skills are described in online postings. We’ll pay particular attention to whether certain HTML tags are more strongly associated with skill descriptions than others.
  3. Attempt to filter any irrelevant job postings from our dataset.
  4. Cluster job skills based on text similarity.
  5. Visualize the clusters using word clouds.
  6. Adjust clustering parameters, if necessary, to improve the visualized output.
  7. Compare the clustered skills to our resume to uncover missing skills.
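
To make the plan concrete, here is a minimal sketch of steps 1, 2, and 7 using only Python's standard library. It rests on several assumptions that are not part of the case study itself: that skills tend to appear in `<li>` bullets (which is exactly what step 2 is meant to check), and that Jaccard similarity over word sets is an acceptable stand-in for whatever text-similarity measure the analysis ends up using. The names `PostingParser`, `tokenize`, and `jaccard`, along with the sample HTML and resume text, are hypothetical and serve only as illustration.

```python
import re
from html.parser import HTMLParser


class PostingParser(HTMLParser):
    """Collects the <title> text and the text of every <li> bullet."""

    TRACKED = {"title", "li"}  # assumption: skills live in bullet points

    def __init__(self):
        super().__init__()
        self.title = ""
        self.bullets = []
        self._current = None  # tag currently being captured, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.TRACKED:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        # Text inside nested tags (such as <b>) is captured as well.
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            text = " ".join("".join(self._buffer).split())
            if tag == "title":
                self.title = text
            elif text:
                self.bullets.append(text)
            self._current = None


def tokenize(text):
    """Lowercase the text and return its set of word-like tokens."""
    return set(re.findall(r"[a-z0-9+#]+", text.lower()))


def jaccard(text_a, text_b):
    """Jaccard similarity: shared tokens divided by total distinct tokens."""
    words_a, words_b = tokenize(text_a), tokenize(text_b)
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


# Hypothetical posting and resume text, used only for illustration.
sample_html = (
    "<html><head><title>Data Scientist</title></head><body>"
    "<h2>Requirements</h2><ul>"
    "<li>Experience with <b>Python</b> and SQL</li>"
    "<li>Strong communication skills</li>"
    "</ul></body></html>"
)
sample_resume = "Built Python and SQL data pipelines"

parser = PostingParser()
parser.feed(sample_html)
parser.close()

# Score each bullet against the resume; low scores hint at missing skills.
scores = {bullet: jaccard(bullet, sample_resume) for bullet in parser.bullets}
for bullet, score in scores.items():
    print(f"{score:.3f}  {bullet}")
```

In this toy example, the bullet mentioning Python and SQL overlaps with the resume, while the communication bullet does not overlap at all. A real analysis would need to handle stop words, many postings, and clustering (steps 3 through 6), none of which this sketch attempts.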

17.1 Extracting skill requirements from job posting data

17.1.1 Exploring the HTML for skill descriptions

17.2 Filtering jobs by relevance

17.3 Clustering skills in relevant job postings

17.3.1 Grouping the job skills into 15 clusters

17.3.2 Investigating the technical skill clusters

17.3.3 Investigating the soft-skill clusters

17.3.4 Exploring clusters at alternative values of K

17.3.5 Analyzing the 700 most relevant postings

17.4 Conclusion

Summary