Lesson 29. Capstone project: document similarity

 

After reading lesson 29, you’ll be able to

  • Take as input two files and determine their similarity
  • Write organized code by using functions
  • Understand how to work with dictionaries and lists in a real-life setting

How similar are two sentences? Paragraphs? Essays? You can write a program that uses dictionaries and lists to calculate the similarity of two pieces of work. If you’re a teacher, you could use it to check for similarity between essay submissions. If you’re making changes to your own documents, you can use it as a rough form of version control, comparing versions of a document to see how much changed between them.

The problem

You’re given two files containing text. Write a program that takes the names of the two files, reads the documents, and uses a metric to determine how similar they are. Documents that are exactly the same should get a score of 1, and documents that don’t have any words in common should get a score of 0.
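To make the target behavior concrete, here is a minimal sketch of the shape such a program could take. It is only one possibility, not the solution this lesson builds: the function names `read_words` and `file_similarity` are made up for illustration, and the metric (shared distinct words divided by all distinct words) is an assumption, chosen because it gives 1 for identical documents and 0 for documents with no words in common.

```python
def read_words(filename):
    """Return the set of distinct whitespace-separated words in a file."""
    with open(filename, encoding="utf-8") as f:
        # Lowercase so that "The" and "the" count as the same word.
        return set(f.read().lower().split())


def file_similarity(filename1, filename2):
    """Score two files from 0 (no shared words) to 1 (same set of words)."""
    words1 = read_words(filename1)
    words2 = read_words(filename2)
    all_words = words1 | words2
    if not all_words:
        # Assumption: two empty files are treated as identical.
        return 1.0
    # Fraction of all distinct words that appear in both files.
    return len(words1 & words2) / len(all_words)
```

A call such as `file_similarity("essay1.txt", "essay2.txt")` (hypothetical filenames) would then return a number between 0 and 1. This sketch ignores how often each word occurs and leaves punctuation attached to words.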

Given this problem description, you need to decide a few things:

  • Do you count punctuation from the files or only words?
  • Do you care about the ordering of the words in the files? If two files have the same words but in a different order, do they still count as the same document?
  • What metric do you use to assign a numerical value to the similarity?
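One way to see how these decisions play out is to pick an answer to each and write it down. The sketch below assumes that punctuation is dropped, case is ignored, word order doesn’t matter, and the score is based on how much the word counts of the two texts differ. The helper names (`get_words`, `count_words`, `similarity`) and the metric itself are illustrative assumptions, not necessarily the choices made later in this lesson.

```python
import string


def get_words(text):
    """Lowercase the text, drop ASCII punctuation, and split it into words."""
    remove_punct = str.maketrans("", "", string.punctuation)
    return text.lower().translate(remove_punct).split()


def count_words(words):
    """Map each word to the number of times it occurs."""
    freq = {}
    for word in words:
        freq[word] = freq.get(word, 0) + 1
    return freq


def similarity(text1, text2):
    """Return 1 for identical word counts and 0 for no words in common."""
    freq1 = count_words(get_words(text1))
    freq2 = count_words(get_words(text2))
    total = sum(freq1.values()) + sum(freq2.values())
    if total == 0:
        # Assumption: two texts with no words are treated as identical.
        return 1.0
    diff = 0
    # Because only counts are compared, word order has no effect.
    for word in set(freq1) | set(freq2):
        diff += abs(freq1.get(word, 0) - freq2.get(word, 0))
    return 1 - diff / total
```

With these choices, `similarity("Hello, world!", "world hello")` returns 1.0, because the two texts differ only in punctuation, capitalization, and word order. Answering the questions differently (for example, keeping punctuation) would change that result.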

29.1. Breaking the problem into tasks

 

29.2. Reading file information

 
 
 

29.3. Saving all words from the file

 
 
 

29.4. Mapping words to their frequency

 
 

29.5. Comparing two documents by using a similarity score

 
 

29.6. Putting it all together

 
 
 
 

29.7. One possible extension

 
 

Summary

 
 
 