After reading lesson 29, you’ll be able to
- Take two files as input and determine how similar they are
- Write organized code by using functions
- Understand how to work with dictionaries and lists in a real-life setting
How similar are two sentences? Paragraphs? Essays? You can write a program incorporating dictionaries and lists to calculate the similarity of two pieces of work. If you’re a teacher, you could use this to check for similarity between essay submissions. If you’re making changes to your own documents, you can use this program as a sort of version control, comparing versions of your documents to see which revisions introduced major changes.
The problem
You’re given two files containing text. Using the names of the files, write a program that reads the documents and uses a metric to determine how similar they are. Documents that are exactly the same should get a score of 1, and documents that don’t have any words in common should get a score of 0.
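As a starting point, reading a document given its filename might look like the following minimal sketch. The function name `read_text` and the UTF-8 encoding are assumptions made for illustration; the problem statement doesn't fix either.

```python
def read_text(filename):
    """Return the full contents of a text file as one string."""
    # Assumption: the files are plain text encoded as UTF-8.
    with open(filename, encoding="utf-8") as f:
        return f.read()
```

Keeping the file reading in its own function means the rest of the program can work on plain strings, which are easier to test than files.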
Given this problem description, you need to decide a few things:
- Do you count punctuation from the files or only words?
- Do you care about the ordering of the words in the files? If two files have the same words but in a different order, do they still count as the same?
- What metric do you use to assign a numerical value to the similarity?
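One possible set of answers is sketched below: ignore punctuation and letter case, ignore word order, and score the documents by how much their word counts differ. The helper names (`get_words`, `count_words`, `similarity`) and the formula itself are illustrative assumptions, not the only reasonable choices.

```python
import string

def get_words(text):
    """Return a list of lowercase words with punctuation removed."""
    # Assumption: punctuation and letter case don't matter.
    # Only ASCII punctuation is stripped; curly quotes are left alone.
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return no_punct.lower().split()

def count_words(words):
    """Map each word to the number of times it appears."""
    # Assumption: word order doesn't matter, only how often a word occurs.
    freq = {}
    for word in words:
        freq[word] = freq.get(word, 0) + 1
    return freq

def similarity(freq1, freq2):
    """Score two word-count dictionaries from 0 (disjoint) to 1 (identical)."""
    total = sum(freq1.values()) + sum(freq2.values())
    if total == 0:
        # Assumption: two empty documents count as identical.
        return 1.0
    diff = 0
    for word in set(freq1) | set(freq2):
        diff += abs(freq1.get(word, 0) - freq2.get(word, 0))
    return 1 - diff / total

same = similarity(count_words(get_words("The cat sat.")),
                  count_words(get_words("sat the CAT")))
none = similarity(count_words(get_words("red green")),
                  count_words(get_words("blue yellow")))
print(same, none)
```

With these choices, two documents that contain the same words in a different order score 1, and two documents with no words in common score 0, which matches the requirements in the problem statement. A different answer to any of the three questions would lead to a different program.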