Chapter 14. Simil: an algorithm to look for similar strings

 

Tom van Stiphout

Are you a perfect speller? Is everyone in your company? How about your business partners? Misspellings are a fact of life. There are also legitimate differences in spelling: what Americans call rumors, the British call rumours. Steven A. Ballmer and Steve Ballmer are two different but accurate forms of that man’s name. Your database may contain a lot of legacy values from the days before better validation at the point of data entry.

Overall, chances are your database already contains imperfect textual data, which makes it hard to search. Additionally, the user may not know exactly what to look for. When looking for a number or a date, we could search for a range, but text is more unstructured, so database engines such as SQL Server include a range of tools to find text, including the following:

  • EQUALS (=) and LIKE
  • SOUNDEX and DIFFERENCE
  • CONTAINS and FREETEXT
  • Simil

Equals and LIKE search for equality with or without wildcards. SOUNDEX uses a phonetic algorithm based on the sound of the consonants in a string. CONTAINS is optimized for finding inflectional forms and synonyms of strings.

Simil is an algorithm that compares two strings, and based on the longest common substrings, computes a similarity between 0 (completely different) and 1 (identical). This is sometimes called fuzzy string matching. Simil isn’t available by default. Later in this chapter we’ll discuss how to install it.

Equals (=) and LIKE

SOUNDEX and DIFFERENCE

CONTAINS and FREETEXT

Simil

Summary

About the author

sitemap