concept similarity in category search

appears as: similarity, similarities, A similarity
AI-Powered Search MEAP V06

This is an excerpt from Manning's book AI-Powered Search MEAP V06.

With a word embedding vector now available for each term sequence in the left-most column of Figure 2.11, we can score the relationship between each pair of term sequences by leveraging the similarity between their vectors. From linear algebra, we'll use the cosine similarity function to score the relationship between two vectors; it is computed by taking the dot product of the two vectors and dividing it by the product of their magnitudes (lengths). We'll visit the math in more detail in future chapters, but for now, Figure 2.12 shows the results of scoring the similarity between several of these vectors.
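
To make the computation concrete, here is a minimal Python sketch of that cosine similarity calculation; the three-dimensional vectors are invented for illustration (real word embeddings typically have hundreds of dimensions):

    import math

    def cosine_similarity(a, b):
        # Dot product of the two vectors...
        dot = sum(x * y for x, y in zip(a, b))
        # ...divided by the product of their magnitudes (lengths)
        magnitude_a = math.sqrt(sum(x * x for x in a))
        magnitude_b = math.sqrt(sum(x * x for x in b))
        return dot / (magnitude_a * magnitude_b)

    # Hypothetical embeddings, not the book's actual values
    green_tea = [0.9, 0.1, 0.8]
    cheese_pizza = [0.1, 0.9, 0.2]
    print(cosine_similarity(green_tea, cheese_pizza))  # low score: unrelated items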

Figure 2.12. Similarity between Word Embeddings. The dot product between vectors shows the list of items sorted by similarity with "green tea", with "cheese pizza", and with "donut".

As you can see in Figure 2.12, since each term sequence is now encoded into a vector that represents its meaning in terms of higher-level features, this vector (or word embedding) can be used to score the similarity of that term sequence with any other term sequence by comparing their vectors. You'll see three vector similarity lists at the bottom of the figure: one for "green tea", one for "cheese pizza", and one for "donut".
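
The following sketch shows how such ranked lists could be produced: score one term's embedding against every other embedding and sort by cosine similarity. All vectors and item names here are invented for illustration:

    import math

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(x * x for x in b)))

    # Hypothetical embeddings for a handful of items
    embeddings = {
        "green tea": [0.9, 0.1, 0.8],
        "cheese pizza": [0.1, 0.9, 0.2],
        "donut": [0.2, 0.8, 0.1],
        "latte": [0.7, 0.2, 0.6],
    }

    def rank_by_similarity(query_term):
        query = embeddings[query_term]
        scored = [(term, cosine(query, vector))
                  for term, vector in embeddings.items() if term != query_term]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)

    print(rank_by_similarity("green tea"))  # "latte" outranks "cheese pizza"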

In Chapter 2, we demonstrated the idea of measuring the similarity of two vectors by calculating the cosine of the angle between them. We created vectors (lists of numbers, where each number represents the strength of some feature) representing different food items, and we then calculated the cosine of the angle between the vectors to determine their similarity. We'll expand upon that technique in this section, discussing how text queries and documents can be mapped into vectors for ranking purposes. We'll then get into some popular text-based feature weighting techniques and how they can be combined to create an improved relevance ranking formula.
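
As a bare-bones illustration of mapping text to vectors, the sketch below turns a query and a document into term-frequency vectors and compares them with cosine similarity; the feature weighting techniques discussed in this section (such as TF × IDF) refine raw counts like these. The texts are made up:

    import math
    from collections import Counter

    def to_vector(text):
        # The simplest possible mapping: one dimension per term, weighted by count
        return Counter(text.lower().split())

    def cosine(v1, v2):
        dot = sum(v1[term] * v2[term] for term in v1)  # missing terms count as 0
        mag1 = math.sqrt(sum(c * c for c in v1.values()))
        mag2 = math.sqrt(sum(c * c for c in v2.values()))
        return dot / (mag1 * mag2) if mag1 and mag2 else 0.0

    query = to_vector("green tea")
    document = to_vector("a refreshing cup of green tea with jasmine")
    print(cosine(query, document))  # 0.5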

Relevant Search: With applications for Solr and Elasticsearch

This is an excerpt from Manning's book Relevant Search: With applications for Solr and Elasticsearch.

You can infer something about the similarity of two pieces of fruit by computing the dot product of their vectors. In the fruit example, this means (1) multiplying the juiciness values of the two fruits together, (2) multiplying their sizes together, and (3) summing the results. It turns out that the more properties the fruits share, the higher the dot product.
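
Worked as arithmetic with invented property values, the calculation looks like this:

    # Each fruit as a (juiciness, size) vector; the numbers are illustrative
    watermelon = (0.9, 1.0)
    apple = (0.7, 0.3)
    raisin = (0.05, 0.01)

    def dot(a, b):
        # Multiply matching properties together, then sum the results
        return sum(x * y for x, y in zip(a, b))

    print(dot(watermelon, apple))   # 0.93 -- more shared properties, higher score
    print(dot(watermelon, raisin))  # 0.055 -- little in common, lower score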

Will BM25 help your relevance? It’s not that simple. As we discussed in chapter 1, information retrieval focuses heavily on broad, incremental improvements to article-length pieces of text. BM25 may not matter for your specific definition of relevance. For this reason, we intentionally eschew the additional complexity of BM25 in this book. Lucene won’t be deprecating classic TF × IDF at all; instead, it will become known as the classic similarity. Don’t be shy about experimenting with both. As for this book’s examples, you can re-create the scoring in future Elasticsearch versions by changing the similarity back to the classic similarity. Finally, every lesson you learn from this book applies, regardless of whether you choose BM25 or classic TF × IDF.
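
As a hypothetical sketch of that switch, the request below sets a field's similarity back to classic TF × IDF in an index mapping. It assumes an Elasticsearch 6.x cluster (the classic similarity was deprecated in 6.x and removed in 7.0), and the index and field names are made up:

    import requests

    mapping = {
        "mappings": {
            "_doc": {  # 6.x mappings still carry a type name
                "properties": {
                    "title": {"type": "text", "similarity": "classic"}
                }
            }
        }
    }
    requests.put("http://localhost:9200/movies", json=mapping).raise_for_status()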
