3 Ranking and content-based relevance

This chapter covers:

Executing queries and returning matching search results
Ranking search results based upon how relevant they are to an incoming query
Controlling and specifying your own ranking functions with function queries
Tailoring ranking functions to a specific domain

Search engines fundamentally do three things: ingest content, return content matching incoming queries, and sort the returned content based upon some measure of how well it matches the query. Relevance is the term used to describe this notion of "how well the content matches the query". In most cases the matched content consists of documents, and the search engine returns those documents in ranked order along with metadata describing them.

In most search engines, the default relevance sorting is based upon a score indicating how well each keyword in a query matches the same keyword in each document, with the best matches receiving the highest relevance scores and appearing at the top of the search results. The relevance calculation is highly configurable, however, and can be adjusted on a per-query basis to enable very sophisticated ranking behavior.
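To make the idea of keyword-based scoring concrete, here is a minimal sketch in Python. It is a deliberately naive illustration, not how any production search engine computes relevance: the `score` and `search` functions and the sample documents are hypothetical, and the score is simply a count of how often each query keyword occurs in a document. The sections that follow replace this raw count with better-balanced measures.

```python
from collections import Counter


def score(query: str, document: str) -> int:
    """Toy relevance score: total occurrences of each query keyword in the document."""
    doc_terms = Counter(document.lower().split())
    return sum(doc_terms[term] for term in query.lower().split())


def search(query: str, documents: list[str]) -> list[tuple[int, str]]:
    """Return (score, document) pairs for matching documents, best match first."""
    scored = [(score(query, doc), doc) for doc in documents]
    # Matching: keep only documents containing at least one query keyword.
    matches = [(s, doc) for s, doc in scored if s > 0]
    # Ranking: sort the matches by descending score.
    return sorted(matches, key=lambda pair: pair[0], reverse=True)


# Hypothetical sample documents, for illustration only.
documents = [
    "the cat sat on the mat",
    "the cat chased the other cat",
    "dogs bark loudly",
]

results = search("cat", documents)
for relevance, doc in results:
    print(relevance, doc)
```

Even in this toy version, matching (deciding which documents are returned) and ranking (deciding their order) are separate steps, a distinction this chapter returns to later.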

3.1 Scoring query and document vectors with cosine similarity

3.1.1 Mapping text to vectors

3.1.2 Calculating similarity between dense vector representations

3.1.3 Calculating similarity between sparse vector representations

3.1.4 Term Frequency (TF): measuring how well documents match a term

3.1.5 Inverse Document Frequency (IDF): measuring the importance of a term in the query

3.1.6 TF-IDF: a balanced weighting metric for text-based relevance

3.2 Controlling the relevance calculation

3.2.1 BM25: Lucene’s default text-similarity algorithm

3.2.2 Functions, functions, everywhere!

3.2.3 Choosing multiplicative vs. additive boosting for relevance functions

3.2.4 Differentiating matching (filtering) vs. ranking (scoring) of documents

3.2.5 Logical matching: weighting the relationships between terms in a query

3.2.6 Separating concerns: filtering vs. scoring