chapter three

3 Ranking and content-based relevance

This chapter covers

Executing queries and returning matching search results
Ranking search results based upon how relevant they are to an incoming query
Keyword match and filtering vs. vector-based ranking
Controlling and specifying your own ranking functions with function queries
Tailoring ranking functions to a specific domain

Search engines fundamentally do three things: ingest content (indexing), return content matching incoming queries (matching), and sort the returned content by some measure of how well it matches the query (ranking). Additional layers can be added on top to help users provide better queries (autosuggest, chatbot dialogs, and so on) and to extract better answers from the results or summarize them using large language models (see chapters 14-15), but the core functions of a search engine revolve around matching and ranking indexed data.

Relevance is the term for this notion of "how well the content matches the query". Most of the time the matched content consists of documents, and what is returned and ranked is those matched documents along with metadata describing them.

3.1 Scoring query and document vectors with cosine similarity

3.1.1 Mapping text to vectors

3.1.2 Calculating similarity between dense vector representations

3.1.3 Calculating similarity between sparse vector representations

3.1.4 Term Frequency (TF): measuring how well documents match a term

3.1.5 Inverse Document Frequency (IDF): measuring the importance of a term in the query

3.1.6 TF-IDF: a balanced weighting metric for text-based relevance

3.2 Controlling the relevance calculation

3.2.1 BM25: Lucene’s default text-similarity algorithm

3.2.2 Functions, functions, everywhere!

3.2.3 Choosing multiplicative vs. additive boosting for relevance functions

3.2.4 Differentiating matching (filtering) vs. ranking (scoring) of documents

3.2.5 Logical matching: weighting the relationships between terms in a query

3.2.6 Separating concerns: filtering vs. scoring