chapter three

3 Ranking and content-based relevance

This chapter covers

Executing queries and returning search results
Ranking search results based on how relevant they are to an incoming query
Keyword matching and filtering versus vector-based ranking
Controlling and specifying custom ranking functions with function queries
Tailoring ranking functions to a specific domain

Search engines fundamentally do three things: ingest content (indexing), return content matching incoming queries (matching), and sort the returned content based on some measure of how well it matches the query (ranking). Additional layers can be added, allowing users to provide better queries (autosuggest, chatbot dialogs, etc.) and to extract better answers from the results or summarize the results by using large language models (see chapters 14–15), but the core functions of the search engine are matching and ranking on indexed data.
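To make the three core functions concrete, here is a minimal sketch of a toy engine in Python. It is not how any production search engine is implemented: the class name `TinySearchEngine`, its methods, the whitespace tokenizer, and the raw term-count score are all assumptions chosen for illustration, and the score is only a stand-in for the relevance measures this chapter develops.

```python
from collections import Counter, defaultdict


class TinySearchEngine:
    """Hypothetical toy engine illustrating indexing, matching, and ranking."""

    def __init__(self):
        self.docs = {}                     # doc_id -> Counter of term counts
        self.postings = defaultdict(set)   # term -> ids of docs containing it

    @staticmethod
    def tokenize(text):
        # Naive tokenizer (an assumption): lowercase, split on whitespace.
        return text.lower().split()

    def index(self, doc_id, text):
        # Indexing: store term counts and update the inverted index.
        counts = Counter(self.tokenize(text))
        self.docs[doc_id] = counts
        for term in counts:
            self.postings[term].add(doc_id)

    def match(self, query):
        # Matching: every document containing at least one query term.
        matched = set()
        for term in self.tokenize(query):
            matched |= self.postings.get(term, set())
        return matched

    def rank(self, query, doc_ids):
        # Ranking: sort by summed raw term counts, highest first.
        # Ties are broken by doc_id so the order is deterministic.
        terms = self.tokenize(query)
        scored = [(sum(self.docs[d][t] for t in terms), d) for d in doc_ids]
        return [d for _, d in sorted(scored, key=lambda p: (-p[0], p[1]))]

    def search(self, query):
        # Matching decides which documents come back; ranking decides order.
        return self.rank(query, self.match(query))


engine = TinySearchEngine()
engine.index("doc1", "the cat in the hat")
engine.index("doc2", "the cat sat on the mat with another cat")
engine.index("doc3", "green eggs and ham")

results = engine.search("cat mat")
print(results)
```

In this sketch, `match` and `rank` are kept as separate steps on purpose: the first determines which documents are returned at all, and the second only determines their order.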

3.1 Scoring query and document vectors with cosine similarity

3.1.1 Mapping text to vectors

3.1.2 Calculating similarity between dense vector representations

3.1.3 Calculating similarity between sparse vector representations

3.1.4 Term frequency: Measuring how well documents match a term

3.1.5 Inverse document frequency: Measuring the importance of a term in the query

3.1.6 TF-IDF: A balanced weighting metric for text-based relevance

3.2 Controlling the relevance calculation

3.2.1 BM25: The industry standard default text-similarity algorithm

3.2.2 Functions, functions, everywhere!

3.2.3 Choosing multiplicative vs. additive boosting for relevance functions

3.2.4 Differentiating matching (filtering) vs. ranking (scoring) of documents

3.2.5 Logical matching: Weighting the relationships between terms in a query

3.2.6 Separating concerns: Filtering vs. scoring

3.3 Implementing user and domain-specific relevance ranking

Summary