Chapter 4. Taming tokens

This chapter covers

Tokenization to extract ideas rather than words
The concepts of precision and recall in search
Making trade-offs between precision and recall
Controlling the specificity of matches
Encoding non-textual data into the search engine

At this point, you have a good understanding of why relevance is critical for the success of a search application (chapter 1). You also have a working knowledge of search engine internals (chapter 2) and can debug relevance to pin down why documents match and why they’re given a particular score (chapter 3).

Now, armed with motivation, knowledge, and tools, it’s time to dive into the art of relevance engineering. In this chapter, we focus on text analysis. Proper analysis is the foundation of relevant search. As you saw in chapter 3, analysis controls matching. If analysis is performed correctly, users’ queries will match only the documents that they seek. But if analysis is performed incorrectly, users’ queries will match many irrelevant documents or maybe no documents at all!

4.1. Tokens as document features

Several times we’ve pointed out the relationship between relevance and classification. (Remember our fruit examples?) This relationship is perhaps most obvious when we talk about tokens, because just as the color, shape, and size of a fruit are features by which a fruit may be classified, the tokens pulled from a document are features by which the document can be classified.
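To make the idea of tokens as features concrete, here is a minimal sketch in Python. It is not how a real search engine performs analysis. The `analyze`, `features`, and `matches` helpers are hypothetical names introduced only for illustration, and the analyzer simply splits on non-alphanumeric characters and optionally lowercases. The sketch shows a document reduced to a bag of token features, and how a small change in analysis decides whether a query matches at all.

```python
import re
from collections import Counter


def analyze(text, lowercase=True):
    """Hypothetical analyzer: split text into alphanumeric tokens.

    Lowercasing is optional so we can see how an analysis choice
    changes which tokens (features) a document ends up with.
    """
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return [t.lower() for t in tokens] if lowercase else tokens


def features(text, lowercase=True):
    """Represent a document as token features with their counts."""
    return Counter(analyze(text, lowercase))


def matches(query, doc, lowercase=True):
    """A query matches if any query token is among the document's tokens.

    The same analysis is applied to both the query and the document.
    """
    doc_tokens = set(analyze(doc, lowercase))
    return any(t in doc_tokens for t in analyze(query, lowercase))


doc = "Fresh Apples and ripe apples for sale"

# With lowercasing, "Apples" and "apples" collapse into one feature.
print(features(doc))

# The query matches because both sides were normalized the same way.
print(matches("apples", doc))

# Without lowercasing, "APPLES" is a different token from "Apples"
# and "apples", so the document is not found.
print(matches("APPLES", doc, lowercase=False))
```

Just as a fruit's color or shape is only useful if it's measured consistently, a token is only a useful feature if the query and the document are analyzed in compatible ways.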
