Chapter 4. Taming tokens
This chapter covers
- Tokenization to extract ideas rather than words
- The concepts of precision and recall in search
- Making trade-offs between precision and recall
- Controlling the specificity of matches
- Encoding non-textual data into the search engine
At this point, you have a good understanding of why relevance is critical for the success of a search application (chapter 1). You also have a working knowledge of search engine internals (chapter 2) and can debug relevance to pin down why documents match and why they’re given a particular score (chapter 3).
Now, armed with motivation, knowledge, and tools, it’s time to dive into the art of relevance engineering. In this chapter, we focus on text analysis. Proper analysis is the foundation of relevant search. As you saw in chapter 3, analysis controls matching. If analysis is performed correctly, users’ queries will match only the documents that they seek. But if analysis is performed incorrectly, users’ queries will match many irrelevant documents or maybe no documents at all!
Several times we’ve pointed out the relationship between relevance and classification. (Remember our fruit examples?) This relationship is perhaps most obvious when we talk about tokens. Just as the color, shape, and size of a fruit are features by which the fruit may be classified, the tokens pulled from a document are features by which the document can be classified.
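To make the claim that analysis controls matching a little more concrete, here is a minimal sketch in plain Python. It is not how any particular search engine implements analysis; the two analyzers, the tiny inverted index, and the sample documents are all hypothetical and exist only for illustration. The point is that the same query against the same documents matches or misses depending solely on which tokens the analyzer emits.

```python
import re

def whitespace_analyzer(text):
    """Hypothetical analyzer: split on whitespace, keep case and punctuation."""
    return text.split()

def normalizing_analyzer(text):
    """Hypothetical analyzer: lowercase, then keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs, analyzer):
    """Build a toy inverted index mapping each token to the set of doc ids."""
    index = {}
    for doc_id, text in docs.items():
        for token in analyzer(text):
            index.setdefault(token, set()).add(doc_id)
    return index

def search(index, query, analyzer):
    """Return ids of documents that share at least one token with the query."""
    matches = set()
    for token in analyzer(query):
        matches |= index.get(token, set())
    return matches

# Made-up sample documents, for illustration only.
docs = {
    1: "Fresh Apples, picked daily.",
    2: "Apple pie recipe",
    3: "Banana bread",
}

strict = build_index(docs, whitespace_analyzer)
normalized = build_index(docs, normalizing_analyzer)

# The indexed token "Apples," differs from the query token "apples": no match.
print(search(strict, "apples", whitespace_analyzer))       # set()

# After lowercasing and stripping punctuation, both sides yield "apples".
print(search(normalized, "apples", normalizing_analyzer))  # {1}
```

With the whitespace-only analyzer, the user searching for "apples" finds nothing, even though document 1 is plainly about apples. The normalizing analyzer recovers that match, yet still treats "apple" and "apples" as unrelated tokens, so document 2 is missed. In both cases the tokens are the only features the engine has for deciding whether a document belongs in the result set.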