Chapter 12. Document ranking
This chapter covers
- The vector space model
- Extending the DefaultSimilarity class
- Writing your own Scorer and Weight classes
- Relevancy
Have you ever found yourself saying something like, “I need to score the results of my queries slightly differently. How do I go about that?” Or maybe, “I don’t care about how many times what I’m looking for occurs in a result; I just want to know whether or not it does.” The authors have even heard, “The lengths of the documents I’m querying should have nothing to do with the scores of the results.”
How documents are scored when a search retrieves them is a hot topic among Lucene users. Questions echoing these same concerns appear every day on the Lucene mailing list at java-user@lucene.apache.org. If you wish to subscribe to this list, you can do so at http://lucene.apache.org/java/docs/mailinglists.html.
We’re going to answer many of those questions here, and in doing so we’ll cover one of the most difficult topics in information retrieval. We’ll start by using the classic vector space model to score documents against a query. We’ll then cover Lucene’s scoring methodology and run through examples of how to change document scores. We’ll build our own classes and extend others to score documents the way we want. Finally, we’ll talk about document relevance and how to improve it. There’s a lot to cover here, so let’s get started.