Chapter 12. Document ranking

This chapter covers

  • The vector space model
  • Extending the DefaultSimilarity class
  • Writing your own Scorer and Weight classes
  • Relevancy

Have you ever found yourself saying something like, “I need to score the results of my queries slightly differently. How do I go about that?” Or maybe, “I don’t care about how many times what I’m looking for occurs in a result; I just want to know whether or not it does.” The authors have even heard, “The lengths of the documents I’m querying should have nothing to do with the scores of the results.”

How documents are scored when a search retrieves them is a perennial topic among Lucene users. Questions echoing these same concerns appear every day on the Lucene mailing list, java-user@lucene.apache.org. To subscribe to this list, visit http://lucene.apache.org/java/docs/mailinglists.html.

We’re going to answer many of those questions here, and in doing so we’ll cover one of the most difficult topics in information retrieval. We’ll start by using the classic vector space model to score documents against a query. We’ll then cover Lucene’s scoring methodology and run through examples of how to change document scores. We’ll build our own classes and extend others to score results the way we want. Finally, we’ll talk about document relevance and how to improve it. There’s a lot to cover here, so let’s get started.

12.1. Scoring documents

12.2. Exploring Lucene’s scoring approach and the DefaultSimilarity class

12.3. Scoring things my way

12.4. Document relevance

12.5. Summary