3 Introduction to information search

 

This chapter covers

  • Implementing your own information-retrieval algorithm
  • Exploring useful NLP techniques, including stemming and stopwords removal
  • Assessing the importance of different pieces of information in a search
  • Evaluating the relevance of the documents to the information need

This chapter focuses on algorithms for information search, a task more formally known as information retrieval. It explains the steps of a search algorithm from beginning to end, and by the end of the chapter you will be able to implement your own.

You might have come across the term information retrieval in the context of search engines; for example, Google famously started its business by providing a powerful search algorithm that kept improving over time. The search for information, however, is a basic need that extends well beyond searching online. For instance, every time you search for files on your computer, you are performing a kind of information retrieval. In fact, the task predates the digital era: before computers and the internet became commonplace, one had to wade manually through paper copies of encyclopedias, books, documents, files, and so on. Thanks to technology, algorithms now help you do many of these tasks automatically.

3.1 Understanding the task

3.1.1 Data and data structures

3.1.2 Boolean search algorithm

3.2 Processing the data further

3.2.1 Preselecting the words that matter: Stopwords removal

3.2.2 Matching forms of the same word: Morphological processing

3.3 Information weighing

3.3.1 Weighing words with term frequency

3.3.2 Weighing words with inverse document frequency

3.4 Practical use of the search algorithm

3.4.1 Retrieval of the most similar documents

3.4.2 Evaluation of the results

3.4.3 Deploying the search algorithm in practice