concept document in category elasticsearch

Relevant Search: With applications for Solr and Elasticsearch

This is an excerpt from Manning's book Relevant Search: With applications for Solr and Elasticsearch.

We’re easily tricked into seeing search as a single problem. In reality, search applications differ greatly from one another. It’s true that a typical search application lets the user enter text, filter through documents, and interact with a list of ranked results. But don’t be fooled by superficial appearances. Each application has dramatically different relevance expectations. Let’s look at some common classes of search applications to appreciate that your application likely has its own unique definition of relevance.

In search applications, the notion of a document is central, because documents are items being stored, searched, and returned. Documents are what search is all about! When you issue a query to a search engine, you’re searching a collection of documents. These may be literal documents such as text files on a server. Or, more generally, documents may correspond to content such as:

At the core of a search engine is a data structure called the inverted index, analogous to the physical index at the back of this book. An inverted index is composed of two main pieces: a term dictionary and a postings list. The term dictionary is a sorted list of all terms that occur in a given field across a set of documents. For each term in the dictionary, there’s a corresponding list of documents that contain that term. This list of documents is referred to as the postings for a particular term. To understand this more clearly, let’s look at an example. Consider the set of documents shown in the following listing.

Listing 2.1. Documents

The term dictionary and postings list for this simple set of documents are presented in the following two listings, respectively.
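As a rough illustration of these two pieces, here is a minimal Python sketch that builds a term dictionary and postings from three made-up documents (they are not the documents of listing 2.1). The whitespace tokenization and the function name `build_inverted_index` are assumptions for illustration only; a real search engine uses far more elaborate analysis and storage.

```python
from collections import defaultdict

# Hypothetical documents for illustration; not the book's listing 2.1.
docs = {
    0: "one shoe two shoe",
    1: "red shoe blue shoe",
    2: "one fish two fish",
}


def build_inverted_index(docs):
    """Return (term_dictionary, postings) for a single text field."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        # Naive analysis: lowercase and split on whitespace.
        # set() ensures each document appears once per term.
        for term in set(docs[doc_id].lower().split()):
            postings[term].append(doc_id)
    # The term dictionary is the sorted list of all terms in the field.
    term_dictionary = sorted(postings)
    return term_dictionary, dict(postings)


term_dictionary, postings = build_inverted_index(docs)

for term in term_dictionary:
    print(term, postings[term])
```

Looking up a term is then a dictionary access: `postings["shoe"]` gives the IDs of the documents that contain "shoe".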

Because we know that we’re dealing with a search engine, we can be more specific about the steps in search’s ETL process. As illustrated in figure 2.5, these steps are extraction, enrichment, analysis, and indexing. Here, extraction is the process of retrieving the documents from their sources. The optional step of enrichment adds information useful for relevance to the documents. Analysis, as you saw earlier in this chapter, converts document text or data into tokens that enable matching. And finally, indexing is the process of placing the resulting data into the search engine’s data structures, such as the inverted index.

Figure 2.5. The full search ETL pipeline: extraction, enrichment, analysis, and indexing
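The four stages can be sketched as four small Python functions chained together. Everything here is hypothetical: the function names, the sample records, and the `word_count` enrichment are stand-ins chosen for illustration, not the book's code or any search engine's API.

```python
from collections import defaultdict


def extract():
    # Extraction: stand-in for retrieving records from a source system.
    return [
        {"id": 1, "title": "Red Shoes", "body": "Bright red running shoes"},
        {"id": 2, "title": "Blue Boots", "body": "Waterproof blue hiking boots"},
    ]


def enrich(doc):
    # Enrichment (optional): add information derived from existing fields.
    enriched = dict(doc)
    enriched["word_count"] = len(doc["body"].split())
    return enriched


def analyze(text):
    # Analysis: convert text into tokens (naively, lowercase + split).
    return text.lower().split()


def index(docs, field):
    # Indexing: place tokens into a term -> document IDs structure.
    inverted = defaultdict(list)
    for doc in docs:
        for token in set(analyze(doc[field])):
            inverted[token].append(doc["id"])
    return dict(inverted)


docs = [enrich(d) for d in extract()]
body_index = index(docs, "body")
```

Keeping the stages as separate functions makes it easy to see where a change (a new enrichment, a different tokenizer) would slot in.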

We cover extraction and enrichment rather generically. Many times, the details of these steps depend entirely on how your source data is stored. Indexing concerns us only insofar as it enables or disables features that affect relevance. Analysis, however, has overriding importance to search relevance and is expounded on here. It’s also discussed at several points throughout the book. Recall that analysis transforms raw text and data from the documents into tokens. These tokens represent the document’s features. Engineering these features to match those of a user’s query is critical to satisfying the user’s information need.
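To make the idea of matching document features to query features concrete, here is a small Python sketch in which the same analysis function is applied to both a document and a query. The stop-word list and the crude plural stripping are assumptions standing in for a real analyzer's filters and stemmer.

```python
import re

# Illustrative stop words; real analyzers ship much longer lists.
STOP_WORDS = {"the", "a", "an", "of"}


def analyze(text):
    """Turn raw text into normalized tokens."""
    # Lowercase and keep runs of letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Drop stop words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Crude plural stripping as a stand-in for a real stemmer.
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]


doc_tokens = analyze("The Red Shoes of Oz")
query_tokens = analyze("red shoe")

# The query matches when every query token is among the document's tokens.
matches = set(query_tokens) <= set(doc_tokens)
```

Without the shared normalization, the query token "shoe" would not match the document's "Shoes", which is why analysis has such a direct effect on what can be found.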

2.3.1. Extracting content into documents

Crafting documents that can be easily retrieved can be just as important to relevance as manipulating the innards of the search engine. You’ll see in particular later in this book that content curation (chapter 10) and careful field construction (chapters 4–7) often dictate whether a relevance solution is easy or hard. The basis for this work lies in controlling the extraction and enrichment process, which we outline in the following two sections.

Where do your search documents come from? Data has many possible sources. If you’re fortunate, documents can be easily retrieved from a database or external data repository. In this case, extraction may be as simple as crafting a query to dump the necessary data. If you’re less fortunate, you might have to look for your documents, for instance by crawling web pages or filesystems. And if you’re less fortunate still, you might find that your data is locked away in files that require complex additional processing (such as MS Word documents, PDFs, or, worst of all, images of scanned text). But no matter the case, the end result of extraction is a set of documents to be sent to the search engine. Here, a document may be exactly like the document described in section 2.1.1, a collection of typed fields that contain various values. Or, for search engines such as Elasticsearch, these can be complex hierarchical documents represented as JSON.
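The two document shapes mentioned above might look like this in Python. The field names and values are invented for illustration, and the sketch only shows the serialized shape of a document; it makes no call to Elasticsearch or any other search engine.

```python
import json
from datetime import date

# A flat document: a collection of typed fields with values.
flat_doc = {
    "title": "Red Shoes",                        # text
    "price": 49.99,                              # number
    "in_stock": True,                            # boolean
    "released": date(2015, 6, 1).isoformat(),    # date as an ISO string
}

# A hierarchical document: fields may hold objects and lists of objects.
nested_doc = {
    "title": "Red Shoes",
    "manufacturer": {"name": "Acme", "country": "US"},
    "reviews": [
        {"rating": 4, "text": "Comfortable"},
        {"rating": 5, "text": "Great colour"},
    ],
}

# Serialized as JSON, this is the kind of payload a JSON-based engine accepts.
payload = json.dumps(nested_doc, indent=2)
print(payload)
```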

The main takeaway is to own your extraction process. Extensive strategies, projects, plugins, and products exist for transforming data from a primary data source to a search engine. The permutations are so numerous that they’d fill dozens of books. We don’t cover these options. But you should understand how your extraction process works so that you can control the structure of your documents. Simply living with the structure of data as plopped into the search engine from your source systems can limit your options. In this book’s examples, we take control of this process by rolling our own code to extract documents from an external system and build search documents directly.
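As one possible picture of owning the extraction step, the sketch below pulls rows from a relational source and assembles search documents whose structure is chosen deliberately rather than inherited from the tables. The in-memory SQLite database, its schema, and the `extract_documents` helper are all hypothetical and are not the book's example code.

```python
import sqlite3

# Hypothetical source system: an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE tags (product_id INTEGER, tag TEXT);
INSERT INTO products VALUES (1, 'Red Shoes');
INSERT INTO products VALUES (2, 'Blue Boots');
INSERT INTO tags VALUES (1, 'running');
INSERT INTO tags VALUES (1, 'red');
INSERT INTO tags VALUES (2, 'hiking');
""")


def extract_documents(conn):
    """Build one search document per product, folding in its tags."""
    docs = []
    products = conn.execute(
        "SELECT id, title FROM products ORDER BY id"
    ).fetchall()
    for product_id, title in products:
        tag_rows = conn.execute(
            "SELECT tag FROM tags WHERE product_id = ? ORDER BY tag",
            (product_id,),
        ).fetchall()
        # We decide the document's shape here: tags become a list field.
        docs.append({
            "id": product_id,
            "title": title,
            "tags": [row[0] for row in tag_rows],
        })
    return docs


documents = extract_documents(conn)
```

Because the document is built in code, reshaping it (for example, merging the tags into a single text field) is a small local change rather than a change to the source system.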
