Chapter 5. Analyzing your data

This chapter covers

  • Analyzing your document’s text with Elasticsearch
  • Using the analysis API
  • Tokenization
  • Character filters
  • Token filters
  • Stemming
  • Analyzers included with Elasticsearch

So far we’ve covered indexing and searching your data, but what actually happens to the text in a document when you send it to Elasticsearch? How can Elasticsearch find specific words within sentences, even when the case changes? For example, when a user searches for “nosql,” you’d generally want a document containing the sentence “share your experience with NoSql & big data technologies” to match, because it contains the word NoSql. You can use what you learned in the previous chapter to run a query_string search for “nosql” and find the document. In this chapter you’ll learn why the query_string query returns that document. By the end of the chapter you’ll have a better idea of how Elasticsearch’s analysis lets you search your documents in a more flexible manner.

5.1. What is analysis?

Analysis is the process Elasticsearch performs on the text of a document before the document is added to the inverted index. For every analyzed field, Elasticsearch runs the text through a series of steps before indexing it: character filtering, tokenization, and token filtering, each of which is covered in this chapter.
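To make the idea concrete, here is a minimal Python sketch of an analysis chain feeding an inverted index. It is not Elasticsearch's implementation: the function names (char_filter, tokenize, token_filter, analyze, build_inverted_index, search) are hypothetical, and the rules chosen (turning "&" into "and", splitting on non-alphanumeric characters, lowercasing) are simplifying assumptions that stand in for whatever a real analyzer is configured to do.

```python
import re
from collections import defaultdict


def char_filter(text):
    # Character filtering (assumed rule): rewrite "&" as the word "and"
    # before the text is split into tokens.
    return text.replace("&", " and ")


def tokenize(text):
    # Tokenization (simplified): keep runs of ASCII letters and digits,
    # treating everything else as a separator.
    return re.findall(r"[A-Za-z0-9]+", text)


def token_filter(tokens):
    # Token filtering (assumed rule): lowercase every token so that
    # "NoSql" and "nosql" become the same term.
    return [token.lower() for token in tokens]


def analyze(text):
    # Run the three stages in order and return the resulting terms.
    return token_filter(tokenize(char_filter(text)))


def build_inverted_index(docs):
    # Map each term to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    # Analyze the query the same way as the documents, then return the
    # ids of documents that contain every query term.
    terms = analyze(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(term, set()) for term in terms))


docs = {1: "share your experience with NoSql & big data technologies"}
index = build_inverted_index(docs)

print(analyze(docs[1]))
print(search(index, "nosql"))
```

Because the query text goes through the same steps as the document text, a search for “nosql” ends up looking for the same lowercased term that was stored for “NoSql”, which is why the document matches in this sketch.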

5.2. Using analyzers for your documents

5.3. Analyzing text with the analyze API

5.4. Analyzers, tokenizers, and token filters, oh my!

5.5. Ngrams, edge ngrams, and shingles

5.6. Stemming

5.7. Summary