Chapter 5. Analyzing your data

This chapter covers

Analyzing your document’s text with Elasticsearch
Using the analysis API
Tokenization
Character filters
Token filters
Stemming
Analyzers included with Elasticsearch

So far we’ve covered indexing and searching your data, but what actually happens to the text in a document when you send it to Elasticsearch? How can Elasticsearch find specific words within sentences, even when the case changes? For example, when a user searches for “nosql,” you’d generally like a document containing the sentence “share your experience with NoSql & big data technologies” to match, because it contains the word NoSql. You can use what you learned in the previous chapter to run a query_string search for “nosql” and find the document. In this chapter you’ll learn why the query_string query returns that document. Once you finish this chapter, you’ll have a better idea of how Elasticsearch’s analysis lets you search your document set more flexibly.

5.1. What is analysis?

Analysis is the process Elasticsearch performs on the body of a document before the document is sent off to be added to the inverted index. Elasticsearch goes through a number of steps for every analyzed field before the document is added to the index, among them character filtering, tokenization, and token filtering.
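To make the idea concrete, here is a minimal Python sketch of an analysis chain feeding an inverted index. It is a toy illustration and not how Elasticsearch implements analysis: the function names (char_filter, tokenize, lowercase_filter, analyze, build_inverted_index) are hypothetical, and the specific choices (mapping “&” to “and,” splitting on non-word characters, lowercasing) are assumptions picked to fit the “NoSql & big data” sentence from the introduction.

```python
import re

def char_filter(text):
    # Character filtering (assumed rule): rewrite "&" as "and" before tokenizing.
    return text.replace("&", "and")

def tokenize(text):
    # Tokenization (assumed rule): each run of word characters becomes a token.
    return re.findall(r"\w+", text)

def lowercase_filter(tokens):
    # Token filtering (assumed rule): lowercase every token.
    return [token.lower() for token in tokens]

def analyze(text):
    # Run the three steps in order: character filter, tokenizer, token filter.
    return lowercase_filter(tokenize(char_filter(text)))

def build_inverted_index(docs):
    # Map each analyzed token to the set of document IDs that contain it.
    index = {}
    for doc_id, text in docs.items():
        for token in analyze(text):
            index.setdefault(token, set()).add(doc_id)
    return index

docs = {1: "share your experience with NoSql & big data technologies"}
index = build_inverted_index(docs)

# The search term goes through the same analysis before the lookup,
# which is why "nosql" can match the document's "NoSql" in this sketch.
hits = index.get(analyze("nosql")[0], set())
print(hits)
```

Because the document text and the search term pass through the same steps in this sketch, a difference in case between them no longer prevents a match.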
