Elasticsearch does a lot of ground (and grunt) work behind the scenes on incoming textual data. It prepares the data so it can be stored and searched efficiently. In a nutshell, Elasticsearch cleans text fields, breaks the text into individual tokens, and enriches those tokens before storing them in inverted indexes. When a search query is executed, the query string is matched against the stored tokens, and any matches are retrieved and scored. This process of breaking text into individual tokens and storing them in internal memory structures is called text analysis.
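To make the idea concrete, here is a toy sketch in Python of what such a pipeline does conceptually: a simple analyzer that lowercases and tokenizes text, feeding an inverted index that maps tokens to the documents containing them. This is an illustration of the concept, not Elasticsearch's actual implementation; all function names and the sample documents are invented for the example.

```python
import re
from collections import defaultdict

def analyze(text):
    """Toy analyzer: lowercase the text, then split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    """Map each token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index

docs = {
    1: "Kubernetes in Action",
    2: "Elasticsearch in Action",
}
index = build_inverted_index(docs)
print(index["action"])       # both documents contain the token "action"
print(index["kubernetes"])   # only document 1 contains "kubernetes"
```

At query time, the same analyzer is applied to the query string, and the resulting tokens are looked up in the index; the real engine additionally scores each match for relevance.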
The aim of text analysis is not just to return search results quickly and efficiently but also to return relevant results. The work is carried out by analyzers: prebuilt software components that inspect the input text according to configurable rules. If a user searches for "K8s", for example, we should be able to fetch books on Kubernetes. Similarly, if a query includes an emoji such as ☕ (coffee), the search engine should be able to return coffee-related results. These and many more search criteria are honored by the engine because of the way we configure its analyzers.
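One way to see how such matches become possible is token-level synonym expansion, which is roughly what a synonym token filter achieves inside an analyzer. The sketch below is a toy approximation in Python, not Elasticsearch's filter; the synonym mappings are assumptions chosen to mirror the examples above.

```python
import re

# Assumed synonym mappings for illustration: "K8s" and the coffee emoji
# are rewritten to canonical tokens at analysis time.
SYNONYMS = {"k8s": "kubernetes", "☕": "coffee"}

def analyze(text):
    """Lowercase, split into tokens (keeping the emoji), then expand synonyms."""
    tokens = re.findall(r"[a-z0-9]+|☕", text.lower())
    return [SYNONYMS.get(token, token) for token in tokens]

print(analyze("Best K8s books"))   # ['best', 'kubernetes', 'books']
print(analyze("I need ☕"))        # ['i', 'need', 'coffee']
```

Because both documents and queries pass through the same analysis chain, a query for "K8s" and a document mentioning "Kubernetes" end up sharing the same stored token and therefore match.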