7 Text analysis
This chapter covers
- Overview of text analysis
- Anatomy of an analyzer
- Built-in analyzers
- Developing custom analyzers
- Understanding tokenizers
- Learning about character and token filters
Elasticsearch does a lot of ground (and grunt) work behind the scenes on incoming textual data. It preps the data so it can be stored and searched efficiently. In a nutshell, Elasticsearch cleans the text fields, breaks the text into individual tokens, and enriches those tokens before storing them in the inverted indices. When a search query is executed, the query string is matched against the stored tokens, and any matching documents are retrieved and scored accordingly. This process is termed text analysis, and it is performed on all text fields.
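To get a feel for what this process produces, we can ask Elasticsearch to show us the tokens directly using the `_analyze` API. The following is a minimal sketch using the built-in standard analyzer (we meet it properly later in this chapter), written in Kibana Dev Tools console syntax:

```json
GET _analyze
{
  "analyzer": "standard",
  "text": "Elasticsearch does the grunt work!"
}
```

The response lists the individual tokens, here elasticsearch, does, the, grunt, and work: lowercased and stripped of punctuation. These tokens are what end up in the inverted index.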
The aim of text analysis is not just to return search results quickly and efficiently but also to retrieve relevant results. This groundwork is carried out by so-called analyzers: software components prebuilt to inspect and process the input text according to a set of rules. If the user searches for “K8s”, for example, we should be able to fetch books on Kubernetes, even though the search criterion was K8s (one way to set this up is sketched below). Similarly, if the search word is “emojis”, the search engine should be capable of extracting the appropriate results. These and many other search requirements are honored by the engine thanks to the way we configure the analyzers.
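As a preview of the custom analyzers we develop later in this chapter, the following sketch shows one way the K8s case can be handled: a synonym token filter that rewrites k8s to kubernetes. The index name (`books_index`) and the filter and analyzer names are made up for illustration; the `synonym` filter type itself is built into Elasticsearch.

```json
PUT books_index
{
  "settings": {
    "analysis": {
      "filter": {
        "k8s_synonyms": {             // hypothetical filter name
          "type": "synonym",
          "synonyms": ["k8s => kubernetes"]
        }
      },
      "analyzer": {
        "tech_analyzer": {            // hypothetical custom analyzer
          "tokenizer": "standard",
          "filter": ["lowercase", "k8s_synonyms"]
        }
      }
    }
  }
}
```

Because the `lowercase` filter runs before the synonym filter, a query for “K8s” against a text field mapped to this analyzer is normalized to k8s and then rewritten to kubernetes, so books on Kubernetes match even though the user never typed that word.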