7 Text analysis
This chapter covers
- Overview of text analysis
- Anatomy of an analyzer
- Built-in analyzers
- Developing custom analyzers
- Understanding tokenizers
- Learning about character and token filters
Elasticsearch does a lot of ground (and grunt) work behind the scenes on incoming textual data. It preps the data so it can be stored and searched efficiently. In a nutshell, Elasticsearch cleans the text fields, breaks the text into individual tokens, and enriches those tokens before storing them in the inverted indices. When a search query is executed, the query string is matched against the stored tokens, and any matching documents are retrieved and scored accordingly. This process is termed text analysis, and it is performed on all text fields.
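To get a feel for what this process produces, we can ask Elasticsearch to show us the tokens directly using the `_analyze` API. The following is a minimal sketch using the built-in standard analyzer (we meet it properly later in this chapter), written in Kibana Dev Tools console syntax:

```json
GET _analyze
{
  "analyzer": "standard",
  "text": "Elasticsearch does the grunt work!"
}
```

The response lists the individual tokens, here elasticsearch, does, the, grunt, and work: lowercased and stripped of punctuation. These tokens are what end up in the inverted index.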
The aim of text analysis is not just to return search results quickly and efficiently but also to retrieve relevant results. This groundwork is carried out by so-called analyzers: software components prebuilt to inspect and process the input text according to a set of rules. If the user searches for “K8s”, for example, we should be able to fetch books on Kubernetes, even though the search criterion was K8s (one way to set this up is sketched below). Similarly, if the search word is “emojis”, the search engine should be capable of extracting the appropriate results. These and many other search requirements are honored by the engine thanks to the way we configure the analyzers.
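As a preview of the custom analyzers we develop later in this chapter, the following sketch shows one way the K8s case can be handled: a synonym token filter that rewrites k8s to kubernetes. The index name (`books_index`) and the filter and analyzer names are made up for illustration; the `synonym` filter type itself is built into Elasticsearch.

```json
PUT books_index
{
  "settings": {
    "analysis": {
      "filter": {
        "k8s_synonyms": {             // hypothetical filter name
          "type": "synonym",
          "synonyms": ["k8s => kubernetes"]
        }
      },
      "analyzer": {
        "tech_analyzer": {            // hypothetical custom analyzer
          "tokenizer": "standard",
          "filter": ["lowercase", "k8s_synonyms"]
        }
      }
    }
  }
}
```

Because the `lowercase` filter runs before the synonym filter, a query for “K8s” against a text field mapped to this analyzer is normalized to k8s and then rewritten to kubernetes, so books on Kubernetes match even though the user never typed that word.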