Chapter 4. Lucene’s analysis process

This chapter covers

  • Understanding the analysis process
  • Using Lucene’s core analysis classes
  • Writing custom analyzers
  • Handling foreign languages

Analysis, in Lucene, is the process of converting field text into its most fundamental indexed representation: terms. These terms are used to determine which documents match a query during searching. For example, if you indexed this sentence in a field, the terms might start with for and example, and so on, as separate terms in sequence.

An analyzer is an encapsulation of the analysis process. An analyzer tokenizes text by performing any number of operations on it, which could include extracting words, discarding punctuation, removing accents from characters, lowercasing (also called normalizing), removing common words, reducing words to a root form (stemming), or changing words into their basic form (lemmatization). This process is also called tokenization, and the chunks of text pulled from a stream of text are called tokens. Tokens, combined with their associated field name, are terms.
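The pipeline described above can be sketched in plain Java, without any Lucene classes. This is only an illustration of the steps (word extraction, punctuation removal, normalization, stop-word removal); the `analyze` method, the regex split, and the tiny stop-word set are assumptions for the example, not Lucene's actual implementation:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class AnalysisSketch {
    // A toy stop-word list; real analyzers ship much larger ones.
    static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "and", "is", "in");

    // Mimics an analyzer pipeline: extract words, discard punctuation,
    // lowercase (normalize), and remove common (stop) words.
    static List<String> analyze(String text) {
        return Arrays.stream(text.split("\\W+"))        // split on non-word characters
                .filter(s -> !s.isEmpty())              // drop empty fragments
                .map(String::toLowerCase)               // normalize case
                .filter(t -> !STOP_WORDS.contains(t))   // remove common words
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(analyze("The quick brown fox, and the lazy dog!"));
        // prints [quick, brown, fox, lazy, dog]
    }
}
```

Each string the sketch emits corresponds to a token; paired with a field name, each token would become a term. A real Lucene analyzer performs the same kind of staged transformation, but as a stream of tokenizers and token filters rather than a single method.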

Lucene’s primary goal is to facilitate information retrieval. The emphasis on retrieval is important. You want to throw gobs of text at Lucene and have it be richly searchable by the individual words within that text. In order for Lucene to know what “words” are, it analyzes the text during indexing, extracting it into terms. These terms are the primitive building blocks for searching.

4.1. Using analyzers

4.2. What’s inside an analyzer?

4.3. Using the built-in analyzers

4.4. Sounds-like querying

4.5. Synonyms, aliases, and words that mean the same

4.6. Stemming analysis

4.7. Field variations

4.8. Language analysis issues

4.9. Nutch analysis

4.10. Summary
