2 Working with natural language

This chapter covers:

Uncovering the hidden structure in unstructured data
A search-centric philosophy of language and natural language understanding
Exploring distributional semantics and word embeddings
Modeling domain-specific knowledge
Tackling challenges in natural language understanding and query interpretation
Applying natural language learning techniques to both content and signals

In the first chapter, we provided a high-level overview of what it means to build an AI-powered search system. Throughout the rest of the book, we’ll explore and demonstrate the numerous ways your search application can continuously learn from your content and your user behavioral signals to better understand your content, your users, and your domain, and ultimately to deliver the answers your users need. We will get much more hands-on in chapter three, where we fire up a search server (Apache Solr) and a data processing layer (Apache Spark) and start the first of our Jupyter notebooks, which we’ll use throughout the book to walk through many step-by-step examples.

2.1 The myth of unstructured data

2.1.1 Types of unstructured data

2.1.2 Data types in traditional structured databases

2.1.3 Joins, fuzzy joins, and entity resolution in unstructured data

2.2 The structure of natural language

2.3 Distributional semantics and word embeddings

2.4 Modeling domain-specific knowledge

2.5 Challenges in natural language understanding for search

2.5.1 The challenge of ambiguity (polysemy)

2.5.2 The challenge of understanding context

2.5.3 The challenge of personalization

2.5.4 Challenges interpreting queries vs. documents

2.5.5 Challenges interpreting query intent

2.6 The fuel powering AI-powered search

2.7 Summary