chapter two

2 Working with natural language

This chapter covers

The hidden structures in unstructured data
A search-centric philosophy of language
Exploring distributional semantics and vector-based embeddings
Modeling domain-specific knowledge
Challenges with natural language and queries
Applying natural language understanding techniques to both content and signals

In the first chapter, we provided a high-level overview of what it means to build an AI-powered search system. Throughout the rest of the book, we’ll explore and demonstrate the numerous ways your search application can continuously learn from your content and your users’ behavioral signals to better understand your content, your users, and your domain, and to ultimately deliver users the answers they need. We will get much more hands-on in chapter 3, where we’ll fire up a search server of your choice and a data processing layer (Apache Spark) and start working with the first of our Jupyter notebooks, which we’ll use throughout the book to walk through many step-by-step examples.

2.1 The myth of unstructured data

2.1.1 Types of unstructured data

2.1.2 Data types in traditional structured databases

2.1.3 Joins, fuzzy joins, and entity resolution in unstructured data

2.2 The structure of natural language

2.3 Distributional semantics and embeddings

2.4 Modeling domain-specific knowledge

2.5 Challenges in natural language understanding for search

2.5.1 The challenge of ambiguity (polysemy)

2.5.2 The challenge of understanding context

2.5.3 The challenge of personalization

2.5.4 Challenges interpreting queries vs. documents

2.5.5 Challenges interpreting query intent

2.6 Content + signals: The fuel powering AI-powered search

Summary