3 Text embeddings


This chapter covers

  • Preparing texts for deep learning using word and document embeddings
  • Using self-developed vs. pretrained embeddings
  • Implementing word similarity with Word2Vec
  • Retrieving documents using Doc2Vec

After reading this chapter, you will have a practical command of basic and popular text embedding algorithms, and you will have developed insight into how to use embeddings for NLP. We will go through a number of concrete scenarios to reach that goal. But first, let’s review the basics of embeddings.

3.1 Embeddings

Embeddings are procedures for converting input data into vector representations. As mentioned in chapter 1, a vector is a container (such as an array) holding numbers. Every vector lives as a single point in a multidimensional vector space, with each of its values interpreted as a coordinate along a specific dimension. Embeddings result from systematic, well-crafted procedures for projecting (embedding) input data into such a space.
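
To make this concrete, here is a minimal sketch in Python (using NumPy, with made-up numbers purely for illustration) of two words embedded as points in a three-dimensional vector space:

import numpy as np

# A toy 3-dimensional "embedding": each word is a single point (vector)
# in a 3-dimensional space. The numbers are invented for illustration;
# a real embedding procedure assigns them systematically.
toy_embedding = {
    "cat": np.array([0.2, 0.9, 0.1]),
    "dog": np.array([0.3, 0.8, 0.2]),
}

print(toy_embedding["cat"].shape)  # (3,): one value per dimension

Each value of a vector answers the question "where does this word sit along this particular dimension?"; the full vector pins the word to one point in the space.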

We have seen ample vector representations of texts in chapters 1 and 2, such as one-hot vectors (binary-valued vectors with a single bit “on” for a specific word) used for bag-of-words representations, and frequency- or TF.IDF-based vectors. All of these vector representations were created by embeddings.
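
As a quick refresher, the following sketch shows one way such vectors can be produced. It uses scikit-learn's CountVectorizer and TfidfVectorizer, which are assumed here purely for illustration:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat"]

# Binary bag-of-words: a 1 for every word that occurs in a document.
binary_vectors = CountVectorizer(binary=True).fit_transform(docs)

# Frequency-based bag-of-words: raw word counts per document.
count_vectors = CountVectorizer().fit_transform(docs)

# TF.IDF-weighted vectors: counts reweighted by inverse document frequency.
tfidf_vectors = TfidfVectorizer().fit_transform(docs)

print(binary_vectors.toarray())  # one row per document, one column per vocabulary word
print(tfidf_vectors.toarray())

All three produce one vector per document, with one dimension per vocabulary word; they differ only in how the value for each dimension is computed.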

Let’s work our way up from the simplest embeddings to the more complex ones. Recall from chapter 1 that there are two major types of vector encodings, depending on how the vectors are obtained: representational embeddings, which are computed directly from the input data, and procedural embeddings, which are learned by a model. The next two subsections discuss each in turn.

3.1.1 Embedding by direct computation: Representational embeddings

3.1.2 Learning to embed: Procedural embeddings

3.2 From words to vectors: Word2Vec

3.3 From documents to vectors: Doc2Vec

Summary