3 Word and document embeddings

 

This chapter covers

  • What word embeddings are and why they are important
  • How the Skip-gram model learns word embeddings and how to implement it
  • What GloVe embeddings are and how to use pretrained vectors
  • How to use Doc2Vec and fastText to train more advanced embeddings
  • How to visualize word embeddings

In chapter 2, I pointed out that neural networks can deal only with numbers, whereas almost everything in natural language is discrete (i.e., made up of separate concepts). To use neural networks in your NLP application, you need to convert linguistic units to numbers, such as vectors. For example, if you wish to build a sentiment analyzer, you need to convert the input sentence (a sequence of words) into a sequence of vectors. In this chapter, we’ll discuss word embeddings, which are the key to bridging this gap. We’ll also touch upon a couple of fundamental linguistic components that are important for understanding embeddings and neural networks in general.

3.1 Introducing embeddings

As we discussed in chapter 2, an embedding is a real-valued vector representation of something that is usually discrete. In this section, we’ll revisit what embeddings are and discuss in detail what roles they play in NLP applications.

3.1.1 What are embeddings?

A word embedding is a real-valued vector representation of a word. If you find the concept of vectors intimidating, think of them as one-dimensional arrays of floating-point numbers, like the following:
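For instance, here is a minimal sketch in plain Python. The three-word vocabulary, the three-dimensional vectors, and the helper name `embed` are all assumptions made up for illustration; none of the values come from a trained model, and real embeddings typically have many more dimensions.

```python
# Hypothetical word embeddings: each word maps to a one-dimensional
# array (here, a Python list) of floating-point numbers.
# The values are invented for illustration, not learned from data.
embeddings = {
    "cat": [0.7, 0.5, 0.1],
    "dog": [0.8, 0.3, 0.1],
    "pizza": [0.1, 0.2, 0.8],
}


def embed(sentence):
    """Convert a whitespace-separated sentence into a sequence of vectors.

    This is a hypothetical helper: it splits on spaces and looks up each
    word, so it raises KeyError for words outside the toy vocabulary.
    """
    return [embeddings[word] for word in sentence.split()]


# A sentence (sequence of words) becomes a sequence of vectors.
vectors = embed("cat dog")
for word, vector in zip("cat dog".split(), vectors):
    print(word, vector)
```

Each word is looked up independently, so a sentence of two words yields two vectors of the same length. This is the kind of numeric input a neural network can consume.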

3.1.2 Why are embeddings important?

 
 
 
 

3.2 Building blocks of language: Characters, words, and phrases

 
 
 

3.2.1 Characters

 
 
 
 

3.2.2 Words, tokens, morphemes, and phrases

 
 

3.2.3 N-grams

 
 
 

3.3 Tokenization, stemming, and lemmatization

 
 
 
 

3.3.1 Tokenization

 
 

3.3.2 Stemming

 
 

3.3.3 Lemmatization

 
 

3.4 Skip-gram and continuous bag of words (CBOW)

 
 