In chapter 2, I pointed out that neural networks can deal only with numbers, whereas almost everything in natural language is discrete (i.e., made up of separate, distinct concepts). To use neural networks in your NLP application, you need to convert linguistic units such as words into numbers, typically vectors. For example, if you want to build a sentiment analyzer, you need to convert the input sentence (a sequence of words) into a sequence of vectors, as sketched below. In this chapter, we'll discuss word embeddings, which are the key to bridging this gap. We'll also touch on a couple of fundamental linguistic components that are important for understanding embeddings and neural networks in general.
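To make this concrete, here is a minimal sketch of what "converting a sentence into a sequence of vectors" might look like. The lookup table and its three-dimensional vectors are made up for illustration; real embeddings are learned from data and have far more dimensions.

```python
# Hypothetical word-to-vector lookup table (values are made up for illustration).
embeddings = {
    "the":   [0.1, 0.3, 0.2],
    "movie": [0.5, 0.1, 0.4],
    "was":   [0.2, 0.2, 0.1],
    "great": [0.9, 0.7, 0.3],
}

sentence = "the movie was great"

# Convert the sentence (a sequence of words) into a sequence of vectors,
# one vector per word, which a neural network can then consume.
vectors = [embeddings[word] for word in sentence.split()]
print(vectors)
```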
As we discussed in chapter 2, an embedding is a real-valued vector representation of something that is usually discrete. In this section, we’ll revisit what embeddings are and discuss in detail what roles they play in NLP applications.
A word embedding is a real-valued vector representation of a word. If you find the concept of vectors intimidating, think of them as one-dimensional arrays of floating-point numbers, like the following:
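The example below is illustrative: the word "dog" and the specific values are made up, and real embeddings learned from data typically have tens to hundreds of dimensions.

```python
# A word embedding for "dog" as a plain one-dimensional array (Python list)
# of floating-point numbers. The values here are illustrative, not learned.
vec_dog = [0.8, 0.3, 0.1]
```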