
11 Maximizing variance and similarity: principal component analysis, t-SNE, and UMAP


This chapter covers:

  • Why do we need dimension reduction?
  • What are the problems of high dimensionality and collinearity?
  • What is principal component analysis?
  • What is t-SNE?
  • What is UMAP?

Our first stop in dimension reduction brings us to three very powerful and popular algorithms: principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP). All three of these dimension reduction algorithms turn a set of (potentially many) variables into a smaller number of variables that retain as much of the original, multi-dimensional information as possible. Each algorithm achieves this in a different way.
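
As a concrete preview, here is a minimal sketch (in Python, using scikit-learn and the umap-learn package; the chapter's own worked examples come later) that applies all three algorithms to the four-variable iris dataset, reducing it to two dimensions. The dataset and the two-component choice are illustrative assumptions, not this chapter's own examples.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap                                    # provided by the umap-learn package

X = load_iris().data                           # 150 cases described by 4 variables

# Ask each algorithm for a two-variable representation of the same data
X_pca = PCA(n_components=2).fit_transform(X)           # linear projection
X_tsne = TSNE(n_components=2).fit_transform(X)         # nonlinear neighbor embedding
X_umap = umap.UMAP(n_components=2).fit_transform(X)    # nonlinear manifold projection

print(X_pca.shape, X_tsne.shape, X_umap.shape)         # each is (150, 2)

Each call returns the same 150 cases described by just two new variables; how those variables are constructed, and which aspects of the original information they preserve, is what distinguishes the three algorithms and is the subject of the rest of this chapter.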

Note

The first historical example of dimension reduction was a two-dimensional map: a representation of our three-dimensional world in only two dimensions! Another form of dimension reduction that we encounter in our daily lives is the compression of audio into lossy formats like .mp3.

11.1  Why dimension reduction: visualization, the curse of dimensionality, and collinearity

11.1.1  Visualizing a dataset

11.1.2  Curse of dimensionality

11.1.3  Collinearity

11.1.4  Dimension reduction mitigates the curse of dimensionality and collinearity

11.2  What is principal component analysis?

11.3  Building our first principal component analysis model

11.3.1  Loading and exploring the banknote dataset

11.3.2  Performing PCA

11.3.3  Plotting the result of our PCA

11.3.4  Computing the component scores of new data

11.4  What is t-SNE?

11.5  Building our first t-SNE embedding

11.5.1  Performing t-SNE

11.5.2  Plotting the result of t-SNE

11.6  What is UMAP?

11.7  Building our first UMAP model

11.7.1  Performing UMAP