Chapter 14. Maximizing similarity with t-SNE and UMAP
This chapter covers
- Understanding nonlinear dimension reduction
- Using t-distributed stochastic neighbor embedding
- Using uniform manifold approximation and projection
In the last chapter, I introduced you to PCA as our first dimension-reduction technique. While PCA is a linear dimension-reduction algorithm (it finds linear combinations of the original variables), sometimes the information in a set of variables can’t be extracted as a linear combination of these variables. In such situations, there are a number of nonlinear dimension-reduction algorithms we can turn to, such as t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP).
t-SNE is one of the most popular nonlinear dimension-reduction algorithms. It measures the distance between each observation in the dataset and every other observation, and then randomizes the observations across (usually) two new axes. The observations are then iteratively shuffled around these new axes until their distances to each other in this two-dimensional space are as similar to the distances in the original high-dimensional space as possible.