UMAP
Created: 2022-04-21 16:33
#note
UMAP (Uniform Manifold Approximation and Projection for Dimension Reduction) is much faster and more scalable than t-SNE, while also preserving the global structure of the data much better. This makes it useful for both visualization and as a preprocessing dimensionality reduction step to use before clustering.
It is based on three assumptions:
- The data is uniformly distributed on a Riemannian manifold;
- The Riemannian metric is locally constant (or can be approximated as such);
- The manifold is locally connected.
From these assumptions it is possible to model the manifold with a fuzzy topological structure. The embedding is found by searching for a low-dimensional projection of the data whose fuzzy topological structure is as close as possible to that of the original data.
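To make the "fuzzy topological structure" more concrete, here is a minimal pure-Python sketch of how the weighted neighbour graph can be built, following the formulation in the UMAP paper: each point gets a local scale so that its neighbour weights sum to log2(k), and the two directed weights between a pair of points are merged with a fuzzy union. The function names `membership_strengths` and `fuzzy_graph` are made up for this note, and the real library uses approximate nearest neighbours and differs in details, so treat this as an illustration only.

```python
import math

def membership_strengths(knn_dists):
    """Weights of one point's k nearest neighbours (simplified sketch).

    rho is the distance to the closest neighbour; sigma is found by
    bisection so that the weights sum to log2(k).
    """
    k = len(knn_dists)
    target = math.log2(k)
    rho = min(knn_dists)

    def total(sigma):
        return sum(math.exp(-max(0.0, d - rho) / sigma) for d in knn_dists)

    # Grow the upper bound until the sum reaches the target, then bisect.
    lo, hi = 0.0, 1.0
    while total(hi) < target:
        hi *= 2.0
    for _ in range(64):
        mid = (lo + hi) / 2.0
        if total(mid) < target:
            lo = mid
        else:
            hi = mid
    sigma = hi
    return [math.exp(-max(0.0, d - rho) / sigma) for d in knn_dists]

def fuzzy_graph(points, k):
    """Symmetric fuzzy graph over `points` using exact k nearest neighbours."""
    n = len(points)
    directed = [[0.0] * n for _ in range(n)]
    for i, p in enumerate(points):
        neighbours = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        strengths = membership_strengths([d for d, _ in neighbours])
        for (_, j), w in zip(neighbours, strengths):
            directed[i][j] = w
    # Fuzzy union of the two directed weights: a + b - a*b.
    return [
        [
            directed[i][j] + directed[j][i] - directed[i][j] * directed[j][i]
            for j in range(n)
        ]
        for i in range(n)
    ]

# Two small, well-separated groups of 2D points (made-up data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0), (5.0, 5.3)]
graph = fuzzy_graph(points, k=3)
```

In this graph the nearest neighbour of every point always gets weight 1 (this is where the "locally connected" assumption shows up), while far-away neighbours get weights close to 0. The low-dimensional embedding is then optimized so that its own fuzzy graph matches this one as closely as possible.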
It is generally considered better than t-SNE because it scales much better with increasing sample size, it is faster, and it preserves more of the global structure of the data (other reasons are explained here).
There are concerns about using UMAP with density-based clustering methods such as HDBSCAN because UMAP, like t-SNE, does not completely preserve density. That means that we can't be sure that the clusters we obtain are "real" and not just artifacts of the embedding. A discussion of t-SNE's shortcomings in this respect is available here. Even so, UMAP preserves the data's structure much better than t-SNE and can be used as a preprocessing step for clustering, as shown here.