UMAP for discovering your data

Jan 8, 2023

José Luis Marina

Introduction

UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction method that is often used as a first step before clustering. Its goal is to map high-dimensional data into a lower-dimensional space in a way that preserves as much of the structure of the original data as possible, in particular the neighborhood of each point. This makes it very useful for visualizing and better understanding complex datasets.

At taniwa, we have been working with UMAP for some time and have found it to be a very useful tool for better understanding data. In this article, we will see how UMAP works and how we can use it in our daily work.

Learn more

The main purpose of clustering is to group unlabeled objects into subsets of data known as clusters. Each cluster is a collection of elements that are similar to one another and that differ from the elements of the other clusters in the dataset.

Clustering is used in unsupervised machine learning, meaning we work without prior labeling or classification. In this setting, it segments the data into groups of elements with similar features.

We also apply dimensionality reduction strategies that simplify models, allowing us to identify which features are more or less important in the dataset.

The technique we are talking about is UMAP (Uniform Manifold Approximation and Projection), a dimensionality reduction technique that can be used for visualization, similarly to t-SNE, but also for general non-linear dimensionality reduction. The details of the underlying mathematics can be found in the paper on arXiv.

At taniwa, we like UMAP for (1) its ability to clearly separate clusters and (2) the intuitiveness of its graphical representations.

UMAP often separates groups of elements more clearly than t-SNE or PCA. It also tends to preserve more of the global structure of the data than t-SNE. This means that, in general, it gives a broad picture of the whole dataset while still preserving the close relationships between instances.

Be aware that you always have to tune the algorithm to your data, and that there are more options than the ones we will see here, as mentioned in this other article.

UMAP works as follows:

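First, the nearest neighbors of each data point in the high-dimensional space are found using a chosen distance metric. In practice this search is approximate, so the distance between every pair of points does not have to be calculated.
Then, a weighted graph is constructed in which each data point is a node and the edges of the graph connect nearby data points.
Next, an optimization algorithm (stochastic gradient descent) adjusts a low-dimensional layout of the points to minimize a cost function that measures how much the low-dimensional graph differs from the high-dimensional one. This is done iteratively for a fixed number of epochs, so the result is a good mapping rather than a guaranteed optimum.
Once the mapping is obtained, the data can be visualized in the lower-dimensional space, and traditional clustering techniques can be used to group the data into clusters.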
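
To make the first two steps more concrete, here is a minimal sketch in plain Python that links each point to its k nearest neighbors. It is only an illustration under simplifying assumptions: the function name knn_graph and the sample points are ours, the search is exact and brute-force, and the edges carry raw distances. The real UMAP implementation uses an approximate neighbor search and converts distances into edge weights before optimizing the layout.

```python
import math


def knn_graph(points, k):
    """Link every point to its k nearest neighbors.

    Returns a dict mapping a point index to a list of
    (neighbor_index, distance) pairs, closest neighbor first.
    """
    graph = {}
    for i, p in enumerate(points):
        # Brute-force search: distance from point i to every other point.
        distances = [
            (j, math.dist(p, q)) for j, q in enumerate(points) if j != i
        ]
        distances.sort(key=lambda pair: pair[1])
        graph[i] = distances[:k]
    return graph


# Two small groups of points in a 3-dimensional space (illustrative data).
points = [
    (0.0, 0.0, 0.0),
    (0.1, 0.0, 0.0),
    (0.0, 0.2, 0.0),
    (5.0, 5.0, 5.0),
    (5.1, 5.0, 5.0),
]

graph = knn_graph(points, k=2)
for node, neighbors in graph.items():
    print(node, [j for j, _ in neighbors])
```

Points that sit close together end up connected to each other first, and this neighborhood structure is what the optimization step then tries to reproduce in the low-dimensional space.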

As for how UMAP compares to other dimensionality reduction algorithms, there are some advantages and disadvantages to consider. In general, UMAP is very versatile and can work well on a wide variety of datasets, and its output is a good starting point for clustering. However, it can be slower than simpler methods such as PCA because of the iterative optimization process it relies on.

It is also important to note that UMAP is not suitable for all types of data and may not work as well on datasets with a very complicated structure or a lot of noise.

Example

A simple example of UMAP is the following:
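
As a minimal sketch, assuming the umap-learn, scikit-learn and matplotlib packages are installed, the code below projects scikit-learn's handwritten digits dataset to two dimensions and plots it. The dataset and the helper names embed_2d and plot_digits are our choices for illustration. The parameter values shown for n_neighbors and min_dist are the umap-learn defaults and will usually need tuning for other data.

```python
def embed_2d(data, n_neighbors=15, min_dist=0.1, random_state=42):
    """Project data of shape (n_samples, n_features) to 2D with UMAP."""
    # Imported inside the function so that umap-learn is only required
    # when an embedding is actually computed.
    import umap

    reducer = umap.UMAP(
        n_neighbors=n_neighbors,    # size of the local neighborhood
        min_dist=min_dist,          # how tightly points may be packed
        n_components=2,             # dimensions of the output space
        random_state=random_state,  # fixed seed for reproducible plots
    )
    return reducer.fit_transform(data)


def plot_digits():
    """Embed the digits dataset and draw a scatter plot colored by digit."""
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_digits

    digits = load_digits()
    embedding = embed_2d(digits.data)

    plt.scatter(
        embedding[:, 0], embedding[:, 1],
        c=digits.target, cmap="Spectral", s=5,
    )
    plt.colorbar(label="digit")
    plt.title("UMAP projection of the digits dataset")
    plt.show()
```

Calling plot_digits() opens the plot. Smaller values of n_neighbors emphasize local detail, while larger values emphasize the overall shape of the data.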
