Python: Make UMAP fast(er)


I am using UMAP (https://umap-learn.readthedocs.io/en/latest/#) to reduce the dimensionality of my data. My dataset contains 4,700 samples with 1.2 million features each (which I would like to reduce). However, this takes quite a while despite using 32 CPUs and 120 GB of RAM. The construction of the embedding in particular is slow, and the verbose output has not changed in the last 3.5 hours:

UMAP(dens_frac=0.0, dens_lambda=0.0, low_memory=False, n_neighbors=10,
     verbose=True)
Construct fuzzy simplicial set
Mon Jul  5 09:43:28 2021 Finding Nearest Neighbors
Mon Jul  5 09:43:28 2021 Building RP forest with 59 trees
Mon Jul  5 10:06:10 2021 metric NN descent for 20 iterations
     1  /  20
     2  /  20
     3  /  20
     4  /  20
     5  /  20
    Stopping threshold met -- exiting after 5 iterations
Mon Jul  5 10:12:14 2021 Finished Nearest Neighbor Search
Mon Jul  5 10:12:25 2021 Construct embedding

Are there any ways to make this process faster? I am already using a sparse matrix (scipy.sparse.lil_matrix) as described here: https://umap-learn.readthedocs.io/en/latest/sparse.html. Additionally, I have installed pynndescent (as described here: https://github.com/lmcinnes/umap/issues/416). My code is as follows:

from scipy.sparse import lil_matrix
import numpy as np
import umap.umap_ as umap

# dense 0/1 term matrix (4,700 samples x 1.2 million features)
term_dok_matrix = np.load('term_dok_matrix.npy')
# convert to a sparse float32 matrix before handing it to UMAP
term_dok_mat_lil = lil_matrix(term_dok_matrix, dtype=np.float32)

test = umap.UMAP(a=None, angular_rp_forest=False, b=None,
     force_approximation_algorithm=False, init='spectral', learning_rate=1.0,
     local_connectivity=1.0, low_memory=False, metric='euclidean',
     metric_kwds=None, n_neighbors=10, min_dist=0.1, n_components=2, n_epochs=None, 
     negative_sample_rate=5, output_metric='euclidean',
     output_metric_kwds=None, random_state=None, repulsion_strength=1.0,
     set_op_mix_ratio=1.0, spread=1.0, target_metric='categorical',
     target_metric_kwds=None, target_n_neighbors=-1, target_weight=0.5,
     transform_queue_size=4.0, unique=False, verbose=True).fit_transform(term_dok_mat_lil)

Are there any tricks or ideas for making the computation faster? Can I change some parameters? Does it help that my matrix consists only of zeros and ones (i.e. all non-zero entries in my matrix are a one)?


There are 2 answers below.

Leland McInnes (accepted answer):

With 1.2 million features and only 4,700 samples, you are going to be better off just precomputing the full distance matrix and passing that in with metric="precomputed". Currently it is expending a lot of work computing nearest neighbors of those 1.2-million-long vectors. Just brute forcing it will be a lot better.
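
A minimal sketch of that suggestion (assuming scikit-learn is available for the brute-force distance computation; the file name and UMAP parameters are taken from the question):

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics import pairwise_distances
import umap

# sparse 0/1 term matrix from the question (4,700 x 1.2 million)
term_matrix = csr_matrix(np.load('term_dok_matrix.npy'), dtype=np.float32)

# brute-force the full 4,700 x 4,700 Euclidean distance matrix once;
# at this size it easily fits in memory
dists = pairwise_distances(term_matrix, metric='euclidean', n_jobs=-1)

# hand the precomputed distances to UMAP so it skips the expensive
# nearest-neighbor search on the 1.2-million-dimensional vectors
embedding = umap.UMAP(metric='precomputed', n_neighbors=10, min_dist=0.1,
                      n_components=2, verbose=True).fit_transform(dists)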

Le Quang Nam:

You can run PCA on the data set first. The maximum number of principal components is 4,700 (the number of samples), which is much more manageable than 1.2 million features.
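
A minimal sketch of that PCA step (assuming scikit-learn; TruncatedSVD is used here because it accepts sparse input directly, and n_components=100 is only an illustrative choice):

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# sparse 0/1 term matrix from the question
term_matrix = csr_matrix(np.load('term_dok_matrix.npy'), dtype=np.float32)

# reduce 1.2 million features to a small number of components;
# 100 is an arbitrary illustrative choice, anything up to 4,700 is possible
data_pca = TruncatedSVD(n_components=100, random_state=1).fit_transform(term_matrix)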

After that, you can calculate the precomputed_knn as:

import umap
from umap.umap_ import nearest_neighbors

precomputed_knn = nearest_neighbors(
        data_pca, n_neighbors = 3000, metric="euclidean",
        metric_kwds=None, angular=False, random_state=1)

then fit UMAP with the precomputed neighbors:

embedding = umap.UMAP(precomputed_knn=precomputed_knn).fit_transform(data_pca)