Python: Make UMAP fast(er)


I am using UMAP (https://umap-learn.readthedocs.io/en/latest/#) to reduce the dimensionality of my data. My dataset contains 4,700 samples with 1.2 million features each (which I would like to reduce). However, this takes quite a while despite using 32 CPUs and 120 GB of RAM. The construction of the embedding in particular is slow, and the verbose output has not changed in the last 3.5 hours:

UMAP(dens_frac=0.0, dens_lambda=0.0, low_memory=False, n_neighbors=10,
     verbose=True)
Construct fuzzy simplicial set
Mon Jul  5 09:43:28 2021 Finding Nearest Neighbors
Mon Jul  5 09:43:28 2021 Building RP forest with 59 trees
Mon Jul  5 10:06:10 2021 metric NN descent for 20 iterations
     1  /  20
     2  /  20
     3  /  20
     4  /  20
     5  /  20
    Stopping threshold met -- exiting after 5 iterations
Mon Jul  5 10:12:14 2021 Finished Nearest Neighbor Search
Mon Jul  5 10:12:25 2021 Construct embedding

Are there any ways to make this process faster? I am already using a sparse matrix (scipy.sparse.lil_matrix) as described here: https://umap-learn.readthedocs.io/en/latest/sparse.html. Additionally, I have installed pynndescent (as described here: https://github.com/lmcinnes/umap/issues/416). My code is as follows:

from scipy.sparse import lil_matrix
import numpy as np
import umap.umap_ as umap

# dense 0/1 term matrix (4,700 samples x 1.2 million features)
term_dok_matrix = np.load('term_dok_matrix.npy')
# convert to a sparse float32 matrix before handing it to UMAP
term_dok_mat_lil = lil_matrix(term_dok_matrix, dtype=np.float32)

test = umap.UMAP(a=None, angular_rp_forest=False, b=None,
     force_approximation_algorithm=False, init='spectral', learning_rate=1.0,
     local_connectivity=1.0, low_memory=False, metric='euclidean',
     metric_kwds=None, n_neighbors=10, min_dist=0.1, n_components=2, n_epochs=None, 
     negative_sample_rate=5, output_metric='euclidean',
     output_metric_kwds=None, random_state=None, repulsion_strength=1.0,
     set_op_mix_ratio=1.0, spread=1.0, target_metric='categorical',
     target_metric_kwds=None, target_n_neighbors=-1, target_weight=0.5,
     transform_queue_size=4.0, unique=False, verbose=True).fit_transform(term_dok_mat_lil)

Are there any tricks or ideas for making the computation faster? Can I change some parameters? Does it help that my matrix consists only of zeros and ones (i.e. all non-zero entries in my matrix are a one)?


There are 2 answers below.

Leland McInnes (accepted answer):

With 1.2 million features and only 4,700 samples, you are going to be better off just precomputing the full distance matrix and passing that in with metric="precomputed". Currently it is expending a lot of work computing nearest neighbors of those 1.2-million-long vectors. Just brute forcing it will be a lot better.
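
A minimal sketch of that suggestion (assuming scikit-learn is available for the brute-force distance computation; the file name and UMAP parameters are taken from the question):

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics import pairwise_distances
import umap

# sparse 0/1 term matrix from the question (4,700 x 1.2 million)
term_matrix = csr_matrix(np.load('term_dok_matrix.npy'), dtype=np.float32)

# brute-force the full 4,700 x 4,700 Euclidean distance matrix once;
# at this size it easily fits in memory
dists = pairwise_distances(term_matrix, metric='euclidean', n_jobs=-1)

# hand the precomputed distances to UMAP so it skips the expensive
# nearest-neighbor search on the 1.2-million-dimensional vectors
embedding = umap.UMAP(metric='precomputed', n_neighbors=10, min_dist=0.1,
                      n_components=2, verbose=True).fit_transform(dists)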

Le Quang Nam:

You can run PCA on the data set first. The maximum number of principal components is 4,700 (the number of samples), which is much more manageable than 1.2 million features.
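
A minimal sketch of that PCA step (assuming scikit-learn; TruncatedSVD is used here because it accepts sparse input directly, and n_components=100 is only an illustrative choice):

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# sparse 0/1 term matrix from the question
term_matrix = csr_matrix(np.load('term_dok_matrix.npy'), dtype=np.float32)

# reduce 1.2 million features to a small number of components;
# 100 is an arbitrary illustrative choice, anything up to 4,700 is possible
data_pca = TruncatedSVD(n_components=100, random_state=1).fit_transform(term_matrix)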

After that, you can calculate the precomputed_knn as:

import umap
from umap.umap_ import nearest_neighbors

precomputed_knn = nearest_neighbors(
        data_pca, n_neighbors = 3000, metric="euclidean",
        metric_kwds=None, angular=False, random_state=1)

then fit UMAP with the precomputed neighbors:

embedding = umap.UMAP(precomputed_knn=precomputed_knn).fit_transform(data_pca)