OPTICS parallelism


I have the following script (optics.py) to estimate clustering with precomputed distances:

from sklearn.cluster import OPTICS
import numpy as np

distances = np.load(r'distances.npy')
clust = OPTICS(metric='precomputed', n_jobs=-1)
clust = clust.fit(distances)

Looking at the htop output, I can see that only one CPU core is used

[screenshot: htop showing load on only one core]

despite the fact that scikit-learn runs the clustering in multiple processes:

[screenshot: multiple scikit-learn worker processes in the process list]

Why has n_jobs=-1 not resulted in all CPU cores being used?


3 Answers

Dmitrii Rashchenko (Best Answer)

I also faced this problem. According to some papers (for example this one; see the abstract), OPTICS is known to be challenging to parallelize because of its sequential nature. So sklearn probably tries to use all cores when you pass n_jobs=-1, but there is simply nothing to run on the extra cores.

You should probably consider other clustering algorithms that are more parallelism-friendly; for example, @paul-brodersen suggests HDBSCAN in the comments. But sklearn does not seem to have a parallel alternative to OPTICS, so you would need to use other packages.
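For illustration, here is a minimal sketch using the separate hdbscan package (not part of sklearn); the data and parameter values below are placeholders, not from the question. The core_dist_n_jobs parameter parallelizes the core-distance computation:

import numpy as np
import hdbscan

# Stand-in data; replace with your own feature matrix.
X = np.random.rand(1000, 8)

# core_dist_n_jobs=-1 fans the core-distance computation out across
# all available cores; the hierarchy construction itself stays sequential.
clusterer = hdbscan.HDBSCAN(min_cluster_size=15, core_dist_n_jobs=-1)
labels = clusterer.fit_predict(X)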

anactualtoaster

Both OPTICS and HDBSCAN suffer from a lack of parallelization. Both are sequential in nature and thus can't be handed off to a simple joblib.Parallel the way DBSCAN can.

If you're looking to improve speed, one of the benefits of HDBSCAN is the ability to create an inference model that you can use to make predictions without having to rerun the whole clustering. That's what I use to avoid running a very slow clustering operation every time I need to classify my data.
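A minimal sketch of that workflow, assuming the hdbscan package (the stand-in data and min_cluster_size value are placeholders):

import numpy as np
import hdbscan

X_train = np.random.rand(1000, 8)  # stand-in training data

# prediction_data=True caches the extra structures needed for inference.
clusterer = hdbscan.HDBSCAN(min_cluster_size=15, prediction_data=True)
clusterer.fit(X_train)

# Classify new points against the fitted model without reclustering.
X_new = np.random.rand(10, 8)
labels, strengths = hdbscan.approximate_predict(clusterer, X_new)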

Shane Grigsby

I'm the primary author of the sklearn OPTICS module. Parallelism is difficult because there is an ordering loop which cannot be run in parallel; that said, the most computationally intensive task is the distance calculations, and those can be run in parallel. More specifically, sklearn OPTICS calculates the upper-triangle distance matrix one row at a time, starting with 'n' distance lookups and decreasing to 'n-1', 'n-2', and so on, for a total of n-squared / 2 distance calculations. The problem is that parallelism in sklearn is generally handled by joblib, which uses processes (not threads), and processes have rather high creation and destruction overhead when used in a loop: you would create and destroy the worker processes for each row as you loop through the data set, and 'n' setups/teardowns of processes costs more than the parallelism benefit you get from joblib. This is why n_jobs is disabled for OPTICS.
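To make the sequential dependency concrete, here is a toy sketch (my own illustration, not the sklearn source): each iteration consumes the reachability values produced by all previous iterations, so the loop itself cannot be parallelized; only the per-row distance lookups can.

import numpy as np

def toy_optics_order(D):
    # D: precomputed (n, n) distance matrix; returns a processing order.
    n = D.shape[0]
    reachability = np.full(n, np.inf)
    processed = np.zeros(n, dtype=bool)
    order = []
    current = 0
    for _ in range(n):
        processed[current] = True
        order.append(current)
        unprocessed = np.flatnonzero(~processed)
        if unprocessed.size == 0:
            break
        # One shrinking row of distance lookups per iteration:
        # n-1, then n-2, ... remaining points.
        reachability[unprocessed] = np.minimum(
            reachability[unprocessed], D[current, unprocessed])
        # The next point depends on every update made so far,
        # which is what forces the loop to run sequentially.
        current = unprocessed[np.argmin(reachability[unprocessed])]
    return order

rng = np.random.default_rng(0)
X = rng.random((50, 2))
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
print(toy_optics_order(D))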

The best way to 'force' parallelism in OPTICS is probably to define a custom distance metric that runs in parallel; see this post for a good example:

https://medium.com/aspectum/acceleration-for-the-nearest-neighbor-search-on-earths-surface-using-python-513fc75984aa

One of the examples in the post above actually forces the distance calculation onto a GPU, but still uses sklearn for the algorithm execution.
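A CPU-only variation of the same idea (a sketch, not the GPU approach from the post, and using stand-in data): compute the full distance matrix in parallel up front with pairwise_distances, then run the sequential OPTICS ordering on the precomputed result, which matches the question's own precomputed setup:

import numpy as np
from sklearn.cluster import OPTICS
from sklearn.metrics import pairwise_distances

X = np.random.rand(1000, 8)  # stand-in data

# The distance computation fans out across all cores ...
D = pairwise_distances(X, metric='euclidean', n_jobs=-1)

# ... while the ordering loop itself stays single-threaded.
clust = OPTICS(metric='precomputed').fit(D)
print(clust.labels_[:10])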