I am trying to use sklearn.cluster.OPTICS to cluster an already computed similarity (distance) matrix filled with normalized cosine distances (0.0 to 1.0),
but no matter what I pass as max_eps and eps I don't get any clusters out.
Later on I will need to run OPTICS on a similarity matrix of more than 129'000 x 129'000 items, hopefully relying on Dask to keep the memory footprint low.
I am extracting fastText vectors for a small number of words (each vector has 300 dimensions) and use dask-distance to create a similarity matrix from the vectors.
The result is a matrix looking like this:
sim == [[0. 0.56742118 0.42776633 0.42344265 0.84878847 0.87984235
0.87468601 0.95224451 0.89341788 0.80922083]
[0.56742118 0. 0.59779273 0.62900345 0.83004028 0.87549904
0.887784 0.8591598 0.80752158 0.80960947]
[0.42776633 0.59779273 0. 0.45120935 0.79292425 0.78556189
0.82378645 0.93107747 0.83290157 0.85349163]
[0.42344265 0.62900345 0.45120935 0. 0.81379353 0.83985011
0.8441614 0.89824009 0.77074847 0.81297649]
[0.84878847 0.83004028 0.79292425 0.81379353 0. 0.15328565
0.36656755 0.79393195 0.76615941 0.83415538]
[0.87984235 0.87549904 0.78556189 0.83985011 0.15328565 0.
0.36000894 0.7792588 0.77379052 0.83737352]
[0.87468601 0.887784 0.82378645 0.8441614 0.36656755 0.36000894
0. 0.82404421 0.86144969 0.87628284]
[0.95224451 0.8591598 0.93107747 0.89824009 0.79393195 0.7792588
0.82404421 0. 0.521453 0.5784272 ]
[0.89341788 0.80752158 0.83290157 0.77074847 0.76615941 0.77379052
0.86144969 0.521453 0. 0.629014 ]
[0.80922083 0.80960947 0.85349163 0.81297649 0.83415538 0.83737352
0.87628284 0.5784272 0.629014 0. ]]
which looks like something I could cluster using a threshold of 0.8, for example:
from dask import array as da
import dask_distance
import logging
import numpy as np
from sklearn.cluster import OPTICS
from collections import defaultdict
log = logging.warning
np.set_printoptions(suppress=True)
if __name__ == "__main__":
    array = np.load("vectors.npy")
    vectors = da.from_array(array)
    sim = dask_distance.cosine(vectors, vectors)
    sim = sim.clip(0.0, 1.0)
    m = np.max(sim)
    c = OPTICS(eps=-1, cluster_method="dbscan", metric="precomputed", algorithm="brute")
    clusters = c.fit(sim)
    words = [
        "icecream",
        "cake",
        "cream",
        "ice",
        "dog",
        "cat",
        "animal",
        "car",
        "truck",
        "bus",
    ]
    cs = defaultdict(list)
    for index, c in enumerate(clusters.labels_):
        cs[c].append(words[index])
    for v in cs.values():
        log(v)
    log(clusters.labels_)
which prints
['icecream', 'cake', 'cream', 'ice', 'dog', 'cat', 'animal', 'car', 'truck', 'bus']
[-1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
but I was expecting there to be several clusters.
I have tried many different values for all the supported parameters in OPTICS but have not been able to get anything usable, or even more than one cluster.
I am using the following versions:
python -V
Python 3.7.3
sklearn.__version__
'0.21.3'
dask.__version__
'2.3.0'
numpy.__version__
'1.17.0'
Here is how it looks when using sklearn DBSCAN instead:
...
sim = sim.astype(np.float32)
c = DBSCAN(eps=0.7, min_samples=1, metric="precomputed", n_jobs=-1)
clusters = c.fit(sim)
...
yields
['icecream', 'cake', 'cream', 'ice']
['dog', 'cat', 'animal']
['car', 'truck', 'bus']
[0 0 0 0 1 1 1 2 2 2]
Which is exactly right, but has a much higher memory footprint (OPTICS apparently only needs to calculate half of the matrix).
Have you tried to estimate how much memory a 129000x129000 matrix needs - and how long it will take to compute and work with it? I strongly doubt that Dask will be that helpful in scaling this. You will need to use some indexing approach to avoid any O(n²) cost in the first place. Cutting O(n²) by a factor of k with k nodes just doesn't get you far enough to be scalable.
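For scale, a quick back-of-the-envelope estimate for a dense matrix of that size:

n = 129_000
print(n * n * 8 / 1e9)  # ~133 GB as float64
print(n * n * 4 / 1e9)  # ~66.6 GB as float32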
When you use "precomputed", you have already computed the full distance matrix. Neither OPTICS nor DBSCAN will compute it again (nor just the lower half of it) - they will only iterate over this huge matrix, because they cannot make any assumptions about it: not even that it is symmetric.
Why do you think eps=-1 is right? And what about min_samples with OPTICS? If you don't choose the same parameters, you of course won't get similar results from OPTICS and DBSCAN.
The result found by OPTICS with your parameters is correct: at eps=-1 no points are neighbors, and with the default min_samples=5 there are hence no clusters, so all points are labeled -1.
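A minimal sketch of a matched call, assuming you keep the precomputed matrix sim from the question and mirror the DBSCAN settings (eps around 0.7); note that OPTICS requires min_samples > 1, so 2 is used here instead of 1:

from sklearn.cluster import OPTICS

# sim: the precomputed cosine-distance matrix from the question
c = OPTICS(
    min_samples=2,            # OPTICS requires min_samples > 1
    max_eps=0.8,              # limit the reachability search to the intended threshold
    cluster_method="dbscan",  # extract DBSCAN-style clusters from the reachability plot
    eps=0.7,                  # same eps as the DBSCAN run above
    metric="precomputed",
)
clusters = c.fit(sim)
print(clusters.labels_)

Whether this reproduces the three DBSCAN groups exactly depends on min_samples, since OPTICS cannot use min_samples=1.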