Using callable metric for HDBSCAN*


I want to cluster some data with HDBSCAN*.

The distance is computed as a weighted combination of per-column distance functions applied to both samples, so if the data looks like:

    label1 | label2 | label3
0    32        18.5     3
1    34.5      11       12
2    ..        ..      ..
3    ..        ..      ..

The distance between two samples will be something like:

def calc_dist(i, j):
    return (0.5 * dist_label1_func(data.iloc[i]['label1'], data.iloc[j]['label1']) +
            0.4 * dist_label2_func(data.iloc[i]['label2'], data.iloc[j]['label2']) +
            0.1 * dist_label3_func(data.iloc[i]['label3'], data.iloc[j]['label3']))

I can't compute a full distance matrix due to the size of the data, so a callable metric seems to be my only option.
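A quick back-of-the-envelope estimate shows why a full matrix can be out of reach (the sample count below is purely illustrative):

```python
# Rough memory estimate for a full float64 pairwise distance matrix.
# n is a hypothetical sample count, just to illustrate the quadratic growth.
n = 200_000
bytes_needed = n * n * 8  # 8 bytes per float64 entry

print(f"{bytes_needed / 1e9:.0f} GB")  # 320 GB for 200k samples
```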

My code looks like this:

clusterer = hdbscan.HDBSCAN(metric=calc_dist).fit(i=i,j=j)

ERROR: fit() got an unexpected keyword argument 'i'

clusterer = hdbscan.HDBSCAN(metric=calc_dist).fit(i,j)

ERROR: ValueError: Expected 2D array, got scalar array instead: array=4830. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

Neither works. I also tried passing the original dataset to fit:

clusterer = hdbscan.HDBSCAN(metric=calc_dist).fit(data)

ERROR: raise ValueError("Found array with dim %d. %s expected <= 2.") ValueError: setting an array element with a sequence.

but it doesn't accept that either.

What am I missing?

1 Answer

Usually the callable receives row vectors, not row indices.

So maybe you can use something like:

def calc_dist(i, j):
    # i and j arrive as NumPy rows in column order: [label1, label2, label3]
    return (0.5 * dist_label1_func(i[0], j[0]) +
            0.4 * dist_label2_func(i[1], j[1]) +
            0.1 * dist_label3_func(i[2], j[2]))

clusterer = hdbscan.HDBSCAN(metric=calc_dist).fit(data)

Or, if the types get in your way, you can pass an array of indices and keep your original distance function:

def calc_dist(i, j):
    # each "point" is a one-element index vector; recover the row numbers
    i, j = int(i[0]), int(j[0])
    return (0.5 * dist_label1_func(data.iloc[i]['label1'], data.iloc[j]['label1']) +
            0.4 * dist_label2_func(data.iloc[i]['label2'], data.iloc[j]['label2']) +
            0.1 * dist_label3_func(data.iloc[i]['label3'], data.iloc[j]['label3']))

# np.intp instead of np.int8: int8 would overflow past 127 rows
indices = np.arange(len(data), dtype=np.intp).reshape(-1, 1)
clusterer = hdbscan.HDBSCAN(metric=calc_dist).fit(indices)
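A self-contained sanity check of this index-array pattern, using the two example rows from the question and hypothetical per-column distance functions (plain absolute differences here, standing in for the real dist_label*_func):

```python
import numpy as np
import pandas as pd

# The two rows from the question's example data.
data = pd.DataFrame({"label1": [32.0, 34.5],
                     "label2": [18.5, 11.0],
                     "label3": [3.0, 12.0]})

def calc_dist(i, j):
    # Each "point" is a one-element index vector; recover the row numbers.
    i, j = int(i[0]), int(j[0])
    a, b = data.iloc[i], data.iloc[j]
    # Hypothetical per-column distances: plain absolute differences.
    return (0.5 * abs(a["label1"] - b["label1"]) +
            0.4 * abs(a["label2"] - b["label2"]) +
            0.1 * abs(a["label3"] - b["label3"]))

indices = np.arange(len(data), dtype=np.intp).reshape(-1, 1)
d = calc_dist(indices[0], indices[1])  # ≈ 0.5*2.5 + 0.4*7.5 + 0.1*9.0 = 5.15
```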

Beware that the performance of sklearn and similar Python libraries with custom metrics can be very poor. Numba may or may not help.

For small enough data, it is usually advisable to compute a pairwise matrix and use metric="precomputed": it is much easier to write efficient code that precomputes the matrix (and reuse the existing efficient code for handling a precomputed matrix) than to make a custom metric fast inside the library code. Python is interpreted, so every single distance computation has to go through the interpreter; languages such as Java with a powerful JIT compiler are often better at optimizing such cases.
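For the small-data case, precomputing the weighted matrix could look roughly like this (a sketch that again assumes plain absolute differences for each per-column distance; the hdbscan call is shown commented out):

```python
import numpy as np
import pandas as pd

# Example rows in the shape of the question's data.
data = pd.DataFrame({"label1": [32.0, 34.5, 30.0],
                     "label2": [18.5, 11.0, 12.0],
                     "label3": [3.0, 12.0, 7.0]})

weights = np.array([0.5, 0.4, 0.1])
X = data.to_numpy()

# Broadcast to all pairs: |x_i - x_j| per column, then a weighted sum over columns.
dist_matrix = (np.abs(X[:, None, :] - X[None, :, :]) * weights).sum(axis=2)

# clusterer = hdbscan.HDBSCAN(metric="precomputed").fit(dist_matrix)
```

The matrix is symmetric with a zero diagonal, as metric="precomputed" expects.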