Minimal reproducible example:
import cudf
from cuml.neighbors import KNeighborsRegressor
d = {
'id':['a','b','c','d','e','f'],
'latitude':[50,-22,13,37,43,14],
'longitude':[3,-43,100,27,-4,121],
}
df = cudf.DataFrame(d)
knn = KNeighborsRegressor(n_neighbors = 4, metric = 'haversine')
knn.fit(df[['latitude','longitude']],df.index)
dists, nears = knn.kneighbors(df[['latitude','longitude']], return_distance = True)
This throws the error "number of landmark samples must be >= k".
The full traceback is:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
/tmp/ipykernel_33/1073358290.py in <module>
10 knn = KNeighborsRegressor(n_neighbors = 4, metric = 'haversine')
11 knn.fit(df[['latitude','longitude']],df.index)
---> 12 dists, nears = knn.kneighbors(df[['latitude','longitude']], return_distance = True)
/opt/conda/lib/python3.7/site-packages/cuml/internals/api_decorators.py in inner_get(*args, **kwargs)
584
585 # Call the function
--> 586 ret_val = func(*args, **kwargs)
587
588 return cm.process_return(ret_val)
cuml/neighbors/nearest_neighbors.pyx in cuml.neighbors.nearest_neighbors.NearestNeighbors.kneighbors()
cuml/neighbors/nearest_neighbors.pyx in cuml.neighbors.nearest_neighbors.NearestNeighbors._kneighbors()
cuml/neighbors/nearest_neighbors.pyx in cuml.neighbors.nearest_neighbors.NearestNeighbors._kneighbors_dense()
RuntimeError: exception occured! file=_deps/raft-src/cpp/include/raft/spatial/knn/detail/ball_cover.cuh line=326: number of landmark samples must be >= k
Obtained 64 stack frames
...
I have been trying to get around this error for days, but the only way I know is to convert the cuDF DataFrame to a pandas DataFrame and use scikit-learn, which works perfectly:
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
d = {
'id':['a','b','c','d','e','f'],
'latitude':[50,-22,13,37,43,14],
'longitude':[3,-43,100,27,-4,121],
}
df = pd.DataFrame(d)
knn = KNeighborsRegressor(n_neighbors = 4, metric = 'haversine')
knn.fit(df[['latitude','longitude']],df.index)
dists, nears = knn.kneighbors(df[['latitude','longitude']], return_distance = True)
dists
gives us the distances array.
Can you help me find a pure RAPIDS solution?
UPDATE: I found that it works when the number of neighbors is <= len(df) // 2.
UPDATE: It's a bug, and an appropriate issue has been opened here. Until the issue is resolved, we can pass algorithm='brute' to KNeighborsRegressor as a workaround.
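To sanity-check the output of the brute-force workaround, here is a minimal CPU-only sketch of what an exhaustive haversine k-nearest-neighbors search computes. It is not the cuML or scikit-learn implementation; the helper names haversine and brute_kneighbors are made up for illustration. scikit-learn documents its haversine metric as taking [latitude, longitude] in radians, so the sketch converts the degree values from the example first; that cuML follows the same convention is an assumption here.

```python
import math

def haversine(p, q):
    """Great-circle distance on the unit sphere; p and q are (lat, lon) in radians."""
    dlat = q[0] - p[0]
    dlon = q[1] - p[1]
    a = (math.sin(dlat / 2) ** 2
         + math.cos(p[0]) * math.cos(q[0]) * math.sin(dlon / 2) ** 2)
    # Clamp to guard against rounding slightly above 1 for near-antipodal points.
    return 2 * math.asin(math.sqrt(min(1.0, a)))

def brute_kneighbors(points, k):
    """Hypothetical helper: compare every point against every other point
    and keep the k closest, returning (distances, indices) per query point."""
    dists, nears = [], []
    for p in points:
        ranked = sorted((haversine(p, q), j) for j, q in enumerate(points))[:k]
        dists.append([d for d, _ in ranked])
        nears.append([j for _, j in ranked])
    return dists, nears

# Same coordinates as in the question, converted from degrees to radians.
latitude = [50, -22, 13, 37, 43, 14]
longitude = [3, -43, 100, 27, -4, 121]
points = [(math.radians(la), math.radians(lo))
          for la, lo in zip(latitude, longitude)]

dists, nears = brute_kneighbors(points, k=4)
```

Each row of dists starts with 0.0, because every point is its own nearest neighbor when the query set equals the training set. The distances are in radians on the unit sphere; multiplying by the Earth's radius (roughly 6371 km) gives kilometres.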