I work with Scikit-Learn's nearest neighbors/radius classification with a precomputed metric. That means I pass an n_samples_train x n_samples_train matrix of pairwise distances to the classifier's fit method.
Now I wonder why this has to be done. Fitting a knn classifier should simply mean "store the training samples"; the distance computation should only take place later, during generalization. At that step I of course calculate a distance matrix between my test samples and my training samples, i.e. a matrix of size n_samples_test x n_samples_train.
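To make it concrete, here is roughly what I do now (the data and names below are just made-up toy values to illustrate the shapes involved):

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
y_train = rng.integers(0, 2, size=100)
X_test = rng.normal(size=(20, 5))

# fit expects an n_samples_train x n_samples_train distance matrix
D_train = pairwise_distances(X_train, X_train, metric="euclidean")
clf = KNeighborsClassifier(n_neighbors=3, metric="precomputed")
clf.fit(D_train, y_train)

# predict expects an n_samples_test x n_samples_train matrix
# (distances from each test point to every training point)
D_test = pairwise_distances(X_test, X_train, metric="euclidean")
pred = clf.predict(D_test)
```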
In the case of an SVM, for example, I pass a precomputed matrix (the Gram matrix, a similarity matrix) to the fit method of the svm.SVC object. Then the optimization takes place, the support vectors are found, and so on. There, that matrix is absolutely necessary during training.
But I can see no reason why a precomputed matrix should be needed for fitting in neighbors/radius classification.
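For comparison, the SVM case looks roughly like this (again toy data, and a plain linear kernel chosen only as an example):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
y_train = rng.integers(0, 2, size=100)
X_test = rng.normal(size=(20, 5))

gram_train = X_train @ X_train.T        # linear kernel between training samples
svc = SVC(kernel="precomputed")
svc.fit(gram_train, y_train)            # the optimization really needs this matrix

gram_test = X_test @ X_train.T          # n_test x n_train kernel values
pred = svc.predict(gram_test)
```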
Can someone give me a relevant hint, please?
I'd love to skip the calculation of a training distance matrix for knn with scikit-learn.
Best greetings and thank you. :-)
This is old, but I happened to find it when searching for a related issue.
Essentially, it's a matter of performance. Take the case where you fit the k-neighbors/radius classifier once and then use it to classify several different sets of test points. If no distance work is done at fit time, the classifier would have to redo all of the per-pair distance computations every time you classify a new batch of points. The way those classifiers are implemented takes advantage of the fact that you're working with a proper metric (in particular, one satisfying the triangle inequality) and uses that to speed up the nearest-neighbor/radius searches for new points with a kd-tree or ball tree, a structure that puts bounds on the distances to points outside each subtree. Building such a tree costs roughly O(n*log(n)) for n training samples (at least for the ball tree, iirc), and each subsequent query for k neighbors then takes roughly O(log(n)) instead of scanning all n samples. Thus, by doing some work ahead of time, the classification of new points can be dramatically sped up.
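To illustrate the point (toy data; the ball tree is requested explicitly here only to make the fit-time work visible, the default algorithm="auto" would usually pick something similar on its own):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 10))
y_train = rng.integers(0, 3, size=1000)

clf = KNeighborsClassifier(n_neighbors=5, algorithm="ball_tree")
clf.fit(X_train, y_train)               # the ball tree is built here, once

for _ in range(3):                       # several different test sets
    X_test = rng.normal(size=(200, 10))
    clf.predict(X_test)                  # each call only queries the existing tree
```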
To answer your question with a practical solution: you don't need to pass a precomputed distance matrix if you want to use a custom metric. If you pass a callable as the metric, the distances will still be computed up front to a degree, but that happens transparently inside the fitting procedure and should in fact be more efficient than brute-force calculating the distances between all pairs of samples yourself. (NB: if you have sparse input, the classifier will still fall back to brute force. It will still use multiple cores and may therefore be preferable to doing it yourself, but it will behave differently.)
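For example, something along these lines (toy data again, and the callable is just Euclidean distance written as a Python function to stand in for whatever custom metric you have):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def my_distance(a, b):
    # any valid metric works; this one is ordinary Euclidean distance
    return np.sqrt(np.sum((a - b) ** 2))

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4))
y_train = rng.integers(0, 2, size=200)
X_test = rng.normal(size=(50, 4))

clf = KNeighborsClassifier(n_neighbors=3, metric=my_distance)
clf.fit(X_train, y_train)               # no explicit distance matrix anywhere
pred = clf.predict(X_test)
```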
So to summarize: you're absolutely right that a precomputed distance matrix is not strictly necessary for fitting a general k-nearest-neighbors classifier. However, by doing the distance work up front, whether you precompute the matrix yourself or pass a callable metric, the subsequent classification becomes much more efficient. Sklearn evidently made the choice to precompute for custom metrics, likely because the overhead of calling a Python function n*(n-1)/2 times makes that route much slower than the highly optimized built-in metrics, many of which are partly or wholly implemented in Cython. But you do not need to calculate the matrix as an explicit step before fitting.