I don't fully understand how the triangle inequality is used to optimise distance calculations in KNN classification.
I have written a Python script following the steps below:
- Precompute the distance between every pair of training samples.
- For each test sample:
  - Calculate the distance to the first training sample as dn. This is the current minimum distance.
  - Calculate the distance to the second training sample (p) as dp.
  - If dp < dn, assign dn = dp.
  - For each remaining training sample (c):
    - If the precomputed distance dcp between sample c and sample p satisfies
      dp - dn < dcp < dp + dn
      calculate the distance from the test sample to sample c as the new dp, and if dp < dn, assign dn = dp.
    - Else, skip this training sample: by the triangle inequality its distance to the test sample is at least |dp - dcp| >= dn, so it cannot beat the current minimum (see the sketch after this list).
  - Stop when there are no more training samples.
- The class of the nearest sample n (the one at distance dn) is the estimate.
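To pin down what I think these steps mean, here is a minimal sketch of them as a standalone function. The name nn_triangle and the precomputed pairwise table are mine, and I am assuming p is meant to track the most recent sample whose true distance dp was actually computed (since dp is reassigned at that step); I am not certain this reading is right, which is part of what I am asking:

def nn_triangle(train_set, q, dist):
    # Step 1: precompute the distance between every pair of training samples.
    pairwise = [[dist(a, b) for b in train_set] for a in train_set]
    d_n, best = dist(q, train_set[0]), 0  # dn: current minimum distance
    d_p, p = dist(q, train_set[1]), 1     # p: last sample whose true distance dp was computed
    if d_p < d_n:
        d_n, best = d_p, 1
    for c in range(2, len(train_set)):
        dcp = pairwise[p][c]
        # By the triangle inequality dist(q, c) >= |dp - dcp|, so c can
        # only beat dn when dp - dn < dcp < dp + dn.
        if d_p - d_n < dcp < d_p + d_n:
            d_p, p = dist(q, train_set[c]), c  # c becomes the new reference p
            if d_p < d_n:
                d_n, best = d_p, c
    return train_set[best], d_n

The point of this reading is that dp and dcp always refer to the same reference sample p.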
Python Script:
def get_distance(p1, p2):
    # Manhattan (L1) distance; the class label at index 2 is ignored.
    return abs(p1[0] - p2[0]) + abs(p1[1] - p2[1])

def algorithm(train_set, new_point):
    d_n = get_distance(new_point, train_set[0])  # current minimum distance dn
    d_p = get_distance(new_point, train_set[1])  # distance to the second sample, dp
    min_index = 0
    if d_p < d_n:
        d_n = d_p
        min_index = 1
    for c in range(2, len(train_set)):
        # dcp is measured from the current nearest sample, train_set[min_index]
        dcp = get_distance(train_set[min_index], train_set[c])
        # pruning test from the steps above
        if d_p - d_n < dcp < d_p + d_n:
            d_p = get_distance(new_point, train_set[c])
            if d_p < d_n:
                d_n = d_p
                min_index = c
    print(train_set[min_index], d_n)
train_set = [
    (0, 1, 'A'),
    (1, 1, 'A'),
    (2, 5, 'B'),
    (1, 8, 'A'),
    (5, 3, 'C'),
    (4, 2, 'C'),
    (3, 2, 'A'),
    (1, 7, 'B'),
    (4, 8, 'B'),
    (4, 0, 'A'),
]
for new_point in train_set:
    # Check the distances of points within the training set itself:
    # the minimum distance must be 0 (each point against itself), used for validation.
    result_point = min(train_set, key=lambda x: get_distance(x, new_point))
    print(result_point, get_distance(result_point, new_point))
    algorithm(train_set, new_point)
    print('----------')
But it doesn't give the required result for one of the points.
Is my understanding of the optimization wrong?
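For what it's worth, the bound itself seems to hold on made-up points (the coordinates below are just a sanity check of the Manhattan-distance triangle inequality), so I suspect the issue is in how I apply it rather than in the bound:

q, p, c = (0, 0), (3, 1), (5, 4)           # made-up points for a sanity check
dqp = abs(q[0] - p[0]) + abs(q[1] - p[1])  # d(q, p) = 4
dcp = abs(c[0] - p[0]) + abs(c[1] - p[1])  # d(c, p) = 5
dqc = abs(q[0] - c[0]) + abs(q[1] - c[1])  # d(q, c) = 9
assert abs(dqp - dcp) <= dqc <= dqp + dcp  # 1 <= 9 <= 9 holds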
Thank you in advance for any help.