Cluster a list of geographic points by distance and constraints

2.4k Views Asked by At

I have a delivery app, and I want to group orders (each order has a lat and lng coordinates) by location proximity (linear distance) and constraints like max orders and max total products (each order has an amount of products) inside a group.

For proximity grouping I used DBSCAN

coordinates = [[lat,lng],[lat,lng]],[lat,lng]],[lat,lng]],[lat,lng]]]
distance_matrix = squareform(pdist(coordinates, (lambda u,v: haversine(u,v))))

#eps=0.1 => 100m radius, 50m linear
db = DBSCAN(eps=0.1, min_samples=2, metric='precomputed')
results = db.fit(distance_matrix)

How can I add constraints in this functionality?

Is there anyway of doing this by using something else than DBSCAN or HDBSCAN ?

2

There are 2 best solutions below

0
On

Unfortunately, I think the model you want should be developed from the scratch.

Your question can be modeled as following optimization model.

Objective function: minimize the number of clusters, K Constraint (1) The size of every cluster is equal or smaller than S (parameter) (2) The number of orders in every cluster is equal or smaller than O (parameter) (3) For a sample x and its cluster Ck, dist(x, Ck) be the minimum among dist(x, C1), dist(x, C2), ..., dist(x, CK).

Solving this problem would require a lot of efforts...

0
On

This is an interesting question. I suppose it can be done a bunch of different ways. Here is one solution for you to condsider.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import seaborn as sns; sns.set()
import csv


df = pd.read_csv('C:\\your_path\\properties_2017.csv')
# df.head(10)
df = df.head(10000)

df.shape


df.dropna(axis=0,how='any',subset=['latitude','longitude'],inplace=True)

# Variable with the Longitude and Latitude
X=df.loc[:,['parcelid','latitude','longitude']]
X.head(10)

K_clusters = range(1,10)
kmeans = [KMeans(n_clusters=i) 

for i in K_clusters]
Y_axis = df[['latitude']]
X_axis = df[['longitude']]
score = [kmeans[i].fit(Y_axis).score(Y_axis)

for i in range(len(kmeans))] # Visualize
plt.plot(K_clusters, score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()

enter image description here

kmeans = KMeans(n_clusters = 10, init ='k-means++')
kmeans.fit(X[X.columns[1:3]]) # Compute k-means clustering.X['cluster_label'] = kmeans.fit_predict(X[X.columns[1:3]])centers = kmeans.cluster_centers_ # Coordinates of cluster centers.labels = kmeans.predict(X[X.columns[1:3]]) # Labels of each pointX.head(10)

X['cluster_label'] = kmeans.fit_predict(X[X.columns[1:3]])
centers = kmeans.cluster_centers_ # Coordinates of cluster centers.
labels = kmeans.predict(X[X.columns[1:3]]) # Labels of each pointX.head(10)

X.head(5)

X = X[['parcelid','cluster_label']]
X.head(5)


clustered_data = df.merge(X, left_on='parcelid', right_on='parcelid')
clustered_data.head(5)

centers = kmeans.cluster_centers_
print(centers)


X=df.loc[:,['parcelid','latitude','longitude']]
X.plot.scatter(x = 'latitude', y = 'longitude', c=labels, s=50, cmap='viridis')
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5)

enter image description here
data = X labels = kmeans.labels_

plt.subplots_adjust(bottom = 0.1)
plt.scatter(data.iloc[:, 1], data.iloc[:, 2], c=kmeans.labels_, cmap='rainbow') 

for label, x, y in zip(labels, data.iloc[:, 1], data.iloc[:, 2]):
    plt.annotate(
        label,
        xy=(x, y), xytext=(-20, 20),
        textcoords='offset points', ha='right', va='bottom',
        bbox=dict(boxstyle='round,pad=0.5', fc='red', alpha=0.5),
        arrowprops=dict(arrowstyle = '->', connectionstyle='arc3,rad=0'))

plt.show()

# labels pointing to each data point (this is a big jumbled together; you should probably select fewer data points to analyze).

enter image description here

Reference:

https://levelup.gitconnected.com/clustering-gps-co-ordinates-forming-regions-4f50caa7e4a1

Data source:

https://www.kaggle.com/c/zillow-prize-1/data