I have a pandas dataframe defined as
           Alejandro       Ana   Beatriz      Jose      Juan       Luz     Maria     Ruben
Alejandro        0.0       0.0    1000.0       0.0    1037.0    1014.0     100.0       0.0
Ana              0.0       0.0      15.0       0.0     100.0       0.0      16.0    1100.0
Beatriz       1000.0      15.0       0.0     100.0    1000.0    1100.0      15.0       0.0
Jose             0.0       0.0     100.0       0.0       0.0     100.0    1000.0      14.0
Juan          1037.0     100.0    1000.0       0.0       0.0    1014.0       0.0     100.0
Luz           1014.0       0.0    1100.0     100.0    1014.0       0.0       0.0       0.0
Maria          100.0      16.0      15.0    1000.0       0.0       0.0       0.0       0.0
Ruben            0.0    1100.0       0.0      14.0     100.0       0.0       0.0       0.0
This dataframe contains compatibility measurements. I want to split these people into groups of two or three, for example [Alejandro, Ana, Beatriz], [Jose, Juan, Luz], [Maria, Ruben].
To do that, I have to maximize the compatibility inside each group.
Can someone please point me towards a methodology?
It looks like you are starting off with a distance matrix rather than the original sample values.
AgglomerativeClustering can work with the distance matrix to group the samples into however many clusters you specify. Other algorithms that directly accept a precomputed matrix include DBSCAN, HDBSCAN, SpectralClustering, and OPTICS (note that SpectralClustering expects a precomputed affinity, i.e. similarity, matrix rather than distances).
In the code below, I ran AgglomerativeClustering on the data to get the cluster assigned to each name. Then, for visualisation, I represented the original distance matrix in 2D and coloured the points by their assigned cluster.
The data:
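A minimal sketch of how the matrix from the question could be rebuilt as a DataFrame. The variable names (names, values, df) are my own choices; the numbers are copied from the table above.

```python
import pandas as pd

names = ["Alejandro", "Ana", "Beatriz", "Jose", "Juan", "Luz", "Maria", "Ruben"]

# Pairwise values copied from the table in the question (row order = column order).
values = [
    [0, 0, 1000, 0, 1037, 1014, 100, 0],
    [0, 0, 15, 0, 100, 0, 16, 1100],
    [1000, 15, 0, 100, 1000, 1100, 15, 0],
    [0, 0, 100, 0, 0, 100, 1000, 14],
    [1037, 100, 1000, 0, 0, 1014, 0, 100],
    [1014, 0, 1100, 100, 1014, 0, 0, 0],
    [100, 16, 15, 1000, 0, 0, 0, 0],
    [0, 1100, 0, 14, 100, 0, 0, 0],
]
df = pd.DataFrame(values, index=names, columns=names, dtype=float)

# A precomputed pairwise matrix should be symmetric with a zero diagonal.
is_symmetric = bool((df.values == df.values.T).all())
print(df)
print("symmetric:", is_symmetric)
```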
Perform the clustering:
Visualise in a 2D coordinate space:
Updated solution
OP requires that cluster sizes are limited to 2 or 3, i.e. bounded between user-defined values. Initially I tried HDBSCAN, as it accepts a min and max specification for cluster sizes, but it failed with this small dataset (more info at the bottom).
My attempt below runs lots of KMeans trials to find a suitable clustering. It stops when it finds a clustering that contains no bad clusters (a bad cluster is one whose size doesn't meet the user-defined spec). A downside of this approach is that the quality of the clustering might be poor or variable, as we are relying on random initialisations of KMeans.
Failed, but worth trying on more data
Initially I tried HDBSCAN, as it takes both a min_cluster_size= and a max_cluster_size= parameter. However, it was flagging all the samples as anomalies. You might have better luck with a larger dataset.
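Returning to the KMeans retry loop described above, a minimal sketch could look like this. The function name bounded_kmeans, the trial count, and the decision to feed each person's row of the matrix to KMeans as a feature vector are all assumptions; the original attempt may have used a different input, such as 2D coordinates.

```python
import math
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans


def bounded_kmeans(X, min_size, max_size, n_trials=200):
    """Retry KMeans with different seeds until every cluster size is within bounds.

    Returns the labels of the first acceptable run, or None if no trial succeeds.
    """
    n = len(X)
    # Only these cluster counts can possibly satisfy the size bounds.
    k_min, k_max = math.ceil(n / max_size), n // min_size
    for seed in range(n_trials):
        for k in range(k_min, k_max + 1):
            labels = KMeans(n_clusters=k, n_init=1, random_state=seed).fit_predict(X)
            sizes = Counter(labels.tolist()).values()
            # A "bad" cluster is one whose size falls outside [min_size, max_size].
            if all(min_size <= s <= max_size for s in sizes):
                return labels
    return None


names = ["Alejandro", "Ana", "Beatriz", "Jose", "Juan", "Luz", "Maria", "Ruben"]
D = np.array([
    [0, 0, 1000, 0, 1037, 1014, 100, 0],
    [0, 0, 15, 0, 100, 0, 16, 1100],
    [1000, 15, 0, 100, 1000, 1100, 15, 0],
    [0, 0, 100, 0, 0, 100, 1000, 14],
    [1037, 100, 1000, 0, 0, 1014, 0, 100],
    [1014, 0, 1100, 100, 1014, 0, 0, 0],
    [100, 16, 15, 1000, 0, 0, 0, 0],
    [0, 1100, 0, 14, 100, 0, 0, 0],
], dtype=float)

# Assumption: each person is represented by their row of the matrix.
labels = bounded_kmeans(D, min_size=2, max_size=3)
if labels is None:
    print("No clustering met the size bounds; try more trials.")
else:
    for cluster_id in sorted(set(labels.tolist())):
        members = [name for name, lab in zip(names, labels) if lab == cluster_id]
        print(cluster_id, members)
```

The loop returns the first clustering that satisfies the size bounds, not the best one, so it may be worth collecting several valid results and comparing them by a score of your choosing.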