Is it possible to cluster non-float data with KMeans in Python (scikit-learn)?


I am trying to apply KMeans (scikit-learn) to the data linked here: Data.

I have seen plenty of examples where float64 values are clustered. What I would like to know is whether clustering is possible on the column df[['Description']], with Longitude and Latitude as the x and y axes.

My code looks like this.

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np
import matplotlib
from sklearn.preprocessing import LabelEncoder
import pandas as pd
matplotlib.style.use('ggplot')

df = pd.read_csv('df.csv')

encoder = LabelEncoder()
Longitude = encoder.fit_transform(df.Longitude)
Latitude = df[df.columns[19]].values  # latitude

x = np.array([Longitude, Latitude]).T

est = KMeans(3)

est.fit(df[['Longitude', 'Latitude', 'Description']])

The error I get on this line is:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 est.fit(df[['Longitude', 'Latitude', 'Description']])

c:\users\magiri\appdata\local\programs\python\python35-32\lib\site-packages\sklearn\cluster\k_means_.py in fit(self, X, y)
    878         """
    879         random_state = check_random_state(self.random_state)
--> 880         X = self._check_fit_data(X)
    881
    882         self.cluster_centers_, self.labels_, self.inertia_, self.n_iter_ = \

c:\users\magiri\appdata\local\programs\python\python35-32\lib\site-packages\sklearn\cluster\k_means_.py in _check_fit_data(self, X)
    852     def _check_fit_data(self, X):
    853         """Verify that the number of samples given is larger than k"""
--> 854         X = check_array(X, accept_sparse='csr', dtype=[np.float64, np.float32])
    855         if X.shape[0] < self.n_clusters:
    856             raise ValueError("n_samples=%d should be >= n_clusters=%d" % (

c:\users\magiri\appdata\local\programs\python\python35-32\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    380                                       force_all_finite)
    381     else:
--> 382         array = np.array(array, dtype=dtype, order=order, copy=copy)
    383
    384     if ensure_2d:

ValueError: could not convert string to float: 'GAME/DICE'

So, what I want to know is whether df.Description can be clustered with reference to Longitude and Latitude. I know the Description column has string values, which is why I am getting the error. Is there any way I can avoid this error and see the Description column clustered?


There are 2 answers below.


The k-means algorithm only works with numeric data. You could apply OneHotEncoder to your "Description" and "Location Description" fields to transform them into a one-hot-encoded representation. If your Description has hierarchical values, using CountVectorizer with a custom tokenizer could also be worth trying.

To make sure Latitude / Longitude don't outweigh the other fields in the Euclidean distance, you can apply StandardScaler to your data prior to k-means. A sketch of the whole approach follows below.
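A minimal sketch of that pipeline, assuming a df.csv with 'Longitude', 'Latitude', and 'Description' columns (ColumnTransformer requires scikit-learn >= 0.20):

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv('df.csv')  # assumed to contain the three columns above

# Scale the coordinates; one-hot encode the categorical column
preprocess = ColumnTransformer([
    ('coords', StandardScaler(), ['Longitude', 'Latitude']),
    ('desc', OneHotEncoder(handle_unknown='ignore'), ['Description']),
])

pipe = Pipeline([
    ('prep', preprocess),
    ('cluster', KMeans(n_clusters=3)),
])

labels = pipe.fit_predict(df[['Longitude', 'Latitude', 'Description']])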


I have successfully used kmodes and kprototypes to cluster categorical data. There is a Python implementation here: https://github.com/nicodv/kmodes. KModes clusters categorical data, and KPrototypes clusters mixed categorical and numerical data (a combination of k-means and k-modes). Sample usage from the GitHub page:

import numpy as np
from kmodes.kmodes import KModes

# random categorical data
data = np.random.choice(20, (100, 10))

km = KModes(n_clusters=4, init='Huang', n_init=5, verbose=1)

clusters = km.fit_predict(data)

# Print the cluster centroids
print(km.cluster_centroids_)

K-modes simply clusters based on matching categories between points. A simplified summary of the distance measure for k-prototypes is

distance = np.sum((a_num - b_num) ** 2) + gamma * np.sum(a_cat != b_cat)

where a_num and b_num are the numerical values of two points, and a_cat and b_cat are their categorical values. gamma weights the cost of categorical differences against numerical distance. Its default value is half the standard deviation of the numerical features (= 0.5 if the numerical features are normalised beforehand).
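For mixed data, a hedged sketch of KPrototypes along the same lines (the column layout and the categorical=[2, 3] indices here are illustrative assumptions, not taken from the question):

import numpy as np
from kmodes.kprototypes import KPrototypes

# Object array: two numeric columns followed by two categorical columns
data = np.empty((100, 4), dtype=object)
data[:, :2] = np.random.rand(100, 2)
data[:, 2:] = np.random.choice(['a', 'b', 'c'], (100, 2))

kp = KPrototypes(n_clusters=3, init='Huang', n_init=5, verbose=1)
# categorical=[2, 3] tells KPrototypes which columns are categorical
clusters = kp.fit_predict(data, categorical=[2, 3])
print(kp.cluster_centroids_)

KPrototypes computes gamma automatically from the numerical columns unless you pass a gamma= value yourself.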