I am trying to apply KMeans (scikit-learn) to the data shown below.
I have seen plenty of examples where float64 values are clustered. What I would like to know is whether clustering is possible on the column df[['Description']], with Longitude and Latitude as the x and y axes.
My code looks like this.
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np
import matplotlib
from sklearn.preprocessing import LabelEncoder
import pandas as pd
matplotlib.style.use('ggplot')
df = pd.read_csv('df.csv')
encoder =LabelEncoder()
Longitude = encoder.fit_transform(df.Longitude)
Latitude= df[df.columns[19]].values #(latitude)
x=np.array([Longitude, Latitude]).T
est = KMeans(3)
est.fit(df[['Longitude', 'Latitude', 'Description']])
But the last line, the call to est.fit, raises this error:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
in ()
----> 1 est.fit(df[['Longitude', 'Latitude', 'Description']])

c:\users\magiri\appdata\local\programs\python\python35-32\lib\site-packages\sklearn\cluster\k_means_.py in fit(self, X, y)
    878         """
    879         random_state = check_random_state(self.random_state)
--> 880         X = self._check_fit_data(X)
    881
    882         self.cluster_centers_, self.labels_, self.inertia_, self.n_iter_ = \

c:\users\magiri\appdata\local\programs\python\python35-32\lib\site-packages\sklearn\cluster\k_means_.py in _check_fit_data(self, X)
    852     def _check_fit_data(self, X):
    853         """Verify that the number of samples given is larger than k"""
--> 854         X = check_array(X, accept_sparse='csr', dtype=[np.float64, np.float32])
    855         if X.shape[0] < self.n_clusters:
    856             raise ValueError("n_samples=%d should be >= n_clusters=%d" % (

c:\users\magiri\appdata\local\programs\python\python35-32\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    380                                       force_all_finite)
    381     else:
--> 382         array = np.array(array, dtype=dtype, order=order, copy=copy)
    383
    384     if ensure_2d:

ValueError: could not convert string to float: 'GAME/DICE'
So, what I want is to cluster df.Description with reference to Longitude and Latitude. I know the Description column contains string values, which is why I am getting the error. Is there any way I can avoid this error and see the clustering of the Description column?
The K-means algorithm only works with numeric data. You could apply OneHotEncoder to your "Description" and "Location Description" fields to transform them into a one-hot-encoded representation. If your Description has hierarchical values, using CountVectorizer with a custom tokenizer could also be worth trying. To make sure Latitude / Longitude don't outweigh the other fields in the Euclidean distance, you can apply StandardScaler to your data prior to K-means.
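As a rough illustration of the suggestion above, here is a minimal sketch that one-hot encodes the Description column, standardizes the coordinates, and then runs K-means. The small DataFrame is a made-up stand-in for df.csv (only 'GAME/DICE' comes from the question; the other values and all coordinates are hypothetical), and scaling only the coordinate columns is one possible choice, not the only way to apply StandardScaler here.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in for df.csv: invented coordinates and descriptions
# (only 'GAME/DICE' appears in the original question).
df = pd.DataFrame({
    "Longitude": [-87.62, -87.63, -87.70, -87.71, -87.55, -87.56],
    "Latitude": [41.88, 41.89, 41.95, 41.96, 41.75, 41.76],
    "Description": ["GAME/DICE", "GAME/DICE", "SIMPLE",
                    "SIMPLE", "RETAIL THEFT", "RETAIL THEFT"],
})

# Standardize the numeric coordinates and one-hot encode the string column,
# so every feature passed to K-means is numeric.
preprocess = ColumnTransformer([
    ("coords", StandardScaler(), ["Longitude", "Latitude"]),
    ("desc", OneHotEncoder(handle_unknown="ignore"), ["Description"]),
])

# Chain preprocessing and clustering so both are fitted in one call.
model = Pipeline([
    ("preprocess", preprocess),
    ("kmeans", KMeans(n_clusters=3, n_init=10, random_state=0)),
])

# One cluster label per row of the DataFrame.
labels = model.fit_predict(df)
df["cluster"] = labels
print(df)
```

With real data you would replace the hand-built DataFrame with pd.read_csv('df.csv') and add "Location Description" to the one-hot encoded columns if you want it to influence the clusters.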