ValueError: Input contains NaN, infinity or a value too large for dtype('float64') in k-means clustering


I am doing a k-means clustering project and I get this error when I apply OneHotEncoder to the categorical columns and StandardScaler to the numerical columns.

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

The data is clean: no null or missing values, no extreme values, and outliers have been removed.

How do I correct this?

My code is below:

# Columns to be one-hot encoded
columns_to_onehot = ['gender', 'category', 'payment_method']

# Columns to be scaled
columns_to_scale = ['age', 'quantity', 'price', 'total_amount']
# One Hot Encoding
encoder = OneHotEncoder(drop='first', sparse=False) # 'drop' parameter is set to 'first' to avoid multicollinearity

#encoder = LabelEncoder()
one_hot_encoded_columns = encoder.fit_transform(subset_df1[columns_to_onehot])
#getting the column names
column_names = encoder.get_feature_names(input_features=columns_to_onehot)  # get_feature_names_out() in sklearn >= 1.0

df_encoded = pd.concat([subset_df1.drop(columns_to_onehot, axis=1),
                       pd.DataFrame(one_hot_encoded_columns, columns=column_names)],
                       axis=1)

# Standard Scaling


scaler = StandardScaler()
df_encoded[columns_to_scale] = scaler.fit_transform(df_encoded[columns_to_scale])
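At this point it is worth verifying the frame that KMeans will actually see, because the error message only tells you NaN/inf exists somewhere in the input. A minimal diagnostic sketch (the `df_encoded` here is a hypothetical stand-in, not the original data):

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the encoded/scaled result
df_encoded = pd.DataFrame({'age': [0.1, np.nan], 'price': [1.0, 2.0]})

# Count NaNs per column; any non-zero count will make KMeans raise
print(df_encoded.isna().sum())

# Also check for inf/-inf and oversized values in one pass
print(np.isfinite(df_encoded.to_numpy()).all())  # False if any NaN or inf remains
```

If the per-column counts are non-zero even though the raw data was clean, the NaNs were introduced by a transformation step rather than by the source data.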

#Finding the optimal K with Elbow Method and Silhouette score

Sum_of_squared_distances = []
silhouette_avg = []

K = range(1,10)
for k in K:
    model = KMeans(n_clusters=k, random_state=0)
    model.fit(df_encoded)
    Sum_of_squared_distances.append(model.inertia_)
    
    if k > 1:
        silhouette_avg.append(silhouette_score(df_encoded, model.labels_, metric='euclidean'))
