Perform K-means Clustering in RapidMiner - Dealing with Categorical Variables

46 Views Asked by At

I have a table customer_segmentation where each entry corresponds to a unique customer, and the structure of the table is as follows:

customer_id: (integer) Unique key to identify each customer

gender: (varchar(10)) Values are Male or Female, with some entries missing

age: (varchar(255)) Values are in range format like ]45;55] or ]55;65]

family_size: (tinyint) Values are integers (0, 1, 2, etc.)

lifestyle: (varchar(50)) Values are 'Quality', 'Family', 'No_Value', and 'Price'

lifestage: (varchar(50)) Values are 'Active Adults', 'Family with kids', 'Senior', 'Family Supporters', 'Family w/ Young Adults', and 'No_Value'

total_gross_sales: (decimal(10,2)) Related to total gross sales per customer

total_qty: (int) Related to total quantity bought per customer

average_order_value: (decimal(10,2)) Related to average order value per customer

purchase_frequency: (int) Related to the number of shopping trips per customer

last_purchase_date: (char(8)) Related to the last date a customer made a purchase

average_quantity_purchased: (decimal(10,2)) Related to average quantity purchased per customer

Sample from the table:

Note: There are missing values as you can see in the 3rd row and values like family_size=0 which doesn't seem correct, but this was the dataset I was asked to work on.

customer_id gender age family_size lifestyle lifestage total_gross_sales total_qty average_order_value purchase_frequency last_purchase_date average_quantity_purchased
174545 F ]45;55] 4 Qualidade Active Adults 61.31 28 2.67 2 20221115 1.21
179359 M ]25;35] 3 Família Family with Kids 554.74 316 2.23 12 20221230 1.27
184627 Sem Valor 0 Qualidade Active Adults 41.21 13 3.17 2 20221128 0.96

I am looking to perform k-means clustering on this dataset using RapidMiner software. I built this workflow but I get very strange results for the performance evaluator:

enter image description here

Results:

Avg. within centroid distance_cluster_0: -18.171

Avg. within centroid distance_cluster_1: -21.879

Avg. within centroid distance_cluster_2: -27.797

Avg. within centroid distance_cluster_3: -15.743

Avg. within centroid distance_cluster_4: -19.461

Changing the K doesn't seem to reduce the centroid distance. To choose the K, using R programming language, I've found the inflection point using the elbow method, but I'm not sure if I can choose only the numeric variables and then on the clustering consider also the nominal variables, but converting them to numeric.

selecting numerical columns for kmeans clustering:

df_cluster <- customer_seg[c("total_gross_sales", 
                              "total_qty", "average_order_value", "purchase_frequency", 
                              "average_quantity_purchased")]

removing any NA values if present:

df_cluster <- na.omit(df_cluster)

standardizing the data:

df_cluster <- scale(df_cluster)

Finding the optimal number of clusters:

wss <- (nrow(df_cluster)-1)*sum(apply(df_cluster,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(df_cluster, 
                                      centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares", main = "Elbow method for determining number of clusters")
0

There are 0 best solutions below