I have a table customer_segmentation where each entry corresponds to a unique customer, and the structure of the table is as follows:
customer_id: (integer) Unique key to identify each customer
gender: (varchar(10)) Values are Male or Female, with some entries missing
age: (varchar(255)) Values are in range format like ]45;55] or ]55;65]
family_size: (tinyint) Values are integers (0, 1, 2, etc.)
lifestyle: (varchar(50)) Values are 'Quality', 'Family', 'No_Value', and 'Price'
lifestage: (varchar(50)) Values are 'Active Adults', 'Family with kids', 'Senior', 'Family Supporters', 'Family w/ Young Adults', and 'No_Value'
total_gross_sales: (decimal(10,2)) Related to total gross sales per customer
total_qty: (int) Related to total quantity bought per customer
average_order_value: (decimal(10,2)) Related to average order value per customer
purchase_frequency: (int) Related to the number of shopping trips per customer
last_purchase_date: (char(8)) Related to the last date a customer made a purchase
average_quantity_purchased: (decimal(10,2)) Related to average quantity purchased per customer
Sample from the table:
Note: There are missing values as you can see in the 3rd row and values like family_size=0 which doesn't seem correct, but this was the dataset I was asked to work on.
| customer_id | gender | age | family_size | lifestyle | lifestage | total_gross_sales | total_qty | average_order_value | purchase_frequency | last_purchase_date | average_quantity_purchased |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 174545 | F | ]45;55] | 4 | Qualidade | Active Adults | 61.31 | 28 | 2.67 | 2 | 20221115 | 1.21 |
| 179359 | M | ]25;35] | 3 | Família | Family with Kids | 554.74 | 316 | 2.23 | 12 | 20221230 | 1.27 |
| 184627 | Sem Valor | 0 | Qualidade | Active Adults | 41.21 | 13 | 3.17 | 2 | 20221128 | 0.96 |
I am looking to perform k-means clustering on this dataset using RapidMiner software. I built this workflow but I get very strange results for the performance evaluator:
Results:
Avg. within centroid distance_cluster_0: -18.171
Avg. within centroid distance_cluster_1: -21.879
Avg. within centroid distance_cluster_2: -27.797
Avg. within centroid distance_cluster_3: -15.743
Avg. within centroid distance_cluster_4: -19.461
Changing the K doesn't seem to reduce the centroid distance. To choose the K, using R programming language, I've found the inflection point using the elbow method, but I'm not sure if I can choose only the numeric variables and then on the clustering consider also the nominal variables, but converting them to numeric.
selecting numerical columns for kmeans clustering:
df_cluster <- customer_seg[c("total_gross_sales",
"total_qty", "average_order_value", "purchase_frequency",
"average_quantity_purchased")]
removing any NA values if present:
df_cluster <- na.omit(df_cluster)
standardizing the data:
df_cluster <- scale(df_cluster)
Finding the optimal number of clusters:
wss <- (nrow(df_cluster)-1)*sum(apply(df_cluster,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(df_cluster,
centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares", main = "Elbow method for determining number of clusters")
