Perform K-means Clustering in RapidMiner - Dealing with Categorical Variables

46 Views Asked by Tiago At 10 June 2023 at 15:32

I have a table customer_segmentation where each entry corresponds to a unique customer, and the structure of the table is as follows:

customer_id: (integer) Unique key to identify each customer

gender: (varchar(10)) Values are Male or Female, with some entries missing

age: (varchar(255)) Values are in range format like ]45;55] or ]55;65]

family_size: (tinyint) Values are integers (0, 1, 2, etc.)

lifestyle: (varchar(50)) Values are 'Quality', 'Family', 'No_Value', and 'Price'

lifestage: (varchar(50)) Values are 'Active Adults', 'Family with kids', 'Senior', 'Family Supporters', 'Family w/ Young Adults', and 'No_Value'

total_gross_sales: (decimal(10,2)) Related to total gross sales per customer

total_qty: (int) Related to total quantity bought per customer

average_order_value: (decimal(10,2)) Related to average order value per customer

purchase_frequency: (int) Related to the number of shopping trips per customer

last_purchase_date: (char(8)) Related to the last date a customer made a purchase

average_quantity_purchased: (decimal(10,2)) Related to average quantity purchased per customer

Sample from the table:

Note: There are missing values as you can see in the 3rd row and values like family_size=0 which doesn't seem correct, but this was the dataset I was asked to work on.

customer_id	gender	age	family_size	lifestyle	lifestage	total_gross_sales	total_qty	average_order_value	purchase_frequency	last_purchase_date	average_quantity_purchased
174545	F	]45;55]	4	Qualidade	Active Adults	61.31	28	2.67	2	20221115	1.21
179359	M	]25;35]	3	Família	Family with Kids	554.74	316	2.23	12	20221230	1.27
184627		Sem Valor	0	Qualidade	Active Adults	41.21	13	3.17	2	20221128	0.96

I am looking to perform k-means clustering on this dataset using RapidMiner software. I built this workflow but I get very strange results for the performance evaluator:

Results:

Avg. within centroid distance_cluster_0: -18.171

Avg. within centroid distance_cluster_1: -21.879

Avg. within centroid distance_cluster_2: -27.797

Avg. within centroid distance_cluster_3: -15.743

Avg. within centroid distance_cluster_4: -19.461

Changing the K doesn't seem to reduce the centroid distance. To choose the K, using R programming language, I've found the inflection point using the elbow method, but I'm not sure if I can choose only the numeric variables and then on the clustering consider also the nominal variables, but converting them to numeric.

selecting numerical columns for kmeans clustering:

df_cluster <- customer_seg[c("total_gross_sales", 
                              "total_qty", "average_order_value", "purchase_frequency", 
                              "average_quantity_purchased")]

removing any NA values if present:

df_cluster <- na.omit(df_cluster)

standardizing the data:

df_cluster <- scale(df_cluster)

Finding the optimal number of clusters:

wss <- (nrow(df_cluster)-1)*sum(apply(df_cluster,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(df_cluster, 
                                      centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares", main = "Elbow method for determining number of clusters")

Original Q&A

Perform K-means Clustering in RapidMiner - Dealing with Categorical Variables

There are 0 best solutions below

Related Questions in R

Related Questions in CLUSTER-ANALYSIS

Related Questions in K-MEANS

Related Questions in DATA-MINING

Related Questions in RAPIDMINER

Trending Questions

Popular # Hahtags

Popular Questions