Clustering nominal data

941 Views Asked by Binyamin Even At 31 July 2025 at 03:04

I am trying to apply a clustering algorithm to my data set. My data set is of movies, and some of the attributes are nominal. For example:

movie 1:
[
IMDB popularity: 1.02
Genre: Drama
Sub-genre: Horror
Rating: 1.23%
]

movie 2:
[
IMDB popularity: 2.08
Genre: Comedy
Sub-genre: Animation
Rating: 0.72%
]

etc. etc.

Can I apply something similar to k-means? K-means works on distances. If I simply label, for example, "Drama" as 0, "Horror" as 1, "Comedy" as 2 and "Animation" as 3, then what I am actually saying is that "Drama" is more closely related to "Horror" than to "Comedy". For this example that may be somewhat close to reality, but in the general case it is very hard to map words to numbers while preserving the real relationships between them. Is there a known algorithm that addresses this problem?


There is 1 best solution below

Gordon Linoff On 17 January 2016 at 14:22

The traditional solution in statistics to your specific problem is to code each value as a separate 0/1 indicator (dummy) variable:

IsHorror
IsComedy . . .

Then you can run k-means on the results.
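As a minimal sketch of this approach, using only the Python standard library: the helper names `one_hot` and `kmeans` are hypothetical, the k-means is a bare-bones Lloyd's algorithm rather than a tuned library implementation, and the four records simply repeat the two movies from the question. With indicator columns, any two different categories are the same distance apart, so no artificial ordering is imposed.

```python
import random

def one_hot(records, nominal_keys):
    """Replace each nominal attribute with one 0/1 indicator column per category."""
    # Categories observed for each attribute, in sorted order for reproducibility.
    categories = {k: sorted({r[k] for r in records}) for k in nominal_keys}
    columns = [f"{k}={c}" for k in nominal_keys for c in categories[k]]
    rows = [
        [1.0 if r[k] == c else 0.0 for k in nominal_keys for c in categories[k]]
        for r in records
    ]
    return columns, rows

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# The two movies from the question, each listed twice (illustrative only).
movies = [
    {"Genre": "Drama", "Sub-genre": "Horror"},
    {"Genre": "Comedy", "Sub-genre": "Animation"},
    {"Genre": "Drama", "Sub-genre": "Horror"},
    {"Genre": "Comedy", "Sub-genre": "Animation"},
]
columns, rows = one_hot(movies, ["Genre", "Sub-genre"])
labels = kmeans(rows, k=2)
print(columns)
print(labels)
```

In practice the numeric attributes (IMDB popularity, Rating) would be appended to each row as extra columns before clustering.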

I would make two comments. First, be sure that you normalize the values in some way (standardization or standardized principal components are two typical methods).
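A rough sketch of the standardization option, assuming a plain z-score per column (subtract the mean, divide by the standard deviation). The function name `standardize` and the column layout are assumptions made for illustration; the numeric values are the two movies from the question.

```python
from statistics import mean, pstdev

def standardize(rows):
    """Z-score every column so that all columns have mean 0 and unit spread."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
    sigmas = [pstdev(c) or 1.0 for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, mus, sigmas)] for row in rows]

# Assumed column order: IMDB popularity, Rating, IsDrama, IsComedy
rows = [
    [1.02, 1.23, 1.0, 0.0],  # movie 1
    [2.08, 0.72, 0.0, 1.0],  # movie 2
]
scaled = standardize(rows)
print(scaled)
```

Without a step like this, whichever column happens to have the largest numeric range tends to dominate the distances that k-means uses.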

Second, I am more fond of expectation-maximization clustering, which is a continuous (soft-assignment) variant of k-means, because it often produces better results.
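To illustrate what "continuous" means here, the following is a small sketch of EM for a Gaussian mixture with diagonal covariances, written with the standard library only. It is one common form of EM clustering, not necessarily the exact variant meant above. The function name `em_gmm`, the fixed initial means, the variance floor `min_var`, and the six sample points are all assumptions made for the example. Instead of a single label, each point receives a membership probability for every cluster.

```python
import math

def em_gmm(points, init_means, iters=50, min_var=1e-3):
    """EM for a diagonal-covariance Gaussian mixture.

    Returns (means, resp), where resp[i][j] is the probability that
    point i belongs to component j.
    """
    n, d, k = len(points), len(points[0]), len(init_means)
    means = [list(m) for m in init_means]
    variances = [[1.0] * d for _ in range(k)]
    weights = [1.0 / k] * k
    resp = []
    for _ in range(iters):
        # E-step: soft membership of every point in every component.
        resp = []
        for p in points:
            logs = []
            for j in range(k):
                ll = math.log(weights[j])
                for x, m, v in zip(p, means[j], variances[j]):
                    ll -= 0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
                logs.append(ll)
            top = max(logs)  # subtract the max for numerical stability
            probs = [math.exp(l - top) for l in logs]
            total = sum(probs)
            resp.append([q / total for q in probs])
        # M-step: re-estimate weights, means and variances from the memberships.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12  # guard against empty components
            weights[j] = nj / n
            for c in range(d):
                means[j][c] = sum(r[j] * p[c] for r, p in zip(resp, points)) / nj
                var = sum(r[j] * (p[c] - means[j][c]) ** 2
                          for r, p in zip(resp, points)) / nj
                variances[j][c] = max(min_var, var)  # floor keeps variance positive
    return means, resp

# Made-up, already-numeric data: two well-separated groups of three points.
points = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
          [3.0, 3.1], [3.2, 2.9], [2.9, 3.0]]
means, resp = em_gmm(points, init_means=[points[0], points[3]])
for p, r in zip(points, resp):
    print(p, [round(q, 3) for q in r])
```

If hard labels are needed, each point can be assigned to the component with the highest membership probability.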
