Clustering nominal data

I am trying to apply a clustering algorithm to my data set. The data set describes movies, and some of the attributes are nominal. For example:

movie 1:
[
IMDB popularity: 1.02
Genre: Drama
Sub-genre: Horror
Rating: 1.23%
]

movie 2:
[
IMDB popularity: 2.08
Genre: Comedy
Sub-genre: Animation
Rating: 0.72%
]

and so on.

Can I apply something similar to k-means? K-means works on distances. If I simply label, for example, "Drama" as 0, "Horror" as 1, "Comedy" as 2 and "Animation" as 3, then what I'm actually saying is that "Drama" is more closely related to "Horror" than to "Comedy". For this example that may be somewhat close to reality, but in the general case it is very hard to map words to numbers while preserving the real relationships between them. Is there a known algorithm that addresses this problem?

There is 1 best solution below

The traditional solution in statistics to your specific problem is to code each nominal value as its own binary (0/1) indicator variable:

  • IsHorror
  • IsComedy
  • ...

Then you can run k-means on the results.
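The indicator coding described above can be sketched in plain Python. This is only an illustration under stated assumptions: the field names (`popularity`, `genre`, `sub_genre`, `rating`), the `field:value` column naming, and the helper `one_hot_encode` are hypothetical, and the two records are modeled on the question's example movies with the rating's percent sign dropped.

```python
import math

# Hypothetical records modeled on the two example movies in the question.
movies = [
    {"popularity": 1.02, "genre": "Drama", "sub_genre": "Horror", "rating": 1.23},
    {"popularity": 2.08, "genre": "Comedy", "sub_genre": "Animation", "rating": 0.72},
]


def one_hot_encode(records, nominal_fields, numeric_fields):
    """Return (column_names, rows) with one 0/1 column per nominal value."""
    # Collect the sorted set of categories observed for each nominal field.
    categories = {
        field: sorted({rec[field] for rec in records}) for field in nominal_fields
    }

    # Numeric columns come first, followed by one indicator column per category.
    columns = list(numeric_fields)
    for field in nominal_fields:
        columns += [f"{field}:{value}" for value in categories[field]]

    rows = []
    for rec in records:
        row = [float(rec[field]) for field in numeric_fields]
        for field in nominal_fields:
            row += [1.0 if rec[field] == value else 0.0 for value in categories[field]]
        rows.append(row)
    return columns, rows


columns, rows = one_hot_encode(movies, ["genre", "sub_genre"], ["popularity", "rating"])

# Distance between the two movies using only the genre indicator columns.
genre_cols = [i for i, name in enumerate(columns) if name.startswith("genre:")]
genre_distance = math.dist(
    [rows[0][i] for i in genre_cols], [rows[1][i] for i in genre_cols]
)
```

With this coding, two movies with different genres differ in exactly two genre indicator columns, so every pair of distinct genres is the same distance apart (the square root of 2). That removes the artificial ordering that integer labels such as 0, 1, 2, 3 would impose.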

I would make two comments. First, be sure that you normalize the values in some way, so that no single variable dominates the distance calculation (standardization or standardized principal components are two typical methods).
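As a rough sketch of the standardization option, each column can be rescaled to zero mean and unit standard deviation before clustering. The helper name `standardize` and the sample rows are assumptions made for illustration, and treating a constant column as all zeros is one possible convention, not something the answer prescribes.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Rescale every column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        # A constant column has zero spread; map it to 0.0 to avoid dividing by zero.
        [(value - m) / s if s > 0 else 0.0 for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical rows: two numeric columns on very different scales plus a constant column.
scaled = standardize([[1.0, 10.0, 7.0], [3.0, 30.0, 7.0]])
```

After this step the first two columns contribute equally to a Euclidean distance, even though one originally had ten times the range of the other.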

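Second, I am more fond of expectation-maximization (EM) clustering, which can be seen as a soft variant of k-means: each point gets a continuous membership weight for every cluster instead of a single hard assignment. In my experience it often produces better results.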
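To make the comparison with k-means concrete, here is a minimal sketch of EM for a one-dimensional Gaussian mixture. It is an illustration only, not a production implementation: the function names, the starting means, and the toy data are all assumptions, and the code works with raw densities rather than log densities, so it has no protection against numerical underflow on widely spread data.

```python
import math


def gaussian_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var at x."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, initial_means, n_iter=50, min_var=1e-6):
    """Fit a 1-D Gaussian mixture by EM; returns (weights, means, variances, resp)."""
    k = len(initial_means)
    means = list(initial_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    resp = []

    for _ in range(n_iter):
        # E-step: soft membership of every point in every cluster.
        resp = []
        for x in data:
            dens = [
                w * gaussian_pdf(x, m, v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = sum(dens)
            resp.append([d / total for d in dens])

        # M-step: re-estimate each cluster from its weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, min_var)  # floor keeps the variance positive

    return weights, means, variances, resp


# Hypothetical toy data with two well-separated groups.
data = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2]
weights, means, variances, resp = em_gmm_1d(data, initial_means=[0.0, 6.0])
```

Each entry of `resp` holds one point's membership weights, which sum to 1 across clusters. K-means would instead assign each point entirely to its nearest center.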