machine learning, nominal data normalization


I am working on k-means clustering. I have a 3D dataset with the attributes number of days, frequency, and food.

-> Days is normalized by mean and standard deviation (SD), or better to say standardized, which gives me a range of [-2, 14].

-> Frequency and food, which are NOMINAL data in my dataset, are normalized by dividing by the maximum ( x/max(x) ), which gives me a range of [0, 1].
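For reference, the two scalings described above can be sketched as follows. This is only an illustration: the sample rows are made up, and the variable names are assumptions, not taken from the actual dataset.

```python
import numpy as np

# Hypothetical sample rows: (days, frequency, food). The values are made up.
data = np.array([
    [1.0, 3.0, 1.0],
    [2.0, 12.0, 2.0],
    [3.0, 36.0, 3.0],
    [60.0, 7.0, 1.0],
])

days, frequency, food = data[:, 0], data[:, 1], data[:, 2]

# Standardization (z-score): subtract the mean, divide by the standard deviation.
days_std = (days - days.mean()) / days.std()

# Divide-by-max scaling: maps positive values into (0, 1].
frequency_scaled = frequency / frequency.max()
food_scaled = food / food.max()

scaled = np.column_stack([days_std, frequency_scaled, food_scaled])

# Per-column spread (max - min): a rough indication of how much each
# axis can contribute to Euclidean distances.
spread = scaled.max(axis=0) - scaled.min(axis=0)
print(spread)
```

With data like this, the standardized days column spans a wider interval than the two divide-by-max columns, which can never span more than 1.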

The problem is that k-means only considers the day axis for grouping, since there are obvious gaps between points along this axis, and it almost ignores the other two, frequency and food (I think because the gaps in the frequency and food dimensions are negligible).

If I apply k-means to the day axis alone (1D), I get exactly the same result as when I apply it in 3D (days, frequency, food).

Before this, I also tried x/max(x) for days, but the result was not acceptable.

So I want to know: is there any way to normalize the other two nominal attributes, frequency and food, so that they are scaled fairly relative to the day axis?

food => 1, 2, 3
frequency => 1-36


There are 2 best solutions below

0
On BEST ANSWER

The point of normalization is not just to get the values small.

The purpose is to have comparable value ranges - something which is really hard for attributes of different units, and may well be impossible for nominal data.

For your kind of data, k-means is probably the worst choice, because k-means relies on continuous values to work. If you have nominal values, it usually gets stuck easily. So my main recommendation is to not use k-means.

For k-means to work on your data, a difference of 1 must mean the same thing in every attribute. So a difference of 1 day = the difference between food 1 and food 2. And because k-means is based on squared errors, the difference between food 1 and food 3 counts 4x as much as the difference between food 1 and food 2.
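The 4x factor comes straight from squaring the coded values. A tiny sketch, assuming the foods are encoded as the integers 1, 2, 3 as in the question (the helper name `squared_diff` is made up for illustration):

```python
def squared_diff(a, b):
    """Squared difference of one attribute, as used in the k-means objective."""
    return (a - b) ** 2

food_1_to_2 = squared_diff(1, 2)  # (1 - 2)^2
food_1_to_3 = squared_diff(1, 3)  # (1 - 3)^2

# How much more the 1 -> 3 gap costs than the 1 -> 2 gap.
ratio = food_1_to_3 / food_1_to_2
print(food_1_to_2, food_1_to_3, ratio)
```

So the integer coding silently claims that food 3 is four times "further" from food 1 than food 2 is, which is only meaningful if the codes carry a real order and spacing.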

Unless your data has the above property, don't use k-means.

0
On

You can try using the Value Difference Metric, VDM (or any variant), to convert pretty much every nominal attribute you encounter to a valid numeric representation. After that, you can just apply standardisation to the whole dataset as usual.
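As a rough sketch of the idea: VDM measures how differently two nominal values are distributed over the output classes, summing |P(c|x) - P(c|y)|^q over the classes c. That means it needs class labels, which a pure clustering dataset does not have, so the `label` column below is entirely hypothetical, as are the function name `vdm_table` and the sample data. Note that this yields pairwise distances between values rather than a single numeric coordinate per value.

```python
from collections import Counter, defaultdict

def vdm_table(values, labels, q=2):
    """Pairwise VDM distances between the distinct values of one nominal attribute.

    vdm(x, y) = sum over classes c of |P(c | x) - P(c | y)| ** q
    """
    count_value = Counter(values)
    count_value_class = defaultdict(Counter)
    for v, c in zip(values, labels):
        count_value_class[v][c] += 1

    classes = sorted(set(labels))
    table = {}
    for x in count_value:
        for y in count_value:
            table[(x, y)] = sum(
                abs(count_value_class[x][c] / count_value[x]
                    - count_value_class[y][c] / count_value[y]) ** q
                for c in classes
            )
    return table

# Hypothetical data: a food code per row plus a made-up class label per row.
food = [1, 1, 2, 2, 3, 3]
label = ["a", "a", "a", "b", "b", "b"]

table = vdm_table(food, label)
for pair in [(1, 2), (1, 3), (2, 3)]:
    print(pair, table[pair])
```

In this toy data, food 1 and food 3 come out furthest apart because they never share a class, while food 2 sits between them; the spacing is driven by the labels rather than by the arbitrary codes 1, 2, 3.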

The original definition is here:

http://axon.cs.byu.edu/~randy/jair/wilson1.html

It should also be easy to find implementations for every common language elsewhere.

N.B. for ordered nominal attributes such as your 'frequency' most of the time it is enough to just represent them as integers.