Can the total Entropy of all clusters be greater than 1, after classification?


After running k-means clustering on a dataset (with k = 3), I tried to compute the total entropy of all the clusters. The dataset contained 500 data points in total.

My classification results:

Cluster 1:
Class: neutral, Count: 64, Pr(neutral): 0.30769
Class: positive, Count: 85, Pr(positive): 0.40865
Class: negative, Count: 59, Pr(negative): 0.28365

Entropy of Cluster: 1.566429

Cluster size: 208

Cluster 2:
Class: neutral, Count: 65, Pr(neutral): 0.363128
Class: positive, Count: 36, Pr(positive): 0.2011173
Class: negative, Count: 78, Pr(negative): 0.4357541

Entropy of Cluster: 1.5182706

Cluster size: 179

Cluster 3:
Class: neutral, Count: 39, Pr(neutral): 0.345132
Class: positive, Count: 30, Pr(positive): 0.265486
Class: negative, Count: 44, Pr(negative): 0.389380

Entropy of Cluster: 1.56750289

Cluster size: 113

Total Entropy: 1.549431124 (which is > 1)

This means that the 1st cluster contains three different classes of data points (whereas a perfect cluster would contain only one class): of the 208 data points in cluster 1, 64 belong to the neutral class, 85 to the positive class, and 59 to the negative class, and similarly for the other two clusters.

I used the following formulas:

Entropy of a single cluster:

$$H(w) = -\sum_{c \in C} P(w_c) \log_2 P(w_c)$$

where $c$ is a class in the set $C$ of all classes, and $P(w_c)$ is the probability of a data point in cluster $w$ belonging to class $c$:

$$P(w_c) = \frac{|w_c|}{n_w}$$

where $|w_c|$ is the count of points of class $c$ in cluster $w$, and $n_w$ is the count of points in cluster $w$.

Total entropy of a clustering:

$$H(\Omega) = \sum_{w \in \Omega} \frac{N_w}{N} H(w)$$

where $\Omega$ is the set of clusters, $H(w)$ is a single cluster's entropy, $N_w$ is the number of points in cluster $w$, and $N$ is the total number of points.
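For reference, here is a minimal sketch (in Python; my original code is not shown here) that applies the two formulas above to the class counts listed in the question. The per-cluster entropies and the weighted total come out to the same values reported above:

```python
import math

# Class counts per cluster, taken from the question (neutral, positive, negative).
clusters = {
    "Cluster 1": [64, 85, 59],
    "Cluster 2": [65, 36, 78],
    "Cluster 3": [39, 30, 44],
}

def cluster_entropy(counts):
    """Shannon entropy (base 2) of the class distribution within one cluster."""
    n_w = sum(counts)
    return -sum((c / n_w) * math.log2(c / n_w) for c in counts if c > 0)

N = sum(sum(counts) for counts in clusters.values())  # 500 data points in total

total_entropy = 0.0
for name, counts in clusters.items():
    h_w = cluster_entropy(counts)
    n_w = sum(counts)
    total_entropy += (n_w / N) * h_w
    print(f"{name}: size={n_w}, entropy={h_w:.6f}")

print(f"Total entropy: {total_entropy:.6f}")  # ~1.5494, matching the value reported above
```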

I used the above formulas to calculate the total entropy of the clustering, and the result I got was a value > 1. I thought entropies are supposed to lie between 0 and 1, yet I got something > 1, and I could not find my mistake. Was my calculation wrong (I believe I applied the formulas as intended), or did I miss something in the formulas? (You can check the results with a manual calculation yourselves.)


There is 1 answer below.

Answered by fucalost:

You're using Shannon Entropy, which measures uncertainty across a categorical distribution.

Because you have three classes, the maximum possible entropy is log2(3) ≈ 1.585; entropy is only bounded by 1 when there are at most two classes, so a value greater than 1 is expected here.
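A small sketch illustrating this (in Python; the normalisation step is a common convention, not part of the formulas in the question):

```python
import math

k = 3  # number of classes: neutral, positive, negative

# The maximum Shannon entropy (base 2) occurs for a uniform distribution over k classes.
max_entropy = math.log2(k)
print(max_entropy)  # ~1.585

# If a value in [0, 1] is wanted, divide by log2(k) (often called normalised entropy).
total_entropy = 1.549431124  # total entropy from the question
print(total_entropy / max_entropy)  # ~0.98
```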