Reduce a large number of classes to a few using clustering, then perform classification


I have an imbalanced text dataset with around 60 output classes. One of these classes is already a combination of 240 different classes, grouped by the business according to requirements rather than by similarity. The class distribution looks like this:

Class Population
Class 1 56%
Class 2 16%
Class 3 12%
Class 4 8%
Class 5 6%
....... .....
Class 59 0.06%

I tried several text preprocessing approaches followed by different classification algorithms, but the highest precision/recall I achieved is 0.65/0.63.
So I want to merge similar classes further using ML clustering, down to about 10 unique classes, and then perform classification. I used k-means to produce 10 clusters, which gives output like the following:

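from sklearn.cluster import KMeans

k = KMeans(n_clusters=10, n_init=10, random_state=42)
k.fit(feature)  # fit() returns the estimator, so do not assign it back to `feature`
df['cluster'] = k.labels_
# share of each cluster within every original class
df.groupby('class')['cluster'].value_counts(normalize=True)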

The output looks like:
class1
cluster_0: 0.40
       _6: 0.20
       _2: 0.16
class2
cluster_0: 0.46
       _6: 0.25
       _2: 0.15
and so on

How do I map the 10 clusters to 10 unique class names, in other words, merge similar classes into one? Should I increase the number of clusters, or is there another approach I should follow to reduce the number of classes? The output classes I am expecting:

Original Expected
Class 1 Class1
Class 2 Class2
Class 3 Class3
Class 4 Class2
Class 5 Class1
Class 6 Class1
Class 7 Class2
Class 8 Class3
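
One possible way to get such a mapping from the per-class cluster distribution is to assign each original class to the cluster that holds the largest share of its documents. The following is only a minimal sketch under assumptions: the toy `df` and its values are made up for illustration, and the names `dist`, `class_to_group`, `purity` and `merged_class` are hypothetical, not part of my actual code.

```python
import pandas as pd

# Toy stand-in for the real data (values are invented for illustration):
# `class` is the original label, `cluster` is the k-means label per document.
df = pd.DataFrame({
    "class":   ["c1", "c1", "c1", "c2", "c2", "c2", "c3", "c3"],
    "cluster": [0,    0,    6,    6,    6,    0,    2,    2],
})

# Rows = classes, columns = clusters, values = share of the class in each cluster.
dist = pd.crosstab(df["class"], df["cluster"], normalize="index")

# Map every class to the cluster containing the largest share of its documents.
class_to_group = dist.idxmax(axis=1)

# Share of the class captured by that cluster; a low value means a weak mapping.
purity = dist.max(axis=1)

# New, coarser target label for the downstream classifier.
df["merged_class"] = df["class"].map(class_to_group)

print(class_to_group)
print(purity)
```

A mapping like this is only meaningful if the clusters actually separate the classes. In the output above, class1 and class2 both have cluster_0 as their largest cluster, so this rule would merge them, and a low `purity` value would flag classes whose documents are spread across many clusters.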