Hi, I have an imbalanced text dataset with around 60 output classes. One of these classes is already a combination of 240 different classes that the business clubbed together by requirement, not because they are similar in nature. The population distribution of the classes looks like:
| Class | Population |
|---|---|
| Class 1 | 56% |
| Class 2 | 16% |
| Class 3 | 12% |
| Class 4 | 8% |
| Class 5 | 6% |
| ....... | ..... |
| Class 59 | 0.06% |
I tried multiple text preprocessing approaches followed by different classification algorithms, but the highest precision/recall I achieved is 0.65/0.63.
So I want to further club similar classes using ML clustering, down to 10 unique classes, and then perform classification.
I used k-means to produce 10 clusters, with code like below:
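from sklearn.cluster import KMeans
k = KMeans(n_clusters=10, n_init=10, random_state=42)
k.fit(feature)  # feature is the vectorized text matrix; do not overwrite it with the fitted model
df['cluster'] = k.labels_
df.groupby('class')['cluster'].value_counts(normalize=True)  # share of each cluster within each class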
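The output looks like:
| Class | Cluster | Share |
|---|---|---|
| class1 | cluster_0 | 0.40 |
| class1 | cluster_6 | 0.20 |
| class1 | cluster_2 | 0.16 |
| class2 | cluster_0 | 0.46 |
| class2 | cluster_6 | 0.25 |
| class2 | cluster_2 | 0.15 |
| ... | ... | ... |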
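
To make the mapping question concrete, here is a minimal sketch of one possible approach: a majority vote that assigns each original class to the cluster holding the largest share of its documents. The toy DataFrame and the names `shares`, `class_to_cluster`, `purity`, and `merged_class` are hypothetical; only the `class` and `cluster` columns come from my setup.

```python
import pandas as pd

# Toy stand-in for the real data (hypothetical values): `class` is the
# original label, `cluster` is the k-means label of each document.
df = pd.DataFrame({
    "class":   ["class1"] * 5 + ["class2"] * 4 + ["class3"] * 3,
    "cluster": [0, 0, 0, 6, 2,   6, 6, 0, 2,   2, 2, 0],
})

# Rows = original class, columns = cluster,
# values = share of that class's documents falling in the cluster.
shares = pd.crosstab(df["class"], df["cluster"], normalize="index")

# Majority vote: each class goes to the cluster with its largest share.
class_to_cluster = shares.idxmax(axis=1)

# Share held by the winning cluster; a low value means the class is
# spread over several clusters and the vote is not very meaningful.
purity = shares.max(axis=1)

# Replace each original label by the cluster it was mapped to.
df["merged_class"] = df["class"].map(class_to_cluster)

print(pd.DataFrame({"cluster": class_to_cluster, "purity": purity}))
```

With the shares I posted, class1 and class2 would both be mapped to cluster_0, with a purity of only 0.40 and 0.46, which may be a sign that the clusters do not separate these classes well.
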
How do I map the 10 clusters to 10 unique class names, in other words, club similar classes into one? Should I increase the number of clusters, or is there another approach I should follow to reduce the number of classes? The output classes I am expecting:
| Original | Expected |
|---|---|
| Class 1 | Class1 |
| Class 2 | Class2 |
| Class 3 | Class3 |
| Class 4 | Class2 |
| Class 5 | Class1 |
| Class 6 | Class1 |
| Class 7 | Class2 |
| Class 8 | Class3 |
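
One alternative that would produce a mapping of this shape directly, sketched here under assumptions: instead of clustering individual documents, compute one centroid per original class (the mean TF-IDF vector of its documents) and cluster those centroids, so each class lands in exactly one group. The toy texts, the label names, the choice of TF-IDF, and `n_clusters=2` are all hypothetical; with the real data it would be about 60 centroids and 10 groups.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

# Toy corpus (hypothetical): four original classes, two underlying topics.
texts = [
    "invoice payment overdue", "payment invoice refund",
    "refund invoice payment late", "invoice refund payment",
    "password login reset", "login password locked",
    "reset password login account", "account login password",
]
labels = np.array([
    "billing_a", "billing_a", "billing_b", "billing_b",
    "access_a", "access_a", "access_b", "access_b",
])

# Vectorize the documents (sparse matrix, one row per document).
X = TfidfVectorizer().fit_transform(texts)

# One centroid per original class: the mean vector of its documents.
classes = np.unique(labels)
centroids = np.vstack([
    np.asarray(X[labels == c].mean(axis=0)) for c in classes
])

# L2-normalize so Euclidean k-means behaves roughly like cosine similarity.
centroids = normalize(centroids)

# Cluster the class centroids, not the documents.
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(centroids)

# Each original class is assigned to exactly one merged group.
class_to_group = {c: int(g) for c, g in zip(classes, km.labels_)}
print(class_to_group)
```

Because every class is represented by a single point, each class gets exactly one group, and rare classes weigh as much as the 56% class when the groups are formed. Whether the resulting groups are acceptable to the business would still need a manual check.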